Data analysis and preprocessing

Now, let's move on to the actual implementation, where we need to load the data. Keras actually has a function that can load this sentiment dataset from IMDb, but the problem is that it has already mapped all the words to integer tokens. Because this mapping is such an essential part of working with natural human language inside neural networks, I really want to show you how to do it.
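For reference, that pre-tokenized version can be loaded in a single call. Here is a minimal sketch (not used in the rest of this chapter), assuming a TensorFlow build where the tensorflow.python.keras path is available, as in the imports below:

# Reference only: Keras ships the IMDb reviews already mapped to integer tokens,
# so there is no raw text left to tokenize yourself.
from tensorflow.python.keras.datasets import imdb as keras_imdb

(x_train, y_train), (x_test, y_test) = keras_imdb.load_data(num_words=10000)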

Also, if you want to use this code for sentiment analysis of whatever data you might have in some other language, you will need to do this yourself, so we have just quickly implemented some functions for downloading this dataset.

Let's start off by importing a bunch of required packages:

%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, GRU, Embedding
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

And then we load the dataset:

import imdb
imdb.maybe_download_and_extract()

Output:
- Download progress: 100.0%
Download finished. Extracting files.
Done.
Next, we load the training and testing sets and print their sizes:

input_text_train, target_train = imdb.load_data(train=True)
input_text_test, target_test = imdb.load_data(train=False)
print("Size of the training set: ", len(input_text_train))
print("Size of the testing set: ", len(input_text_test))

Output:
Size of the training set: 25000
Size of the testing set: 25000

As you can see, it has 25,000 texts in the training set and in the testing set.
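As a quick sanity check (a small addition to the original walkthrough), we can also confirm that the labels are roughly balanced between positive and negative reviews:

# The targets are 0.0 (negative) or 1.0 (positive), so the mean is simply the
# fraction of positive reviews; for IMDb this should be close to 0.5.
print("Fraction of positive reviews: ", np.mean(target_train))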

Let's just see one example from the training set and how it looks:

# Combine the training and testing texts for building the vocabulary later
text_data = input_text_train + input_text_test

input_text_train[1]

Output:
'This is a really heart-warming family movie. It has absolutely brilliant animal training and "acting" (if you can call it like that) as well (just think about the dog in "How the Grinch stole Christmas"... it was plain bad training). The Paulie story is extremely well done, well reproduced and in general the characters are really elaborated too. Not more to say except that this is a GREAT MOVIE!<br /><br />My ratings: story 8.5/10, acting 7.5/10, animals+fx 8.5/10, cinematography 8/10.<br /><br />My overall rating: 8/10 - BIG FAMILY MOVIE AND VERY WORTH WATCHING!'

target_train[1]

Output:
1.0

This is a fairly short one and the sentiment value is 1.0, which means it is a positive sentiment, so this is a positive review of whatever movie this was about.

Now, we get to the tokenizer, and this is the first step in processing the raw data, because the neural network cannot work on text data directly. Keras has implemented what is called a tokenizer for building a vocabulary and mapping words to integers.

We can also say that we want a maximum of 10,000 words, so the tokenizer will use only the 10,000 most frequent words from the dataset:

num_top_words = 10000
tokenizer_obj = Tokenizer(num_words=num_top_words)

Now, we take all the text from the dataset and call the fit_on_texts function on it:

tokenizer_obj.fit_on_texts(text_data)

Fitting the tokenizer takes about 10 seconds, and then it will have built the vocabulary. It looks like this:

tokenizer_obj.word_index

Output:
{'britains': 33206,
'labcoats': 121364,
'steeled': 102939,
'geddon': 67551,
"rossilini's": 91757,
'recreational': 27654,
'suffices': 43205,
'hallelujah': 30337,
'mallika': 30343,
'kilogram': 122493,
'elphic': 104809,
'feebly': 32818,
'unskillful': 91728,
"'mistress'": 122218,
"yesterday's": 25908,
'busco': 85664,
'goobacks': 85670,
'mcfeast': 71175,
'tamsin': 77763,
"petron's": 72628,
"'lion": 87485,
'sams': 58341,
'unbidden': 60042,
"principal's": 44902,
'minutiae': 31453,
'smelled': 35009,
'historyx97but': 75538,
'vehemently': 28626,
'leering': 14905,
'kýnay': 107654,
'intendend': 101260,
'chomping': 21885,
'nietsze': 76308,
'browned': 83646,
'grosse': 17645,
"''gaslight''": 74713,
'forseeing': 103637,
'asteroids': 30997,
'peevish': 49633,
"attic'": 120936,
'genres': 4026,
'breckinridge': 17499,
'wrist': 13996,
"sopranos'": 50345,
'embarasing': 92679,
"wednesday's": 118413,
'cervi': 39092,
'felicity': 21570,
"''horror''": 56254,
'alarms': 17764,
"'ol": 29410,
'leper': 27793,
'oncex85': 100641,
'iverson': 66834,
'triply': 117589,
'industries': 19176,
'brite': 16733,
'amateur': 2459,
"libby's": 46942,
'eeeeevil': 120413,
'jbc33': 51111,
'wyoming': 12030,
'waned': 30059,
'uchida': 63203,
'uttter': 93299,
'irector': 123847,
'outriders': 95156,
'perd': 118465,
.
.
.}

So, each word is now associated with an integer; therefore, the word the has number 1:

tokenizer_obj.word_index['the']

Output:
1

Here, and has number 2:

tokenizer_obj.word_index['and']

Output:
2

The word a has 3:

tokenizer_obj.word_index['a']

Output:
3

And so on. We see that movie has number 17:

tokenizer_obj.word_index['movie']

Output:
17

And film has number 19:

tokenizer_obj.word_index['film']

Output:
19

What all this means is that the was the most used word in the dataset and and was the second most used. So, whenever we want to map words to integer tokens, we will get these numbers.
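To see this ordering directly, we can sort the vocabulary by its integer index (a small snippet added here for illustration; it is not part of the original code):

# word_index maps each word to an integer in order of frequency (1 = most frequent),
# so sorting by the integer gives the most common words first.
top_words = sorted(tokenizer_obj.word_index.items(), key=lambda item: item[1])[:10]
print(top_words)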

Let's take word number 743 as an example; this is the word romantic:

tokenizer_obj.word_index['romantic']

Output:
743

So, whenever we see the word romantic in the input text, we map it to the integer token 743. We now use the tokenizer to convert all the texts in the training set into sequences of integer tokens:

input_train_tokens = tokenizer_obj.texts_to_sequences(input_text_train)

For example, recall the training text we looked at earlier:

input_text_train[1]

Output:
'This is a really heart-warming family movie. It has absolutely brilliant animal training and "acting" (if you can call it like that) as well (just think about the dog in "How the Grinch stole Christmas"... it was plain bad training). The Paulie story is extremely well done, well reproduced and in general the characters are really elaborated too. Not more to say except that this is a GREAT MOVIE!<br /><br />My ratings: story 8.5/10, acting 7.5/10, animals+fx 8.5/10, cinematography 8/10.<br /><br />My overall rating: 8/10 - BIG FAMILY MOVIE AND VERY WORTH WATCHING!'

When we convert that text to integer tokens, it becomes an array of integers:

np.array(input_train_tokens[1])

Output:
array([ 11, 6, 3, 62, 488, 4679, 236, 17, 9, 45, 419,
513, 1717, 2425, 2, 113, 43, 22, 67, 654, 9, 37,
12, 14, 69, 39, 101, 42, 1, 826, 8, 85, 1,
6418, 3492, 1156, 9, 13, 1042, 74, 2425, 1, 6419, 64,
6, 568, 69, 221, 69, 2, 8, 825, 1, 102, 23,
62, 96, 21, 51, 5, 131, 556, 12, 11, 6, 3,
78, 17, 7, 7, 56, 2818, 64, 723, 447, 156, 113,
702, 447, 156, 1598, 3611, 723, 447, 156, 633, 723, 156,
7, 7, 56, 437, 670, 723, 156, 191, 236, 17, 2,
52, 278, 147])

So, the word this becomes the number 11, the word is becomes the number 6, and so forth.

We also need to convert the texts in the testing set:

input_test_tokens = tokenizer_obj.texts_to_sequences(input_text_test)

Now, there's another problem: the sequences of tokens have different lengths, depending on the length of the original text. Even though the recurrent units can work with sequences of arbitrary length, the way TensorFlow works is that all of the data in a batch needs to have the same length.

So, we can either ensure that all sequences in the entire dataset have the same length, or write a custom data generator that ensures that the sequences in a single batch have the same length. It is a lot simpler to ensure that all the sequences in the dataset have the same length, but the problem is that there are some outliers: some reviews are more than 2,200 words long. It would waste a lot of memory to pad all the short reviews up to that length. So what we will do instead is make a compromise. First, we count the number of tokens in each of the input sequences. What we see is that the average number of tokens in a sequence is about 221:

total_num_tokens = [len(tokens) for tokens in input_train_tokens + input_test_tokens]
total_num_tokens = np.array(total_num_tokens)

#Get the average number of tokens
np.mean(total_num_tokens)

Output:
221.27716

And we see that the maximum number of words is more than 2,200:

np.max(total_num_tokens)

Output:
2208

Now, there's a huge difference between the average and the max, and again we would be wasting a lot of memory if we just padded all the sentences in the dataset so that they would all have 2208 tokens. This would especially be a problem if you have a dataset with millions of sequences of text. 
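One way to see this long tail (a small addition that reuses the matplotlib import from earlier) is to plot a histogram of the sequence lengths:

# Most reviews are a few hundred tokens long, with a thin tail of very long ones.
plt.hist(total_num_tokens, bins=100)
plt.xlabel("Number of tokens per review")
plt.ylabel("Number of reviews")
plt.show()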

So the compromise we will make is to pad all sequences, and truncate the ones that are too long, so that they all have 544 tokens. We calculated this limit by taking the average number of tokens over all the sequences in the dataset and adding two standard deviations:

max_num_tokens = np.mean(total_num_tokens) + 2 * np.std(total_num_tokens)
max_num_tokens = int(max_num_tokens)
max_num_tokens

Output:
544

What do we get out of this? We cover about 95% of the texts in the dataset, so only about 5% are longer than 544 tokens:

np.sum(total_num_tokens < max_num_tokens) / len(total_num_tokens)

Output:
0.94532

Now, we call the pad_sequences function from Keras. It will either pad the sequences that are too short (by adding zeros) or truncate the sequences that are too long (by cutting off some of the tokens).

Now, there's an important point here: we can do this padding and truncating in pre or post mode. So imagine we have a sequence of integer tokens that we want to pad because it's too short. We can:

  • Either pad with zeros at the beginning, so that the actual integer tokens come at the end.
  • Or do it the opposite way, so that all the data is at the beginning and all the zeros are at the end. But if we go back and look at the preceding RNN flowchart, remember that it processes the sequence one step at a time. If it starts by processing zeros, they will probably not mean anything, and the internal state will probably just remain at zero. So, whenever it finally sees the integer token for an actual word, it knows that the real data is starting.

However, if all the zeros were at the end, we would first process all the data and build up some internal state inside the recurrent unit; the long run of zeros that follows might then wipe out the internal state that we have just calculated. This is why it might be a good idea to pad with zeros at the beginning.
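To make the difference concrete, here is a tiny toy example (added for illustration) of pre versus post padding using the same pad_sequences function we imported earlier:

# A single short sequence padded to length 6 in both modes.
toy = [[11, 6, 3]]
print(pad_sequences(toy, maxlen=6, padding='pre'))   # [[ 0  0  0 11  6  3]]
print(pad_sequences(toy, maxlen=6, padding='post'))  # [[11  6  3  0  0  0]]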

The other problem is truncation: if the text is very long, we truncate it so that it fits into 544 tokens (or whatever the limit was). Now, imagine we cut the text right in the middle of a sentence that says this is a very good movie or this is not. Of course, we only do this for very long sequences, but it is still possible that we lose information that is essential for properly classifying the text. So truncating the input text is a compromise we are making. A better way would be to create batches and pad the texts only within each batch: when we see a very long sequence, we pad the other sequences in that batch to the same length. That way, we don't need to store a lot of padding for the entire dataset in memory, where most of it would be wasted.
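A minimal sketch of that per-batch idea could look like the following. This generator is only an illustration under the assumption that we feed it the token lists and labels from above; the names batch_generator and batch_size are made up here, and we do not use it in the rest of the chapter:

def batch_generator(token_sequences, labels, batch_size=64):
    # Yield batches where sequences are padded only to the longest
    # sequence in that batch, instead of the longest in the whole dataset.
    num_samples = len(token_sequences)
    while True:
        for start in range(0, num_samples, batch_size):
            batch_tokens = token_sequences[start:start + batch_size]
            batch_labels = np.array(labels[start:start + batch_size])
            # With maxlen=None, pad_sequences pads to this batch's maximum length.
            batch_padded = pad_sequences(batch_tokens, padding='pre')
            yield batch_padded, batch_labels

Such a generator could then be passed to the model's fit_generator method, but in this chapter we keep the simpler fixed-length padding.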

Let's go back and convert the entire dataset so that it is truncated and padded into one big matrix of data:

seq_pad = 'pre'

input_train_pad = pad_sequences(input_train_tokens, maxlen=max_num_tokens,
                                padding=seq_pad, truncating=seq_pad)

input_test_pad = pad_sequences(input_test_tokens, maxlen=max_num_tokens,
                               padding=seq_pad, truncating=seq_pad)

We check the shape of this matrix:

input_train_pad.shape

Output:
(25000, 544)

input_test_pad.shape

Output:
(25000, 544)

So, let's have a look at the tokens of a specific sample before and after padding:

np.array(input_train_tokens[1])

Output:
array([ 11, 6, 3, 62, 488, 4679, 236, 17, 9, 45, 419,
513, 1717, 2425, 2, 113, 43, 22, 67, 654, 9, 37,
12, 14, 69, 39, 101, 42, 1, 826, 8, 85, 1,
6418, 3492, 1156, 9, 13, 1042, 74, 2425, 1, 6419, 64,
6, 568, 69, 221, 69, 2, 8, 825, 1, 102, 23,
62, 96, 21, 51, 5, 131, 556, 12, 11, 6, 3,
78, 17, 7, 7, 56, 2818, 64, 723, 447, 156, 113,
702, 447, 156, 1598, 3611, 723, 447, 156, 633, 723, 156,
7, 7, 56, 437, 670, 723, 156, 191, 236, 17, 2,
52, 278, 147])

And after padding, this sample will look like the following:

input_train_pad[1]

Output:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 11, 6, 3, 62, 488, 4679, 236, 17, 9,
45, 419, 513, 1717, 2425, 2, 113, 43, 22, 67, 654,
9, 37, 12, 14, 69, 39, 101, 42, 1, 826, 8,
85, 1, 6418, 3492, 1156, 9, 13, 1042, 74, 2425, 1,
6419, 64, 6, 568, 69, 221, 69, 2, 8, 825, 1,
102, 23, 62, 96, 21, 51, 5, 131, 556, 12, 11,
6, 3, 78, 17, 7, 7, 56, 2818, 64, 723, 447,
156, 113, 702, 447, 156, 1598, 3611, 723, 447, 156, 633,
723, 156, 7, 7, 56, 437, 670, 723, 156, 191, 236,
17, 2, 52, 278, 147], dtype=int32)

Also, we need a way to map backwards, from integer tokens back to text words; we will need that shortly. It's a very simple helper function, so let's go ahead and implement it:

index = tokenizer_obj.word_index
index_inverse_map = dict(zip(index.values(), index.keys()))

def convert_tokens_to_string(input_tokens):
    # Convert the tokens back to words, skipping the zero padding
    input_words = [index_inverse_map[token] for token in input_tokens if token != 0]

    # Join the words into a single string
    combined_text = " ".join(input_words)

    return combined_text

Now, for example, the original text in the dataset is like this:

input_text_train[1]

Output:
'This is a really heart-warming family movie. It has absolutely brilliant animal training and "acting" (if you can call it like that) as well (just think about the dog in "How the Grinch stole Christmas"... it was plain bad training). The Paulie story is extremely well done, well reproduced and in general the characters are really elaborated too. Not more to say except that this is a GREAT MOVIE!<br /><br />My ratings: story 8.5/10, acting 7.5/10, animals+fx 8.5/10, cinematography 8/10.<br /><br />My overall rating: 8/10 - BIG FAMILY MOVIE AND VERY WORTH WATCHING!'

If we use a helper function to convert the tokens back to text words, we get this text:

convert_tokens_to_string(input_train_tokens[1])

'this is a really heart warming family movie it has absolutely brilliant animal training and acting if you can call it like that as well just think about the dog in how the grinch stole christmas it was plain bad training the paulie story is extremely well done well and in general the characters are really too not more to say except that this is a great movie br br my ratings story 8 5 10 acting 7 5 10 animals fx 8 5 10 cinematography 8 10 br br my overall rating 8 10 big family movie and very worth watching'

It's basically the same, except that punctuation and capitalization are gone, and that words which did not make it into the 10,000-word vocabulary (such as reproduced and elaborated in this example) have been dropped.
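The punctuation and capitalization disappear because of the Tokenizer's default settings, which we can inspect directly (a small addition for reference):

# By default, the tokenizer lowercases the text and strips these characters
# before splitting on whitespace, which is why the reconstructed text has
# no punctuation or capital letters.
print(tokenizer_obj.filters)
print(tokenizer_obj.lower)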
