Word embeddings

The order of words in a text matters. Therefore, we can expect higher performance if we do not just look at texts in aggregate but see them as a sequence. This section makes use of many of the techniques discussed in the previous chapter; however, here we're going to add a critical ingredient: word vectors.

Words and word tokens are categorical features. As such, we cannot directly feed them into a neural network. Previously, we have dealt with categorical data by turning it into one-hot encoded vectors. Yet for words, this is impractical. Since our vocabulary is 10,000 words, each vector would contain 10,000 numbers that are all zeros except for one. This is highly inefficient, so instead, we will use an embedding.

In practice, embeddings work like a lookup table. For each token, they store a vector. When a token is given to the embedding layer, the layer returns the vector stored for that token, which is then passed on through the rest of the network. As the network trains, the embeddings get optimized as well.

Remember that neural networks work by calculating the derivative of the loss function with respect to the parameters (weights) of the model. Through backpropagation, we can also calculate the derivative of the loss function with respect to the input of the model. Thus we can optimize the embeddings to deliver ideal inputs that help our model.
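To make the lookup-table idea concrete, here is a minimal NumPy sketch. The sizes are made up for illustration and are not the ones our model uses: the layer stores one row per token, a forward pass simply selects the rows for the input tokens, and training updates exactly those rows.

import numpy as np

vocab_size, embedding_dim = 6, 4                    # toy sizes for illustration
embedding_matrix = np.random.normal(size=(vocab_size, embedding_dim))

tokens = [3, 1, 3]                                  # a short "text" as numeric tokens
vectors = embedding_matrix[tokens]                  # the lookup is just row selection
print(vectors.shape)                                # (3, 4): one vector per token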

Preprocessing for training with word vectors

Before we start with training word embeddings, we need to do some preprocessing steps. Namely, we need to assign each word token a number and create a NumPy array full of sequences.

Assigning numbers to tokens makes the training process smoother and decouples the tokenization process from the word vectors. Keras has a Tokenizer class, which can create numeric tokens for words. By default, this tokenizer splits text by spaces. While this works mostly fine in English, it can cause problems in other languages. The key takeaway is that it's better to tokenize the text with spaCy first, as we already did for our two previous methods, and then assign numeric tokens with Keras.
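If you need to build such a pre-tokenized column yourself, a minimal sketch could look like the following. It assumes a raw text column called df['text'] and that an English spaCy model (here en_core_web_sm) is installed; both names are illustrative rather than taken from our dataset.

import spacy

nlp = spacy.load('en_core_web_sm')  # assumes this English model is installed

def lemmatize(text):
    # Tokenize and lemmatize with spaCy, then join the lemmas with spaces
    # so that the Keras tokenizer's whitespace splitting is harmless.
    return ' '.join(token.lemma_ for token in nlp(text))

# df['joint_lemmas'] = df['text'].apply(lemmatize)  # 'text' is a hypothetical raw-text column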

The Tokenizer class also allows us to specify how many words we want to consider, so once again we will only use the 10,000 most used words, which we can specify by running:

from keras.preprocessing.text import Tokenizer
import numpy as np

max_words = 10000

The tokenizer works a lot like CountVectorizer from sklearn. First, we create a new tokenizer object. Then we fit the tokenizer, and finally, we can transform the text into tokenized sequences:

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df['joint_lemmas'])
sequences = tokenizer.texts_to_sequences(df['joint_lemmas'])

The sequences variable now holds all of our texts as numeric tokens. We can look up the mapping of words to numbers from the tokenizer's word index with the following code:

word_index = tokenizer.word_index
print('Token for "the"',word_index['the'])
print('Token for "Movie"',word_index['movie'])
Token for "the" 4
Token for "Movie" 333

As you can see, frequently used words such as "the" have lower token numbers than less frequent words such as "movie." You can also see that word_index is a dictionary. If you are using your model in production, you can save this dictionary to disk in order to convert words into tokens at a later time.
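For example, the mapping can be stored as JSON; the file name here is just an example:

import json

# Save the word-to-token mapping so new texts can be tokenized identically later.
with open('word_index.json', 'w') as f:
    json.dump(word_index, f)

# In production, load it back with:
# with open('word_index.json') as f:
#     word_index = json.load(f)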

Finally, we need to turn our sequences into sequences of equal length. This is not always necessary, as some model types can deal with sequences of different lengths, but it usually makes sense and is often required. We will examine which models need equal length sequences in the next section on building custom NLP models.

Keras' pad_sequences function allows us to easily bring all of the sequences to the same length by either truncating them or padding them with zeros (with its default settings, both happen at the start of a sequence). We will bring all the tweets to a length of 140 tokens, a nod to the 140-character limit that tweets had for a long time:

from keras.preprocessing.sequence import pad_sequences

maxlen = 140

data = pad_sequences(sequences, maxlen=maxlen)
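A quick sanity check on a toy input (the numbers are made up) shows what pad_sequences does: short sequences are padded with zeros and long ones are truncated, both at the start of the sequence by default.

toy = pad_sequences([[5, 8, 2], list(range(1, 200))], maxlen=5)
print(toy)
# [[  0   0   5   8   2]
#  [195 196 197 198 199]]
print(data.shape)  # (number of texts, 140)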

Finally, we split our data into a training and validation set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, df['relevant'], test_size=0.2, shuffle=True, random_state=42)

Now we are ready to train our own word vectors.

Embeddings are their own layer type in Keras. To use them, we have to specify how large we want the word vectors to be. We will use 50-dimensional vectors, which can capture good embeddings even for fairly large vocabularies. Additionally, we have to specify how many words we want embeddings for and how long our sequences are. Our model is now a simple logistic regressor that trains its own embeddings:

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

embedding_dim = 50

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

Notice that we do not have to specify an input shape for the embedding layer. Specifying the input length is only necessary when a subsequent layer needs to know the sequence length. Here, Flatten feeds into a Dense layer, which requires a fixed input size, so we do need to specify the input length.

Word embeddings have many parameters. You can see this by printing out the model's summary:

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 140, 50)           500000    
_________________________________________________________________
flatten_2 (Flatten)          (None, 7000)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 7001      
=================================================================
Total params: 507,001
Trainable params: 507,001
Non-trainable params: 0
_________________________________________________________________

As you can see, the embedding layer has 50 parameters for each of the 10,000 words, equaling 500,000 parameters in total. This makes training slower and can increase the chance of overfitting.

The next step is for us to compile and train our model as usual:

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(X_test, y_test))

This model achieves about 76% accuracy on the test set but over 90% accuracy on the training set. The large number of parameters in the custom embeddings has caused the model to overfit. To avoid overfitting and reduce training time, it's often better to use pretrained word embeddings.

Loading pretrained word vectors

Like in computer vision, NLP models can benefit from using pretrained pieces of other models. In this case, we will use the pretrained GloVe vectors. GloVe stands for Global Vectors for Word Representation and is a project of the Stanford NLP group. GloVe provides different sets of vectors trained on different corpora.

In this section, we will be using word embeddings trained on Wikipedia texts as well as the Gigaword dataset. In total, the vectors were trained on a corpus of 6 billion tokens.

With all that being said, there are alternatives to GloVe, such as Word2Vec. Both GloVe and Word2Vec are relatively similar, although the training method for them is different. They each have their strengths and weaknesses, and in practice it is often worth trying out both.

A nice feature of GloVe vectors is that they encode word meanings in vector space so that "word algebra" becomes possible. The vector for "king" minus the vector for "man" plus the vector for "woman," for example, results in a vector pretty close to "queen." This means the differences between the vectors for "man" and "woman" are the same as the differences for the vectors of "king" and "queen," as the differentiating features for both are nearly the same.

Equally, words describing similar things, such as "frog" and "toad," are very close to each other in the GloVe vector space. Encoding semantic meaning in vectors offers a range of other exciting opportunities for document similarity and topic modeling, as we will see later in this chapter. Semantic vectors are also pretty useful for a wide range of NLP tasks, such as our text classification problem.

The actual GloVe vectors are in a text file. We will use the 50-dimensional embeddings trained on 6 billion tokens. To do this, we need to open the file:

import os
glove_dir = '../input/glove6b50d'
f = open(os.path.join(glove_dir, 'glove.6B.50d.txt'))

Then we create an empty dictionary that will later map words to embeddings:

embeddings_index = {}

In the dataset, each line represents a new word embedding. The line starts with the word, and the embedding values follow. We can read out the embeddings like this:

for line in f:                                            #1
    values = line.split()                                 #2
    word = values[0]                                      #3
    embedding = np.asarray(values[1:], dtype='float32')   #4
    embeddings_index[word] = embedding                    #5
f.close()                                                 #6

But what does that mean? Let's take a minute to break down the meaning behind the code, which has six key elements:

  1. We loop over all lines in the file. Each line contains a word and embedding.
  2. We split the line by whitespace.
  3. The first thing in the line is always the word.
  4. Then come the embedding values. We immediately transform them into a NumPy array and make sure that they are all floating-point numbers, that is, decimals.
  5. We then save the embedding vector in our embedding dictionary.
  6. Once we are done with it, we close the file.

As a result of running this code, we now have a dictionary mapping words to their embeddings:

print('Found %s word vectors.' % len(embeddings_index))
Found 400000 word vectors.
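Before dealing with words that are missing from GloVe, we can use this dictionary to check the "word algebra" property described earlier. The following is a minimal sketch; scanning all 400,000 words takes a moment, and "king" itself will usually rank first, with "queen" close behind:

target = (embeddings_index['king']
          - embeddings_index['man']
          + embeddings_index['woman'])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank all known words by their similarity to the target vector.
closest = sorted(embeddings_index.items(),
                 key=lambda item: cosine_similarity(item[1], target),
                 reverse=True)[:3]
print([word for word, _ in closest])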

This version of GloVe has vectors for 400,000 words, which should be enough to cover most of the words that we will encounter. However, there might be some words where we still do not have a vector. For these words, we will just create random vectors. To make sure these vectors are not too far off, it is a good idea to use the same mean and standard deviation for the random vectors as from the trained vectors.

To this end, we need to calculate the mean and standard deviation for the GloVe vectors:

all_embs = np.stack(embeddings_index.values())
emb_mean = all_embs.mean()
emb_std = all_embs.std()

Our embedding layer will be a matrix with a row for each word and a column for each element of the embedding. Therefore, we need to specify how many dimensions one embedding has. The version of GloVe we loaded earlier has 50-dimensional vectors:

embedding_dim = 50

Next, we need to find out how many words we actually have. Although we have set the maximum to 10,000, there might be fewer words in our corpus. At this point, we also retrieve the word index from the tokenizer, which we will use later:

word_index = tokenizer.word_index
nb_words = min(max_words, len(word_index))

To create our embedding matrix, we first create a random matrix with the same mean and std as the embeddings:

embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embedding_dim))

Embedding vectors need to be in the same position as their token number. A word with token 1 needs to be in row 1 (rows start with zero), and so on. We can now replace the random embeddings for the words for which we have trained embeddings:

for word, i in word_index.items():                    #1
    if i >= max_words:                                #2
        continue
    embedding_vector = embeddings_index.get(word)     #3
    if embedding_vector is not None:                  #4
        embedding_matrix[i] = embedding_vector

This command has four key elements that we should explore in more detail before we move on:

  1. We loop over all the words in the word index.
  2. If we are above the number of words we want to use, we do nothing.
  3. We get the embedding vector for the word. This operation might return None if there is no embedding for this word.
  4. If there is an embedding vector, we put it in the embedding matrix.

To use the pretrained embeddings, we just have to set the weights in the embedding layer to the embedding matrix that we just created. To make sure the carefully created weights are not destroyed, we are going to set the layer to be non-trainable, which we can achieve by running the following:

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen, weights=[embedding_matrix], trainable=False))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

This model can be compiled and trained just like any other Keras model. You will notice that it trains much faster than the model in which we trained our own embeddings and suffers less from overfitting. However, the overall performance on the test set is roughly the same.
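For completeness, compiling and fitting works exactly as before:

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(X_test, y_test))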

Word embeddings are pretty useful for reducing training time and helping us build accurate models. However, semantic embeddings go further. They can, for example, be used to measure how similar two texts are on a semantic level, even if they include different words.
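As a small, purely illustrative sketch of this idea, we can represent each text by the average of its word vectors and compare those averages with cosine similarity; the two example sentences below are made up:

def text_vector(text):
    # Average the GloVe vectors of all words we have an embedding for.
    vectors = [embeddings_index[w] for w in text.lower().split()
               if w in embeddings_index]
    return np.mean(vectors, axis=0)

a = text_vector('the movie was great')
b = text_vector('the film was fantastic')
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)  # close to 1 despite the different wording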

Time series models with word vectors

Text is a time series. Different words follow each other, and the order in which they do matters. Therefore, every neural network-based technique from the previous chapter can also be used for NLP. In addition, there are some building blocks that were not introduced in Chapter 4, Understanding Time Series, that are useful for NLP.

Let's start with an LSTM, otherwise known as long short-term memory. All you have to change from the implementation in the last chapter is that the first layer of the network should be an embedding layer. The example below uses a CuDNNLSTM layer, which trains much faster than a regular LSTM layer.

Other than this, the layer remains the same. If you do not have a GPU, replace CuDNNLSTM with LSTM:

from keras.layers import CuDNNLSTM
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen, weights=[embedding_matrix], trainable=False))
model.add(CuDNNLSTM(32))
model.add(Dense(1, activation='sigmoid'))

One technique used frequently in NLP but less frequently in time series forecasting is a bidirectional recurrent neural network (RNN). A bidirectional RNN is effectively just two RNNs where one gets fed the sequence forward, while the other one gets fed the sequence backward:

Figure: A bidirectional RNN

In Keras, there is a Bidirectional layer that we can wrap around any RNN layer, such as an LSTM. We achieve this in the following code:

from keras.layers import Bidirectional
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen, weights=[embedding_matrix], trainable=False))
model.add(Bidirectional(CuDNNLSTM(32)))
model.add(Dense(1, activation='sigmoid'))

Word embeddings are great because they enrich neural networks. They are a space-efficient and powerful method that allows us to transform words into numbers that a neural network can work with. With that being said, there are more advantages to encoding semantics as vectors, such as how we can perform vector math on them! This is useful if we want to measure the similarity between two texts, for instance.
