From words to vectors

We are ready to create the word embedding using GloVe. First, let's initialize an instance of the GlobalVectors class:

glove <- GlobalVectors$new(word_vectors_size = 50,
                           vocabulary = vocab,
                           x_max = 10)
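This assumes that the vocabulary (vocab) and the term co-occurrence matrix (tcm) were already built from the reviews in the previous section. As a reminder, here is a minimal sketch of how they can be produced with text2vec; the exact preprocessing choices (pruning threshold, window size) are illustrative and may differ from those used earlier:

library(text2vec)

# Tokenize the reviews and build an iterator over the tokens
tokens <- space_tokenizer(unlist(imdb$review))
it <- itoken(tokens, progressbar = FALSE)

# Create the vocabulary and prune rare terms (threshold is illustrative)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5)

# Term co-occurrence matrix with a symmetric window of 5 words
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)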

We now apply the fit_transform method (scikit-learn users might be familiar with it):

wv_main <- glove$fit_transform(tcm,
                               n_iter = 10,
                               convergence_tol = 0.01)
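If everything went well, wv_main is now a matrix with one row per vocabulary term and 50 columns (our word_vectors_size); a quick sanity check:

dim(wv_main)
# number of vocabulary terms x 50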

Once this is done, our word vectors are ready. We now need to prepare our text:

text <- unlist(imdb$review)
length(text)
# 25000
text_df <- data_frame(line = 1:25000, text = text)

And apply the unnest_tokens function from tidytext to turn our data into a tidy format:

text_df <- text_df %>%
  unnest_tokens(word, text)
head(text_df)

This gives a familiar output:

head(text_df)
# A tibble: 6 x 2
   line word
  <int> <chr>
1     1 with
2     1 all
3     1 this
4     1 stuff
5     1 going
6     1 down

But wait, what about the GloVe vectors? Let's take a look:

head(wv_main[, 1:3])
                    [,1]         [,2]         [,3]
overpowered   0.03408282 -0.225022092  0.077734992
nears         0.65971708 -0.005281781 -0.100175403
producers)    0.46528772  0.063937798 -0.165794402
Daddy,        0.06035958 -0.076200403  0.008196513
rhetoric,    -0.05500082  0.149410397 -0.314875215
Johnsons'     0.43385875  0.078220785 -0.177165091

Actually, the text2vec package produces two sets of vectors: fit_transform returned the main word vectors, and the context vectors are stored in the components field:

wv_context <- glove$components
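Note that components stores the context vectors transposed, with one row per dimension and one column per vocabulary term, which is why it gets transposed before being combined with wv_main below:

dim(wv_context)
# 50 x number of vocabulary terms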

We can use either wv_main or wv_context as our word embedding, but it sometimes helps (according to the GloVe paper) to combine them. So, you can create a wv object as the sum or average of these two matrices, for instance:

wv <- wv_main + t(wv_context)
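Or, if you prefer the average of the two:

wv <- (wv_main + t(wv_context)) / 2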

Let's use only wv_main for now. We need to coerce the matrix to a data frame and add its row names as a column, so that we can join it with our text in tidy format:

wv <- as.data.frame(wv_main)
wv$word <- row.names(wv)

And finally, put these two together:

df <- wv %>% inner_join(text_df, by = "word")

This is still not ready to use, as we need to aggregate the vectors within each review. One possibility is to take the average vector as representative: the vector of a review is the average of the vectors of all the words that compose it. We will take this approach here and suggest some other possibilities in the exercises:

df <- df %>%
  select(-word) %>%     # drop the word column before averaging
  group_by(line) %>%
  summarize_all(mean)
df$label <- as.factor(imdb$sentiment)

This data is now ready to be passed to different classifiers, with which we can predict sentiment polarity (positive/negative).
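As a quick illustration of that last step, here is a minimal sketch using glmnet for a regularized logistic regression. The choice of glmnet is ours, not prescribed by the text, and the sketch assumes the embedding columns are named V1 to V50 (the default names as.data.frame gives to an unnamed matrix):

library(glmnet)

# Features: the 50 averaged GloVe dimensions; response: the sentiment label
x <- as.matrix(select(df, starts_with("V")))
y <- df$label

# Simple train/test split
set.seed(42)
train_idx <- sample(nrow(x), floor(0.8 * nrow(x)))

# Cross-validated logistic regression with lasso regularization
fit <- cv.glmnet(x[train_idx, ], y[train_idx], family = "binomial")

# Accuracy on the held-out reviews
pred <- predict(fit, x[-train_idx, ], type = "class")
mean(pred == y[-train_idx])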
