We need to prepare our data for the embedding algorithm.
First, we load the packages we will need:
library(plyr)      # load before dplyr to avoid function masking issues
library(dplyr)     # data manipulation
library(text2vec)  # tokenizers, vocabulary, and co-occurrence matrix
library(tidytext)  # tidy text utilities
library(caret)     # modeling and data-splitting utilities
We will use the IMDb data as before:
imdb <- read.csv("./data/labeledTrainData.tsv", encoding = "UTF-8", quote = "", sep = "\t", stringsAsFactors = FALSE)
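Note that the file is tab-separated, so sep must be "\t" rather than a space. A quick sanity check confirms it parsed correctly (assuming the standard Kaggle layout of id, sentiment, and review columns):
dim(imdb)         # the Kaggle training set should have 25,000 rows and 3 columns
str(imdb$review)  # reviews should be plain character strings, not factors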
And create an iterator over the tokens:
tokens <- space_tokenizer(imdb$review)
token_iterator <- itoken(tokens)
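To see what the tokenizer produces, we can run it on a toy string of our own (not part of the data); each document becomes a character vector of whitespace-separated tokens:
space_tokenizer("the movie was great")  # illustrative only
# [[1]]
# [1] "the"   "movie" "was"   "great"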
The tokens are single words, also known as unigrams. From these tokens we build our vocabulary:
vocab <- create_vocabulary(token_iterator)
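In recent versions of text2vec the vocabulary behaves like a data frame, so we can inspect it directly:
head(vocab)   # one row per term, with term_count and doc_count columns
nrow(vocab)   # number of distinct tokens before pruning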
It's important to build the co-occurrence matrix only from words that appear a meaningful number of times in the corpus, since counts for rare terms are too noisy to be useful. We will keep only terms that occur at least 5 times:
vocab <- prune_vocabulary(vocab, term_count_min = 5)
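Running nrow(vocab) again shows how many terms survive the threshold; the rest have been dropped:
nrow(vocab)   # distinct terms occurring at least 5 times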
We use our pruned vocabulary to create a vectorizer, which maps tokens to vocabulary indices:
vectorizer <- vocab_vectorizer(vocab)
And create the term co-occurrence matrix, using a window of size 5 for the context words:
tcm <- create_tcm(token_iterator, vectorizer, skip_grams_window = 5)
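The result is a sparse, vocabulary-by-vocabulary matrix of weighted co-occurrence counts. We can check its shape and look up an example pair; the two words below are illustrative and assumed to have survived pruning. Since text2vec stores each pair of terms only once, we sum both orientations:
dim(tcm)                                     # one row and one column per vocabulary term
tcm["good", "movie"] + tcm["movie", "good"]  # co-occurrence weight for an example pair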
Now that we have the co-occurrence matrix, let's continue with the vector embedding.