We need to prepare our data for the embedding algorithm.
First, we load the packages we will need:
library(plyr)      # load before dplyr to avoid function masking issues
library(dplyr)     # data manipulation
library(text2vec)  # tokenizers, vocabulary, and co-occurrence matrix
library(tidytext)  # tidy text utilities
library(caret)     # modeling and data-splitting utilities
We will use the IMDb data as before:
imdb <- read.csv("./data/labeledTrainData.tsv", encoding = "UTF-8", quote = "", sep = "\t", stringsAsFactors = FALSE)
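Note that the file is tab-separated, so sep must be "\t" rather than a space. A quick sanity check confirms it parsed correctly (assuming the standard Kaggle layout of id, sentiment, and review columns):
dim(imdb)         # the Kaggle training set should have 25,000 rows and 3 columns
str(imdb$review)  # reviews should be plain character strings, not factors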
And create an iterator over the tokens:
tokens <- space_tokenizer(imdb$review)
token_iterator <- itoken(tokens)
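To see what the tokenizer produces, we can run it on a toy string of our own (not part of the data); each document becomes a character vector of whitespace-separated tokens:
space_tokenizer("the movie was great")  # illustrative only
# [[1]]
# [1] "the"   "movie" "was"   "great"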
The tokens are single words, also known as unigrams. From these tokens we build our vocabulary:
vocab <- create_vocabulary(token_iterator)
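In recent versions of text2vec the vocabulary behaves like a data frame, so we can inspect it directly:
head(vocab)   # one row per term, with term_count and doc_count columns
nrow(vocab)   # number of distinct tokens before pruning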
It's important to build the co-occurrence matrix only from words that appear a meaningful number of times in the corpus, since counts for rare terms are too noisy to be useful. We will keep only terms that occur at least 5 times:
vocab <- prune_vocabulary(vocab, term_count_min = 5)
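Running nrow(vocab) again shows how many terms survive the threshold; the rest have been dropped:
nrow(vocab)   # distinct terms occurring at least 5 times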
We use our pruned vocabulary to create a vectorizer, which maps tokens to vocabulary indices:
vectorizer <- vocab_vectorizer(vocab)
And create the term co-occurrence matrix, using a window of size 5 for the context words:
tcm <- create_tcm(token_iterator, vectorizer, skip_grams_window = 5)
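The result is a sparse, vocabulary-by-vocabulary matrix of weighted co-occurrence counts. We can check its shape and look up an example pair; the two words below are illustrative and assumed to have survived pruning. Since text2vec stores each pair of terms only once, we sum both orientations:
dim(tcm)                                     # one row and one column per vocabulary term
tcm["good", "movie"] + tcm["movie", "good"]  # co-occurrence weight for an example pair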
Now that we have the co-occurrence matrix, let's continue with the vector embedding.