If you follow the preceding workflow, and stop from time to time to inspect the results (which you absolutely should, by the way), you will notice that there is a lot of garbage around: words in mixed upper and lower case, punctuation, and so on. What happens if you improve this workflow by properly parsing the words? Instead of the space_tokenizer function from text2vec, you can use the tokenizers library to remove stopwords and punctuation in a single line:
library(tokenizers)
library(stopwords)  # stopwords() below comes from this package (English stopwords by default)
tokens <- tokenize_words(imdb$review, stopwords = stopwords())
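To quickly sanity-check the output, you can look at the first few tokens of the first review (the exact tokens will depend on your copy of the data):
head(tokens[[1]], 10)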
The full code is now:
library(plyr)
library(dplyr)
library(text2vec)
library(tidytext)
library(caret)
imdb <- read.csv("./data/labeledTrainData.tsv"
                 , encoding = "utf-8"
                 , quote = ""
                 , sep = "\t"
                 , stringsAsFactors = F)
# Standard preprocessing: change to lowercase, remove English stopwords and punctuation
library(tokenizers)
library(stopwords)  # supplies the stopwords() helper used below
tokens <- tokenize_words(imdb$review, stopwords = stopwords())
# Create vocabulary. The tokens are simple words here.
token_iterator <- itoken(tokens)
vocab <- create_vocabulary(token_iterator)
# Kill sparse terms
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(token_iterator, vectorizer, skip_grams_window = 5L)
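# Fit 50-dimensional GloVe word vectors on the co-occurrence matrix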
glove <- GlobalVectors$new(word_vectors_size = 50,
                           vocabulary = vocab,
                           x_max = 10)
wv_main <- glove$fit_transform(tcm,
                               n_iter = 10,
                               convergence_tol = 0.05)
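# Represent each review by averaging the vectors of the words it contains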
text <- unlist(imdb$review)
text_df <- data_frame(line = 1:25000, text = text)
text_df <- text_df %>%
unnest_tokens(word, text)
wv <- as.data.frame(wv_main)
wv$word <- row.names(wv)
df <- wv %>% inner_join(text_df)
# Now we need to create the trained matrix
df <- df %>% group_by(line) %>% summarize_all(mean) %>% select(1:51)
df$label <- as.factor(imdb$sentiment)
library(caret)
control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
# Train the different models
set.seed(7)
modelRF <- train(label~., data=df, method="rf", trControl=control)
set.seed(7)
modelGbm <- train(label~., data=df, method="gbm", trControl=control, verbose=FALSE)
set.seed(7)
modelLogitBoost <- train(label~., data=df, method="LogitBoost", trControl=control)
set.seed(7)
modelNaiveBayes <- train(label~., data=df, method="nb", trControl=control)
# collect resamples: this is useful for the plots
results <- resamples(
  list(RF = modelRF,
       GBM = modelGbm,
       LB = modelLogitBoost,
       NB = modelNaiveBayes))
# summarize and check the model performance
summary(results)
bwplot(results)
dotplot(results)
We can see a significant improvement in the results!
> summary(results)
Call:
summary.resamples(object = results)
Models: RF, GBM, LB, NB
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.7820 0.7892 0.7946 0.79340 0.7972 0.8012 0
GBM 0.7904 0.7952 0.7978 0.79732 0.7996 0.8036 0
LB 0.6904 0.6978 0.7040 0.70388 0.7098 0.7176 0
NB 0.6728 0.6810 0.6900 0.68824 0.6957 0.7008 0
The dotplot of the cross-validated accuracies is shown as follows:
And the boxplot:
Remember, when it comes to software in general (and data science is no exception): garbage in, garbage out.
As we mentioned earlier, vector embeddings are not deep learning proper, but rather a feature representation method. However, we can combine vector embeddings with deep neural networks and, hopefully, get better results.
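To make this idea concrete, here is a minimal sketch of how the averaged GloVe features in df could be fed into a small feed-forward network using the keras package; it assumes keras is installed and configured, and the layer sizes, dropout rate, and number of epochs are illustrative rather than tuned:
library(keras)
# Hypothetical example: use every column of df except the line index and the label as features
feature_cols <- setdiff(names(df), c("line", "label"))
x <- as.matrix(df[, feature_cols])
y <- as.numeric(as.character(df$label))  # 0/1 sentiment labels
# A small feed-forward network on top of the document-level embeddings
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = length(feature_cols)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = "adam",
  metrics = "accuracy"
)
model %>% fit(x, y, epochs = 10, batch_size = 128, validation_split = 0.2)
Note that this network only sees the same 50 averaged dimensions per review as the caret models above, so it is a drop-in replacement for those classifiers rather than an end-to-end deep learning pipeline.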