The importance of data cleansing

If you follow the preceding workflow and stop from time to time to inspect the results (which you absolutely should, by the way), you will notice that there is a lot of garbage around: words in mixed upper and lower case, punctuation attached to tokens, and so on. What happens if you improve this workflow by parsing the words properly? You can use the tokenizers library instead of the space_tokenizer function from text2vec to lowercase the text and remove stopwords and punctuation in a single line:

library(tokenizers)
library(stopwords)   # provides the stopwords() word lists
tokens <- tokenize_words(imdb$review, stopwords = stopwords("en"))
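To see what the cleaner tokenizer buys you, compare its output with the plain space_tokenizer on a single review. This is only an illustrative sketch; the exact tokens depend on your copy of the dataset:

library(text2vec)   # for space_tokenizer
# Raw tokens: mixed case, punctuation still attached to the words
head(space_tokenizer(imdb$review[1])[[1]], 10)
# Cleaned tokens: lowercased, punctuation and English stopwords removed
head(tokenize_words(imdb$review[1], stopwords = stopwords("en"))[[1]], 10)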

The full code is now:

library(plyr)
library(dplyr)
library(text2vec)
library(tidytext)
library(caret)

imdb <- read.csv("./data/labeledTrainData.tsv",
                 encoding = "utf-8",
                 quote = "",
                 sep = "\t",
                 stringsAsFactors = FALSE)

# Standard preprocessing: convert to lowercase, remove English stopwords and punctuation
library(tokenizers)
library(stopwords)
tokens <- tokenize_words(imdb$review, stopwords = stopwords("en"))


# Create vocabulary. The tokens are simple words here.
token_iterator <- itoken(tokens)
vocab <- create_vocabulary(token_iterator)

# Kill sparse terms
vocab <- prune_vocabulary(vocab, term_count_min = 5L)

vectorizer <- vocab_vectorizer(vocab)

# use window of 5 for context words
tcm <- create_tcm(token_iterator, vectorizer, skip_grams_window = 5L)

glove <- GlobalVectors$new(word_vectors_size = 50,
                           vocabulary = vocab,
                           x_max = 10)
wv_main <- glove$fit_transform(tcm,
                               n_iter = 10,
                               convergence_tol = 0.05)

text <- unlist(imdb$review)

text_df <- data_frame(line = 1:25000, text = text)

text_df <- text_df %>%
unnest_tokens(word, text)

wv <- as.data.frame(wv_main)

wv$word <- row.names(wv)

df <- wv %>% inner_join(text_df, by = "word")

# Average the word vectors within each review to build the document feature matrix
df <- df %>% group_by(line) %>% summarize_all(mean) %>% select(1:51)
df$label <- as.factor(imdb$sentiment)

library(caret)

# 10-fold cross-validation; note that repeats only applies to method = "repeatedcv"
control <- trainControl(method = "cv", number = 10)


# Train the different models
set.seed(7)
modelRF <- train(label~., data=df, method="rf", trControl=control)

set.seed(7)
modelGbm <- train(label~., data=df, method="gbm", trControl=control, verbose=FALSE)

set.seed(7)
modelLogitBoost <- train(label~., data=df, method="LogitBoost", trControl=control)

set.seed(7)
modelNaiveBayes <- train(label~., data=df, method="nb", trControl=control)

# collect resamples: this is useful for the plots
results <- resamples(list(RF = modelRF,
                          GBM = modelGbm,
                          LB = modelLogitBoost,
                          NB = modelNaiveBayes))


# summarize and check the model performance
summary(results)
bwplot(results)
dotplot(results)
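Before looking at the classifier scores, it is worth stopping to inspect the embeddings themselves, as suggested at the beginning of this section. A minimal sketch using text2vec's sim2 function; the query word awful is only an example and needs to have survived the vocabulary pruning:

# Nearest neighbours of an example word by cosine similarity
query <- wv_main["awful", , drop = FALSE]
cos_sim <- sim2(wv_main, query, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10)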

We can see a significant improvement in the results!

> summary(results)

Call:
summary.resamples(object = results)

Models: RF, GBM, LB, NB
Number of resamples: 10

Accuracy
      Min. 1st Qu. Median    Mean 3rd Qu.   Max. NA's
RF  0.7820  0.7892 0.7946 0.79340  0.7972 0.8012    0
GBM 0.7904  0.7952 0.7978 0.79732  0.7996 0.8036    0
LB  0.6904  0.6978 0.7040 0.70388  0.7098 0.7176    0
NB  0.6728  0.6810 0.6900 0.68824  0.6957 0.7008    0

The dotplot and boxplot here are shown as follows:

Dotplot showing the different classifiers after proper preprocessing

And the boxplot:

Boxplot showing the different classifiers after proper preprocessing

Remember that this holds for software in general, and data science is no exception:

Garbage in, garbage out: always pay attention to the data you feed your models! Even the most powerful models available will produce unsatisfactory results when fed the wrong data. It's important to log, or at least print to the console, every preprocessing step you apply. Treating models as a magic black box is very dangerous, and unfortunately all too common.
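As a minimal sketch of what such logging can look like, reusing the objects from the listing above (the messages themselves are just illustrative):

# Illustrative sanity logging around the preprocessing steps
message("Documents read: ", nrow(imdb))
message("Empty reviews after tokenization: ", sum(lengths(tokens) == 0))
vocab_raw <- create_vocabulary(itoken(tokens))
vocab_pruned <- prune_vocabulary(vocab_raw, term_count_min = 5L)
message("Vocabulary size before pruning: ", nrow(vocab_raw))
message("Vocabulary size after pruning: ", nrow(vocab_pruned))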

As we mentioned earlier, vector embeddings are not deep learning proper, but rather a feature representation method. However, we can combine vector embeddings with deep neural networks and, hopefully, get better results.
