Preparing the data

The data is a subset of the Stanford Large Movie Review dataset, originally published in:

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

This data is available to download at http://ai.stanford.edu/~amaas/data/sentiment/, provided proper credit is given to the original paper. That is the raw data, but a preprocessed version can be found on Kaggle: https://www.kaggle.com/c/word2vec-nlp-tutorial/data.

Let's begin with loading the data:

df <- read.csv("./data/labeledTrainData.tsv", encoding = "utf-8", quote = "", sep = "\t", stringsAsFactors = FALSE)
text <- df$review

We revisit the tm library. Note that the code is slightly different from before: we load the text as a VCorpus instead of a Corpus:

library(tm)
corpus <- VCorpus(VectorSource(text))
inspect(corpus[[1]])

This yields the first review in the data:

> inspect(corpus[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 1681

stuff going moment mj ive started listening music watching odd documentary watched wiz watched moonwalker maybe just want get certain insight guy thought really cool eighties just maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mjs feeling towards press also obvious message drugs bad mkaybr br visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans say made fans true really nice himbr br actual feature film bit finally starts 20 minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pescis character ranted wanted people know supplying drugs etc dunno maybe just hates mjs musicbr br lots ... <truncated>

First, some preprocessing:

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, content_transformer(removeWords), stopwords("english"))
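To make these transformations concrete, here is a base-R approximation of the same pipeline applied to a single sentence (a sketch for illustration only; the sentence and the tiny stopword list are made up, and tm's transformers handle many edge cases these one-liners do not):

```r
s <- "This Movie was GREAT, really great!"
s <- tolower(s)                      # lowercase, like content_transformer(tolower)
s <- gsub("[[:punct:]]", "", s)      # strip punctuation, like removePunctuation
stops <- c("this", "was", "really")  # a made-up stopword list for illustration
tokens <- strsplit(s, "\\s+")[[1]]
tokens <- tokens[!tokens %in% stops] # drop stopwords, like removeWords
cleaned <- paste(tokens, collapse = " ")
print(cleaned)
# "movie great great"
```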

The next step is to create bigrams; as we saw earlier, these capture short phrases that single words miss:

# ngrams() and words() come from the NLP package, loaded alongside tm
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
dtm <- removeSparseTerms(dtm, 0.995)
X <- as.data.frame(as.matrix(dtm))
X$sentiment <- df$sentiment
X$sentiment <- ifelse(X$sentiment < 0.5, 0, 1)
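The bigram construction itself is easy to see in isolation: each bigram pairs a token with its successor. A minimal base-R illustration, independent of tm and using a made-up token vector:

```r
tokens <- c("joe", "pesci", "convincing", "psychopathic")
# Join every token with the one that follows it
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
print(bigrams)
# "joe pesci" "pesci convincing" "convincing psychopathic"
```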

We are now ready to feed this data into a classification model; for instance, logistic regression.
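As a minimal, self-contained sketch, logistic regression can be fitted with base R's glm() on a tiny made-up document-term matrix (the feature names and counts below are invented for illustration; in practice you would fit on the X data frame built above, and with thousands of sparse bigram columns a regularized model such as glmnet is usually preferable):

```r
# Toy stand-in for the document-term matrix X: two bigram-count
# features and a 0/1 sentiment label for six "reviews"
toy <- data.frame(
  really_good = c(3, 2, 4, 0, 1, 0),
  so_bad      = c(0, 1, 0, 3, 2, 2),
  sentiment   = c(1, 1, 1, 0, 0, 0)
)

# Logistic regression: sentiment as a function of all term counts
model <- glm(sentiment ~ ., data = toy, family = binomial)

# Predicted probabilities, thresholded at 0.5 as in the label coding above
probs <- predict(model, type = "response")
preds <- ifelse(probs < 0.5, 0, 1)
print(preds)
```

On the real X, the same call would be glm(sentiment ~ ., data = X, family = binomial), ideally after splitting into training and test sets.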
