From text to matrix representation — the Enron dataset

The Enron email dataset contains approximately 500,000 emails generated by employees of the Enron Corporation. It was obtained by the United States Federal Energy Regulatory Commission during its investigation of Enron's collapse. Enron was an American energy company based in Houston, Texas, that was involved in an accounting fraud scandal that eventually led to its bankruptcy. We will use a subset as an example, but you can access the full dataset (500,000 emails) from Kaggle (https://www.kaggle.com/wcukierski/enron-email-dataset) or from the School of Computer Science at Carnegie Mellon University (https://www.cs.cmu.edu/~./enron/).

For text mining, we will use the packages tm (https://cran.r-project.org/web/packages/tm/index.html) and SnowballC (https://cran.r-project.org/web/packages/SnowballC/index.html). Be sure to install them beforehand:

install.packages("tm")
install.packages("SnowballC")

We start by loading the data frame into our workspace. We will omit some of the preprocessing steps and assume that your data frame has two columns, email and responsive. We hand-label the responsive column for our small sample if it is not available in the original data (not all versions include it). Responsive means, in legal terms, whether the email is relevant to the fraud investigation:

df <- read.csv("./data/enron.csv")
names(df)
[1] "email" "responsive"

We load the tm library and create a corpus object from the email column:

library(tm)
corpus <- Corpus(VectorSource(df$email))

We can access each email with the inspect command, as follows:

inspect(corpus[[1]])

A series of transformations is applied to our data before modeling: converting to lowercase, removing punctuation, removing stop words, and stemming:

# tolower is a base R function, not a tm transformation,
# so it must be wrapped in content_transformer (tm >= 0.6)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
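To see what these transformations actually produce, here is a minimal, self-contained sketch on an invented one-document corpus (the sentence and the toy object names are illustrative, not part of the Enron data):

```r
library(tm)
library(SnowballC)

# A toy one-document corpus (invented sentence for illustration)
toy <- Corpus(VectorSource("The traders were trading aggressively, every day!"))

# The same pipeline as above
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removePunctuation)
toy <- tm_map(toy, removeWords, stopwords("english"))
toy <- tm_map(toy, stemDocument)

# Inspect the transformed text: lowercased, punctuation and stop words
# removed, remaining words reduced to their stems (e.g. "trading" -> "trade")
content(toy[[1]])
```

Note that stemming maps inflected forms to a common root, so "traders" and "trading" end up counted together in the document-term matrix we build next.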

Once this is done, we are ready to obtain a matrix representation of the documents, as follows:

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.97)
X <- as.data.frame(as.matrix(dtm))
X$responsive <- df$responsive
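The 0.97 argument to removeSparseTerms drops terms whose sparsity exceeds 97%, that is, terms absent from more than 97% of the documents. A small invented corpus makes the effect easy to see (the four mini-documents below are illustrative, not Enron data):

```r
library(tm)

# Four invented mini-documents
docs <- c("energy trading energy", "trading report",
          "meeting notes", "energy meeting")
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))
Terms(toy_dtm)   # all five distinct terms are present

# Drop terms with sparsity above 50%: a term that appears in only
# one of the four documents (sparsity 75%) is removed
dense <- removeSparseTerms(toy_dtm, 0.5)
Terms(dense)
```

On the real corpus this pruning is what keeps the matrix to a manageable number of columns; without it, the Enron vocabulary would produce tens of thousands of mostly-zero columns.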

We create the train/test split. For this, we can use the caTools library:

# Train/test split, stratified on the responsive label
library(caTools)
set.seed(42)
spl <- sample.split(X$responsive, SplitRatio = 0.7)
train <- subset(X, spl == TRUE)
test <- subset(X, spl == FALSE)
# Keep only non-responsive emails for training
train <- subset(train, responsive == 0)
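sample.split stratifies on the label vector it is given, so the 70/30 proportions are preserved within each class rather than only overall. A quick self-contained check with synthetic labels (invented for illustration):

```r
library(caTools)

set.seed(42)
# Synthetic, imbalanced labels: 90% class 0, 10% class 1
y <- c(rep(0, 90), rep(1, 10))
spl <- sample.split(y, SplitRatio = 0.7)

# Both classes are split roughly 70/30: about 63/27 zeros and 7/3 ones
table(y, spl)
```

This stratification matters here because responsive emails are rare; a plain random split could easily leave the test set with almost no responsive examples.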