Implementing a benchmark – logistic regression 

Logistic regression might not be the fanciest algorithm in town, but it is certainly one of the most commonly used. It is robust and powerful, yet simple to interpret: unlike many other methods, it lets us look under the hood and see exactly what the model is doing with each feature.

First, we split the observations into a training and a testing set, stratifying on the sentiment label:

# Train/test split, stratified on the sentiment label
library(caTools)
set.seed(42)  # make the split reproducible
spl <- sample.split(X$sentiment, SplitRatio = 0.7)
train <- subset(X, spl == TRUE)
test <- subset(X, spl == FALSE)
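Because sample.split samples within each level of the outcome, both subsets should keep roughly the same class balance. A quick check (a small sketch of our own; the output depends on your data):

# The split is stratified on sentiment, so these two distributions
# should be nearly identical
prop.table(table(train$sentiment))
prop.table(table(test$sentiment))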

Next, we separate the features from the sentiment labels in each set:

X_train <- subset(train, select = -sentiment)  # features only
y_train <- train$sentiment                     # labels
X_test <- subset(test, select = -sentiment)
y_test <- test$sentiment

Now let's fit the model and extract its coefficients:

# Fit the logistic regression and collect the coefficients
# in a data frame, one row per token
model <- glm(y_train ~ ., data = X_train, family = "binomial")
coefs <- as.data.frame(model$coefficients)
names(coefs) <- c("value")
coefs$token <- row.names(coefs)
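This transparency is what makes logistic regression so easy to interpret: each coefficient is the change in the log-odds of positive sentiment per unit increase of the corresponding feature, so exponentiating it yields an odds ratio. A small sketch (the odds_ratio column is our own addition, not from the original text):

# exp() turns a log-odds change into an odds ratio: how much the
# odds of positive sentiment multiply per unit of the feature
coefs$odds_ratio <- exp(coefs$value)
head(coefs[order(-coefs$value), ])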

And see how the model weights each feature, starting with the six largest positive coefficients:

library(ggplot2)
library(dplyr)
coefs %>%
  arrange(desc(value)) %>%
  head %>%
  ggplot(aes(x = token, y = value)) +
  geom_col() +
  coord_flip() +
  theme_bw()

This gives us the following chart:

Features correlated with positive sentiment

Now let's take a look at the features correlated with negative sentiment (try to do it yourself before looking at the code!):

coefs %>%
  arrange(value) %>%
  head %>%
  ggplot(aes(x = token, y = value)) +
  geom_col() +
  coord_flip() +
  theme_bw()

This looks like:

Features correlated with negative sentiment
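One caveat: ggplot2 orders a discrete axis by factor levels, alphabetical by default, so the bars above are not sorted by coefficient size. Wrapping token in reorder() sorts them (a small variation on the plot above, not from the original text):

coefs %>%
  arrange(value) %>%
  head %>%
  # reorder() sets the factor levels by coefficient value,
  # so the bars appear sorted after coord_flip()
  ggplot(aes(x = reorder(token, value), y = value)) +
  geom_col() +
  coord_flip() +
  theme_bw()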

Now let's take a closer look at the performance using the ROC curve. We can write our own function instead of using an extra package:

roc <- function(y_test, y_preds){
  # Sort the true labels by decreasing predicted probability; the
  # cumulative false/true positive rates then trace out the ROC
  # curve as the decision threshold is lowered
  y_test <- y_test[order(y_preds, decreasing = TRUE)]
  data.frame(fpr = cumsum(!y_test) / sum(!y_test),
             tpr = cumsum(y_test) / sum(y_test))
}
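Before applying it to real predictions, a tiny hand-checkable example helps confirm the logic (toy labels and scores of our own, not from the text):

# Sorted by score the labels are 1, 1, 0, 1, 0, so the curve should
# pass through (0, 1/3), (0, 2/3), (0.5, 2/3), (0.5, 1), (1, 1)
roc(c(1, 1, 0, 1, 0), c(0.9, 0.8, 0.7, 0.4, 0.3))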

We can now generate predictions and plot the curve using base R graphics:

y_preds <- predict(model, X_test, type="response")
plot(roc(y_test,y_preds), xlim=c(0,1), ylim=c(0,1))

Or store the points in a data frame and use ggplot2:

roc_df <- roc(y_test, y_preds)
ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_point(color = "red") +
  theme_bw()
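A dashed chance diagonal makes the curve easier to judge against random guessing; geom_abline draws it (an optional variation):

ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_point(color = "red") +
  # the y = x line is what a random classifier would achieve
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  theme_bw()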

The ROC curve looks like this:

ROC curve for logistic regression with bigrams
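Since roc_df already holds the (fpr, tpr) points, a trapezoidal-rule sum gives a quick approximation of the area under the curve (a minimal base-R sketch; the auc helper is our own):

# Trapezoidal rule over the ROC points; we prepend the (0, 0)
# corner that the cumulative sums skip
auc <- function(fpr, tpr) {
  x <- c(0, fpr)
  y <- c(0, tpr)
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}
auc(roc_df$fpr, roc_df$tpr)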

For an exact, well-tested AUC computation there are the ROCR and pROC packages; doing it rigorously in base R is a bit more involved. We will simply set a threshold of 0.5 and compute the precision at that threshold, just to get a feel for what's going on:

labels <- ifelse(y_preds < 0.5, 0, 1)  # hard predictions at 0.5
table(labels, y_test)

This gives:

table(labels, y_test)
      y_test
labels    0    1
     0 2536  896
     1 1214 2854

Reading off the first row, the precision for the negative class is:

2536/(2536+896)
[1] 0.7389277
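The same table arithmetic can be done for both classes in one go (a short sketch; rows of the confusion matrix are predictions, columns are the truth):

cm <- table(labels, y_test)
precision <- diag(cm) / rowSums(cm)  # per predicted class
recall <- diag(cm) / colSums(cm)     # per true class
precision
recall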

For the positive class, the analogous figure is 2854/(1214 + 2854), roughly 0.70. Not bad; in practice, around 80% would make a good classifier, since that is roughly the agreement rate among humans. You can try to improve this benchmark in a few different ways, as suggested in the Exercises section.
