The first step in the text classification process is to figure out how to represent the document in a manner that is suitable for classification tasks and learning algorithms. This step is intended to reduce the complexity of documents and make them easier to work with. While doing so, several questions arise: how should words be represented, which words should be kept as features, and how should they be weighted?
An attribute-value representation of documents assumes that the order of words in a document is not of high significance: each unique word in a document is treated as a feature, and its frequency of occurrence is the value. Further, discarding features that occur very rarely in the document collection can reduce the high dimensionality.
The vector space representation considers each document as a vector of terms. The attribute values may take a Boolean form, the set-of-words approach, which captures only the presence or absence of a word in a document, that is, whether a particular word exists in a document or not. Apart from this representation, tf*idf (term frequency multiplied by inverse document frequency) term weights can also be estimated and used as features.
Vector space models have some limitations too. For example, the document representation is very high dimensional, and it may lead to the loss of semantic relations or correlations between the different terms in the given documents. Analyzing text documents is tedious because the features are treated as nearly independent, and the vocabulary size is huge, sometimes even larger than the number of instances. Certain words occur many times, while others occur only a few times. How do we weight the common words compared to the rare ones? Thus, document representation becomes an important part of this process.
Vector space models are based on the idea that we can derive the theme or meaning of a document from the terms that constitute it. If we represent each document as a vector, each unique term in the collection acts as a dimension in the space.
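A minimal sketch of these two representations using the tm package (the toy documents below are made up for illustration):

```r
library(tm)

# Three toy documents
docs <- c("the economy needs jobs",
          "health care and jobs",
          "the economy and health care")
corpus <- VCorpus(VectorSource(docs))

# Bag-of-words document-term matrix: each column is a term,
# each cell holds the raw term frequency
dtm.tf <- DocumentTermMatrix(corpus)

# The same matrix with tf*idf weights instead of raw counts
dtm.tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

inspect(dtm.tf)
```

Note that a term occurring in every document receives a tf*idf weight of zero, reflecting that it does not discriminate between documents.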
Feature hashing is a very efficient way to store data and reduce the memory footprint. It requires less preprocessing, and it is very effective when dealing with large datasets. The Document Term Matrix holds all the data in memory; when we apply a sparse-matrix function to it, only the non-zero values are retained. If the Document Term Matrix has a lot of zeroes, we can create a sparse matrix and reduce the memory footprint, but the sparse-matrix function must scan the entire data space to find the non-zero elements it has to hold. Feature hashing is more efficient because it does not need a full scan: by computing hashes of the features, it can do this preprocessing at runtime, on the fly. When we are dealing with text data, very large feature matrices are formed. To reduce them to the more relevant features, we usually remove the sparse terms, or keep the most frequent terms and discard the rest. In these kinds of solutions we lose data, but with feature hashing we don't lose any data.
Feature hashing is a very useful tool when the user does not know the dimension of the feature vector in advance. When performing text analysis, most of the time we treat the document as a bag-of-words representation and don't give much importance to the sequence in which the words appear. In this kind of document classification problem, we would otherwise have to scan the entire dataset to count the distinct words and find the dimension of the feature vector. In general, feature hashing is useful when we are working on streaming or distributed data, because it is difficult to know the real dimension of the feature vector up front.
The hash size tells us how big the feature space should be, that is, the number of columns in the Document Term Matrix. We need to keep in mind the memory requirements and the speed of algorithm execution while selecting the hash size; we may choose this value empirically, through trial and error. If it is too small, it may cause collisions (different features mapped to the same column); if it is too big, processing may become slow. There are text classification models that perform well with feature hashing.
glmnet and xgboost are a few of the R packages that support it.
Datasets supported by feature hashing include:
Wush Wu developed the FeatureHashing package, available on CRAN. The following is the definition from the package documentation:
"Feature hashing, also called the hashing trick, is a method to transform features to vector. Without looking up the indices in an associative array, it applies a hash function to the features and uses their hash values as indices directly. The method of feature hashing in this package was proposed in Weinberger et. al. (2009). The hashing algorithm is the murmurhash3 from the digest package. Please see the README.md for more information."
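As a minimal sketch of the package in use (the data frame and hash size below are made up for illustration; 2^6 is far smaller than a realistic hash size):

```r
library(FeatureHashing)

# Toy data: one categorical feature to be hashed
df <- data.frame(word = c("economy", "health", "economy", "tax"),
                 count = c(2, 1, 3, 1))

# Hash the 'word' feature directly into a sparse matrix with
# 2^6 = 64 columns -- no dictionary of terms is kept in memory
m <- hashed.model.matrix(~ word, data = df, hash.size = 2^6)
dim(m)
```

If hash.size is much smaller than the vocabulary, distinct words will start to collide into the same column, which is the trade-off discussed above.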
Classification, or supervised learning, in machine learning is the process of learning concepts or patterns that generalize the relationship between the dependent and independent variables, given labeled or annotated data. A typical text classification task can be defined as follows:
Let's say, we have an opinion classification problem at hand. The training data contains texts as instances and opinions (positive or negative) as the outcome variable. The objective is to design a learning mechanism to be able to utilize the patterns in the training data to predict/label the outcome variable in an unknown or test dataset.
In order to learn the target concept or pattern, each instance t, along with the associated concept c(t) from the training set, is presented to the learner. The task assigned to the learning mechanism is to estimate the function c, such that the target concept stays generalized over most of the training instances and can be applied to unknown instances with high precision. The classifier creates a hypothesis for every training instance presented to it, in conjunction with the associated label for that instance. Once all the instances are observed, we have a large set of generated hypotheses, which are very specific in nature. From the inductive learning hypothesis principle, we know that if a hypothesis can effectively approximate the target concept over a sufficiently large number of training instances, it will also effectively approximate it over an unknown set of instances. Such a hypothesis needs to be a generalized one: a highly specific hypothesis may approximate the concept over the observed instances, but it would prove insufficient over unobserved instances.
Let's take a simple example to explain the concept of generalized and specific hypotheses. Suppose we have some data, corresponding to information about rain on a given day. The attributes given to us are temperature, humidity, and wind speed.
Let's consider two hypotheses trying to approximate the target concept of whether it will rain on a given day; for illustration, suppose h1 = ⟨Temperature = High, Humidity = High, Wind = Low⟩ and h2 = ⟨?, Humidity = High, ?⟩, where ? matches any value.
The first hypothesis is very specific in nature, and it is highly unlikely that it can be approximated over more than just a few instances. The second hypothesis is comparatively less specific and more generalized, so it is more likely to fit a larger number of instances.
Once all the possible hypotheses are generated, the learning mechanism sorts them in general-to-specific ordering. As we have discussed, it is desired to have more generalized hypotheses than specific ones. Given the large number of hypotheses generated, how do we go about choosing the optimal hypotheses?
There are a few methods recommended for this issue (Tom M. Mitchell, 1997). The Candidate Elimination, Find-S, and List-Then-Eliminate algorithms are used to eliminate the redundant and highly specific hypotheses and learn a generalized hypothesis. The details of these algorithms are not in the scope of this book. The basic idea behind these algorithms is to search for hypotheses in the version space, which is a subset of the set of all possible hypotheses from the training data, and compare them to the other training instances to see how well they generalize over multiple instances. Subsequently, they keep eliminating the redundant hypotheses or keep merging specific ones to give a more generalized structure.
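The general-to-specific idea can be illustrated with a toy Find-S sketch (the data frame and helper function below are made up for illustration and omit the bookkeeping of the full algorithm):

```r
# Toy Find-S: start from the first positive example (the most specific
# hypothesis) and generalize an attribute to '?' whenever another
# positive example disagrees with it
train <- data.frame(
  temperature = c("high", "high", "low"),
  humidity    = c("high", "high", "high"),
  wind        = c("low",  "high", "high"),
  rain        = c("yes",  "yes",  "no"),
  stringsAsFactors = FALSE)

find_s <- function(data, outcome) {
  pos <- data[data[[outcome]] == "yes", setdiff(names(data), outcome)]
  h <- unlist(pos[1, ], use.names = FALSE)   # most specific hypothesis
  for (i in seq_len(nrow(pos))) {
    x <- unlist(pos[i, ], use.names = FALSE)
    h[h != x] <- "?"                         # generalize where they differ
  }
  h
}

find_s(train, "rain")   # "high" "high" "?"
```

The result reads as ⟨Temperature = high, Humidity = high, Wind = ?⟩: the wind attribute has been generalized because the two rainy days disagreed on it.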
In machine learning, the decision tree is a well-known classification algorithm. In this classifying methodology, we create a decision tree model; when we provide an input, the prediction traverses the nodes, based on the input variables, until it reaches a leaf node, which is nothing but the predicted class. When the target variable has a finite set of values, it is called a classification tree, which we will discuss in this section. If the target variable takes continuous values, it is called a regression tree.
Let us understand some basic terminology used in decision trees. The decision tree is constructed based on input variables; in technical terms, these input variables are also called attributes, and changing them alters the classification output. In text mining, if we are classifying a set of documents into different topics, the significant words in the documents become the attributes, and the topics, which are the resulting outcomes, become the classes.
The following is a simple diagram of a decision tree:
The tree is made up of branches and nodes. Every branch from a node signifies an outcome of a test on one of the attributes. The nodes that do not have any branches are called leaf nodes; these nodes are at the end of the tree, and they represent the final classes of the classification problem. Decision trees are built by recursively splitting the training data so that each division tends towards a single class. Every node that is not a leaf node has branches, and each branch is the outcome of one or more attributes, which further influences how the data will be divided. The split is made in such a way that the data in each division is as distinct as possible; a division containing only one class is called a pure leaf node. Each division is either pure, or we can improve generalization by pruning the tree.
Partitioning of the data at a node depends on the attribute used to split it. In order to construct the tree, we need to select the best splitting attribute at each step. There are various algorithms used to build a decision tree; at a high level, each tries to solve this selection problem in its own optimal way.
In order to perform an optimal split and evaluate the goodness of the split, there are various methods, such as information gain (based on entropy), the Gini index, and the gain ratio.
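Two of these impurity measures can be computed directly; a sketch (the class counts below are made up for illustration):

```r
# Gini index and entropy of a node's class distribution p
gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))

# A node holding 13 documents of one class and 14 of the other
p <- c(13, 14) / 27
round(gini(p), 3)      # 0.499 -> nearly maximal impurity
round(entropy(p), 3)   # 0.999 -> nearly maximal impurity

# A pure node
gini(c(1, 0))          # 0
```

A good split is one whose child nodes have much lower impurity than the parent; the attribute with the largest impurity reduction is chosen.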
Some of the well-known algorithms used in building decision trees are ID3, C4.5, CART, and CHAID.
There are various pros and cons of decision trees. The pros are that they are easy to interpret and visualize, require relatively little data preparation, and can handle both numerical and categorical attributes.
Some cons are that the generated decision trees may become very complex and may not generalize well. However, pruning techniques can be applied to resolve this.
In the following code, we detect whether a speech was given by Obama or Romney based on the input document. We will use the rpart library to build the decision tree. The following is the rpart code for decision trees:
library(tm)
library(rpart)

obamaCorpus <- Corpus(DirSource(directory = "D:/R/Chap 6/Speeches/obama", encoding = "UTF-8"))
romneyCorpus <- Corpus(DirSource(directory = "D:/R/Chap 6/Speeches/romney", encoding = "UTF-8"))

fullCorpus <- c(obamaCorpus, romneyCorpus)  # documents 1-22 (obama), 23-44 (romney)

fullCorpus.cleansed <- tm_map(fullCorpus, removePunctuation)
fullCorpus.cleansed <- tm_map(fullCorpus.cleansed, stripWhitespace)
fullCorpus.cleansed <- tm_map(fullCorpus.cleansed, tolower)
fullCorpus.cleansed <- tm_map(fullCorpus.cleansed, removeWords, stopwords("english"))
fullCorpus.cleansed <- tm_map(fullCorpus.cleansed, PlainTextDocument)

full.dtm <- DocumentTermMatrix(fullCorpus.cleansed)
full.dtm.spars <- removeSparseTerms(full.dtm, 0.6)

full.matrix <- data.matrix(full.dtm.spars)
full.df <- as.data.frame(full.matrix)

full.df[, "SpeakerName"] <- "obama"
full.df$SpeakerName[23:44] <- "romney"  # documents 23-44 are Romney's speeches

train.idx <- sample(nrow(full.df), ceiling(nrow(full.df) * 0.6))
test.idx <- (1:nrow(full.df))[-train.idx]
full.df.train <- full.df[train.idx, ]
full.df.test <- full.df[test.idx, ]
Get the terms that are used at least 70 times in the corpus:

freqterms70 <- findFreqTerms(full.dtm.spars, 70)
Build a formula from these frequent terms and fit the tree using the rpart function:

outcome <- "SpeakerName"
formula_str <- paste(outcome, paste(freqterms70, collapse = " + "), sep = " ~ ")
formula <- as.formula(formula_str)
fit <- rpart(formula, method = "class", data = full.df.train, control = rpart.control(minsplit = 5, cp = 0.001))
print(fit)
The output of print(fit), followed by the cp (complexity parameter) table from printcp(fit), is as follows:

n= 27
 1) root 27 13 romney (0.4814815 0.5185185)
   2) health>=2.5 8 0 obama (1.0000000 0.0000000) *
   3) health< 2.5 19 5 romney (0.2631579 0.7368421)
     6) america< 3 3 0 obama (1.0000000 0.0000000) *
     7) america>=3 16 2 romney (0.1250000 0.8750000)
      14) one>=10 2 0 obama (1.0000000 0.0000000) *
      15) one< 10 14 0 romney (0.0000000 1.0000000) *

printcp(fit)

Classification tree:
rpart(formula = formula, data = full.df.train, method = "class",
    control = rpart.control(minsplit = 5, cp = 0.001))

Variables actually used in tree construction:
[1] america health one

Root node error: 13/27 = 0.48148

n= 27

       CP nsplit rel error  xerror    xstd
1 0.61538      0   1.00000 1.53846 0.17516
2 0.23077      1   0.38462 0.53846 0.17516
3 0.15385      2   0.15385 0.76923 0.19302
4 0.00100      3   0.00000 0.69231 0.18842
par(mfrow = c(1, 2), xpd = NA)
plot(fit)
text(fit, use.n = TRUE)
The preceding image depicts the decision tree just created. At each node, the tree tests the count of an attribute (a term) in the speech to decide whether it is an Obama or a Romney speech.
Naive Bayes classifiers are probabilistic classifiers built using the Bayes theorem. Naive Bayes is also known as a prior probability and class-conditional probability classifier, since it combines the prior probability of each class with the class-conditional probabilities of the features to generate a posterior probability distribution over the classes.
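In standard notation (not taken from this book's text), for a document d with terms t_1, ..., t_n and a class c, Bayes' theorem and the naive independence assumption give:

```latex
P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}, \qquad
\hat{c} = \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(t_i \mid c)
```

Here P(c) is the prior probability of the class and P(t_i | c) is the class-conditional probability of each term; P(d) can be dropped when comparing classes, because it is the same for all of them.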
This classifier makes the following assumptions about the data: the features are conditionally independent of each other given the class (the naive assumption), and each feature contributes equally, and independently of its position, to the outcome.
Though these assumptions may not hold in real-world scenarios, Naive Bayes is still used in many text classification applications, such as spam filtering, sentiment analysis, and document categorization.
This is because the classifier has its own strengths: it is simple and fast to train, scales well to high-dimensional data, and performs reasonably well even with relatively little training data.
Let's do a hands-on exercise on the Naive Bayes classifier.
E-mail is one of the most widely used applications for communication. The following is a screenshot of an inbox:
Our inbox is cluttered with a lot of e-mails every day; some of them are important, while others are promotional e-mails, phishing e-mails, spam e-mails, and so on.
Let's build a spam classifier that can reduce the clutter by segregating the spam from the real important ones:
install.packages("caret")
require(caret)
install.packages("kernlab")
require(kernlab)
install.packages("e1071")
require(e1071)
install.packages("tm")
require(tm)
install.packages("plyr")
require(plyr)
pathName <- "D:/R/Chap 6/enron1"
The spam folder contains all the mails that are spam, and the ham folder contains all the mails that are legitimate:

emailSubDir <- c("ham","spam")
Let's write a function that can build the Term Document Matrix from the text input.
GenerateTDMForEMailCorpus <- function(subDir, path){
  mailDir <- paste(path, subDir, sep = "/")
  # Use DirSource since we are dealing with directories, with UTF-8 encoding
  mailCorpus <- Corpus(DirSource(directory = mailDir, encoding = "UTF-8"))
  mail.tdm <- TermDocumentMatrix(mailCorpus)
  mail.tdm <- removeSparseTerms(mail.tdm, 0.7)
  # Return the result: the TDM together with its label (spam or ham)
  result <- list(name = subDir, tdm = mail.tdm)
}
Let's write a function that can convert the Term Document Matrix to a data frame.
BindMailTypeToTDM <- function(individualTDM){
  mailMatrix <- t(data.matrix(individualTDM[["tdm"]]))
  mailDataFrame <- as.data.frame(mailMatrix, stringsAsFactors = FALSE)
  mailDataFrame <- cbind(mailDataFrame, rep(individualTDM[["name"]], nrow(mailDataFrame)))
  colnames(mailDataFrame)[ncol(mailDataFrame)] <- "MailType"
  return(mailDataFrame)
}

tdmList <- lapply(emailSubDir, GenerateTDMForEMailCorpus, path = pathName)
mailDataFrame <- lapply(tdmList, BindMailTypeToTDM)
allMailDataFrame <- do.call(rbind.fill , mailDataFrame)
allMailDataFrame[is.na(allMailDataFrame)] <- 0
# Reorder the columns (hardcoded for this dataset) so that the outcome column is last
allMailDataFrame_ordered <- allMailDataFrame[ ,c(1:18,20:23,19)]
train.idx <- sample(nrow(allMailDataFrame_ordered) , ceiling(nrow(allMailDataFrame_ordered)* 0.6))
test.idx <- (1:nrow(allMailDataFrame_ordered))[-train.idx]
allMailDataFrame.train <- allMailDataFrame_ordered[train.idx, ]
allMailDataFrame.test <- allMailDataFrame_ordered[test.idx, ]
trainedModel <- naiveBayes(allMailDataFrame.train[, 1:22], allMailDataFrame.train[, 23])
prediction <- predict(trainedModel, allMailDataFrame.test)
confusionMatrix <- confusionMatrix(prediction, allMailDataFrame.test[, 23])
confusionMatrix
          Reference
Prediction  ham spam
      ham   855    1
      spam  626  586

               Accuracy : 0.6968
                 95% CI : (0.6765, 0.7166)
    No Information Rate : 0.7162
    P-Value [Acc > NIR] : 0.9753

Plot the confusion matrix, using the counts from the preceding output:

ctable <- as.table(matrix(c(855, 1, 626, 586), nrow = 2, byrow = TRUE))
fourfoldplot(ctable, color = c("#CC6666", "#99CC99"), conf.level = 0, margin = 1, main = "Confusion Matrix")
Here is the depiction of the Confusion Matrix:
The k-nearest neighbors algorithm (k-NN) works on the principle of distance functions for a given pair of points. It is very easy to implement and is non-parametric in nature. In the k-nearest neighbor classifier, k is an integer greater than zero; this simple classification technique finds the k data points in a dataset nearest to a given data point. The biggest challenge with this classifier is finding the optimal value of k, which depends on the data. k-NN uses all the features when computing the distance, which increases the complexity of searching for the nearest neighbors; this is one of its major drawbacks, since all the attributes or features in the dataset may not be significant. Thus, weighting the features according to their significance may increase the classifier's accuracy.
The error rate of the k-NN classifier converges, for large training sets and suitably growing k, towards the optimal Bayes error rate R*.
In the case where k = 1, it is called the nearest neighbor classifier. For two-class problems, its asymptotic error rate R satisfies the classical bound R* <= R <= 2R*(1 - R*) (Cover and Hart, 1967).
Let x = (x1, x2, ..., xn) be a training point; then, given a point a = (a1, a2, ..., an) to be classified, we identify the k observations in the training dataset that are most similar to a. Neighbors are defined by a distance, calculated between observations based on the independent variables. There are various ways to calculate the distance between points; one of them is the Euclidean distance.
The Euclidean distance between the points x = (x1, x2, ..., xn) and a = (a1, a2, ..., an) is defined as:

d(x, a) = sqrt((x1 - a1)^2 + (x2 - a2)^2 + ... + (xn - an)^2)
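A one-line sketch of this distance in R (the function name and sample points are made up for illustration):

```r
# Euclidean distance between two numeric vectors
euclid <- function(x, a) sqrt(sum((x - a)^2))

euclid(c(1, 2, 3), c(4, 6, 3))   # 5
```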
For each n-dimensional object, the Euclidean distances between it and all the training objects are calculated, and the object is assigned the class label held by the majority of its k closest training objects. The curse of dimensionality and large feature sets are a problem for k-NN.
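A minimal sketch of this majority-vote classification with the class package (the two Gaussian clusters below are made up for illustration):

```r
library(class)

set.seed(42)
# Two clusters of 10 points each, centred at (0, 0) and (3, 3)
train <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
               matrix(rnorm(20, mean = 3), ncol = 2))
labels <- factor(rep(c("a", "b"), each = 10))

# Classify two new points by the majority label of their 3 nearest neighbours
test <- rbind(c(0, 0), c(3, 3))
knn(train, test, labels, k = 3)
```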
Let us classify the speeches of US presidential candidates using k-nn:
install.packages("class")
require(class)
install.packages("tm")
require(tm)
install.packages("plyr")
require(plyr)
install.packages("caret")
require(caret)
options(stringsAsFactors = FALSE)
speechDir <- c("romney","obama")
The obama folder contains all the speeches from Obama, and the romney folder contains all the speeches from Romney:

pathToSpeeches <- "D:/R/Chap 6/Speeches"
CleanSpeechText <- function(speechText){
  speechText.cleansed <- tm_map(speechText, removePunctuation)
  speechText.cleansed <- tm_map(speechText.cleansed, stripWhitespace)
  speechText.cleansed <- tm_map(speechText.cleansed, tolower)
  speechText.cleansed <- tm_map(speechText.cleansed, removeWords, stopwords("english"))
  return(speechText.cleansed)
}
produceTDM <- function(speechFolder, path){
  speechDirPath <- paste(path, speechFolder, sep = "/")
  # Use DirSource to create the corpus, with UTF-8 encoding
  speechCorpus <- Corpus(DirSource(directory = speechDirPath, encoding = "UTF-8"))
  speechCorpus.cleansed <- CleanSpeechText(speechCorpus)
  speech.tdm <- TermDocumentMatrix(speechCorpus.cleansed)
  speech.tdm <- removeSparseTerms(speech.tdm, 0.6)
  # Return the speaker name along with the tdm
  resultTdmList <- list(name = speechFolder, tdm = speech.tdm)
}
addSpeakerName <- function(individualTDM){
  speech.matrix <- t(data.matrix(individualTDM[["tdm"]]))
  speech.df <- as.data.frame(speech.matrix)
  speech.df <- cbind(speech.df, rep(individualTDM[["name"]], nrow(speech.df)))
  colnames(speech.df)[ncol(speech.df)] <- "SpeakerName"
  return(speech.df)
}

tdmList <- lapply(speechDir, produceTDM, path = pathToSpeeches)
speechDfList <- lapply(tdmList, addSpeakerName)
combinedSpeechDf <- do.call(rbind.fill , speechDfList)
Replace the NA values with 0:

combinedSpeechDf[is.na(combinedSpeechDf)] <- 0
train.idx <- sample(nrow(combinedSpeechDf) , ceiling(nrow(combinedSpeechDf)* 0.6))
test.idx <- (1:nrow(combinedSpeechDf))[-train.idx]
combinedSpeechDf.speakers <- combinedSpeechDf[,"SpeakerName"]
combinedSpeechDf.allAttr <- combinedSpeechDf[,!colnames(combinedSpeechDf) %in% "SpeakerName"]
combinedSpeechDf.train <- combinedSpeechDf.allAttr[train.idx, ]
combinedSpeechDf.test <- combinedSpeechDf.allAttr[test.idx, ]
combinedSpeechDf.trainOutcome <- combinedSpeechDf.speakers[train.idx]
combinedSpeechDf.testOutcome <- combinedSpeechDf.speakers[test.idx]
prediction <- knn(combinedSpeechDf.train, combinedSpeechDf.test, combinedSpeechDf.trainOutcome)
confusionMatrix <- confusionMatrix(prediction, combinedSpeechDf.testOutcome)
confusionMatrix

Confusion Matrix and Statistics

          Reference
Prediction romney obama
    romney      5     0
    obama       4     8

               Accuracy : 0.7647
                 95% CI : (0.501, 0.9319)
    No Information Rate : 0.5294