Document representation

The first step in the text classification process is to figure out how to represent documents in a manner that is suitable for classification tasks and learning algorithms. This step is intended to reduce the complexity of documents, making them easier to work with. While doing so, the following questions come to mind:

  • Do we need to preserve the order of words?
  • Is losing the information about the order a concern for us?

An attribute-value representation of documents implies that the order of words in a document is not of high significance: each unique word in a document is treated as a feature, and the frequency of its occurrence is the corresponding value. Further, discarding the features with a very low value or occurrence in the document can reduce the high dimensionality.

The vector space representation considers each document as a vector of word features. The attribute-value representation may take a Boolean, set-of-words form that captures the presence or absence of a word in a document, that is, whether a particular word exists in the document or not. Apart from this representation, tf*idf term weights, that is, term frequency multiplied by inverse document frequency, can also be estimated and used as feature values.

Vector space models have some limitations too. For example, the document representation is very high dimensional, and it may also lead to the loss of semantic relations or correlations between the different terms in the given documents. Analyzing text documents is tedious because the features are treated as nearly independent and the vocabulary is huge, sometimes even larger than the number of instances. Certain words occur many times, while others occur only a few times. How do we weight the common words compared to the rare words? Thus, document representation becomes an important part of this process.

Vector space models work on the idea that we can derive the theme or meaning of a document from the terms that constitute it. If we represent each document as a vector, each unique term in the document acts as a dimension in the space.
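
As a small illustration of these ideas, and of how tf*idf addresses the question of weighting common versus rare words, the following is a minimal sketch using the tm package; the three toy documents are made up purely for illustration.

    library(tm)

    # Three made-up documents purely for illustration
    docs <- c("the economy is growing",
              "healthcare and the economy",
              "taxes on healthcare")
    corpus <- Corpus(VectorSource(docs))

    # Boolean (set-of-words) representation: 1 if a term occurs in a document, else 0
    dtm.bool <- DocumentTermMatrix(corpus, control = list(weighting = weightBin))

    # tf*idf weighting: terms that appear in every document get low weights,
    # terms concentrated in few documents get higher weights
    dtm.tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

    inspect(dtm.tfidf)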

Feature hashing

Feature hashing is a very efficient way to store data and reduce memory footprints. It requires less preprocessing, and it is a very effective way of dealing with large datasets. The Document Term Matrix holds all the data in memory. When we apply a sparse function to this matrix, it retains only the non-zero values. If the Document Term Matrix has a lot of zeros, we can use this to create a sparse matrix and reduce the memory footprint, but the sparse matrix function scans the entire data space to know which non-zero elements it has to hold. Feature hashing is more efficient as it does not need a full scan; by holding references, or hash addresses, it can do the preprocessing at runtime, or on the fly. When we are dealing with text data, very large matrices of features are formed. In order to reduce these to the more relevant ones, we remove the sparse terms, or take the most popular terms and discard the rest. In these types of solutions we lose data, but with feature hashing, we don't lose any data.

Feature hashing is a very useful tool when the user does not know the dimension of the feature vector. When we are performing text analysis, most of the time we consider the document as a bag-of-words representation and don't give much importance to the sequence in which the words appear. In this kind of document classification problem, we have to scan the entire dataset to know how many words we have and find the dimension of the feature vector. In general, feature hashing is useful when we are working on streaming data or distributed data, because it is difficult to know the real dimension of the feature vector.

The hash size tells us how big the data space should be, that is, the number of columns in the Document Term Matrix. We need to keep in mind the memory requirements and the speed of algorithm execution while selecting the hash size, and we may choose this value empirically through trial and error. If it is too small, it may cause collisions (feature space collisions); if it is too big, the processing may become slow. There are text classification models which perform well with feature hashing.

glmnet and xgboost are a few of the R packages that support it.

Data types supported by feature hashing include:

  • Characters and factors
  • Numerics and integers
  • Arrays

Wush Wu developed the FeatureHashing package available on CRAN. The following is the definition from the documentation:

"Feature hashing, also called the hashing trick, is a method to transform features to vector. Without looking up the indices in an associative array, it applies a hash function to the features and uses their hash values as indices directly. The method of feature hashing in this package was proposed in Weinberger et. al. (2009). The hashing algorithm is the murmurhash3 from the digest package. Please see the README.md for more information."

Classifiers – inductive learning

Classification, or the supervised learning mechanism in machine learning, is the process of learning concepts or patterns that generalize the relationship between the dependent and independent variables, given labeled or annotated data. A typical text classification task can be defined as follows:

  • Task T: to classify opinions (positive or negative)
  • Performance measure P: percentage of opinions correctly classified
  • Training experience E: Annotated or labeled data to train the model on

Let's say, we have an opinion classification problem at hand. The training data contains texts as instances and opinions (positive or negative) as the outcome variable. The objective is to design a learning mechanism to be able to utilize the patterns in the training data to predict/label the outcome variable in an unknown or test dataset.
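
For concreteness, a toy version of such a training set might look like the following sketch; the sentences and labels are made up for illustration.

    # A hypothetical opinion training set: text instances with an outcome label
    train.opinions <- data.frame(
      text    = c("the product works great",
                  "terrible battery life",
                  "very happy with this purchase"),
      opinion = c("positive", "negative", "positive"),
      stringsAsFactors = FALSE
    )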

In order to learn the target concept or pattern, each instance t, along with the associated concept c(t) from the training set, is presented to the learner. The task assigned to the learning mechanism is to estimate the function c, such that the target concept stays generalized over most of the training instances and can be applied to unknown instances with high precision. The classifier creates a hypothesis for every training instance presented to it, in conjunction with the associated label for that instance. Once all the instances are observed, we have a large set of generated hypotheses, which are very specific in nature. From the inductive learning hypothesis principle, we know that if a hypothesis effectively approximates the target concept over a sufficiently large number of training instances, it will also approximate it effectively over an unknown set of instances. Such a hypothesis needs to be a generalized one: a specific hypothesis may approximate the concept over the observed instances, yet prove insufficient to approximate it over unobserved instances.

Let's take a simple example to explain the concept of generalized and specific hypotheses. Suppose we have some data, corresponding to information about rain on a given day. The attributes given to us are temperature, humidity, and wind speed.

Let's consider two hypotheses trying to approximate the target concept of learning about the possibility of rain on a given day:

  • Temperature = 30 degrees, humidity = 49%, wind speed = 8 km/hr: it will not rain
  • Temperature = 30 degrees, humidity = 49%, wind speed = ? (can take any value): it will not rain

The first hypothesis is very specific in nature and it is highly unlikely that it can be approximated over more than just a few instances, while the second hypothesis is comparatively less specific and more generalized than the first one and it is more likely to fit on a larger number of instances.

Once all the possible hypotheses are generated, the learning mechanism sorts them in general-to-specific ordering. As we have discussed, it is desired to have more generalized hypotheses than specific ones. Given the large number of hypotheses generated, how do we go about choosing the optimal hypotheses?

There are a few methods recommended for this issue (Tom M. Mitchell, 1997). The Candidate Elimination, Find-S, and List-Then-Eliminate algorithms are used to eliminate the redundant and highly specific hypotheses and learn a generalized hypothesis. The details of these algorithms are not in the scope of this book. The basic idea behind these algorithms is to search for hypotheses in the version space, which is a subset of the set created with all the possible hypotheses from the training data, and compare them to the other training instances to see how well they generalize over multiple instances. Subsequently, they keep eliminating the redundant hypotheses or keep merging specific ones to give a more generalized structure.
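
To give a flavor of how such an algorithm proceeds, here is a minimal sketch of the Find-S idea in R, using the rain example above; the training values and the helper function are hypothetical illustrations, not the book's code.

    # Made-up training data for the rain example (outcome: will it rain?)
    training <- data.frame(temperature = c(30, 30, 30),
                           humidity    = c(49, 49, 49),
                           wind.speed  = c(8, 12, 3),
                           rain        = c("no", "no", "no"),
                           stringsAsFactors = FALSE)

    # Find-S: start from the first positive example (the most specific hypothesis)
    # and generalize every attribute that disagrees with later positives to "?"
    find.s <- function(data, outcome, positive) {
      attrs <- setdiff(names(data), outcome)
      hypothesis <- NULL
      for (i in seq_len(nrow(data))) {
        if (data[i, outcome] != positive) next   # negative examples are ignored
        example <- as.list(data[i, attrs])
        if (is.null(hypothesis)) {
          hypothesis <- example
        } else {
          differs <- !mapply(identical, hypothesis, example)
          hypothesis[differs] <- "?"
        }
      }
      hypothesis
    }

    find.s(training, "rain", "no")
    # temperature = 30, humidity = 49, wind.speed = "?" -- a generalized hypothesis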

Tree-based learning

In machine learning, the decision tree is a well-known classification algorithm. In this classification methodology, we create a decision tree model; when we provide an input for prediction, it traverses the nodes based on the input variables and reaches a leaf node, which is nothing but the predicted class. When the target variable has a finite set of values it is called a classification tree, which we will discuss in this section. If the target variable takes continuous values it is called a regression tree.

Let us understand some basic terminologies used in the decision tree. The decision tree is constructed based on input variables. The number of input variables will alter the classification output; in technical terms, these input variables are also called attributes. In text mining, if we are classifying a set of documents into different topics, the significant words in the document become the attributes and the topics which are the resulting outcome become the classes.

The following is a simple diagram of a decision tree:

(Image: a simple decision tree)

The tree is made up of branches and nodes. Every branch from a node signifies an outcome based on one of the attributes. The nodes that do not have any branches are called leaf nodes; these nodes are at the end of the tree, and they represent the final classes of the classification problem. Decision trees are built by recursively splitting the training data so that each division tends towards one class. Every node that is not a leaf node has branches, and each branch is the outcome of one or more attributes, which further influences how the data will be divided. The split is made in such a way that the resulting subsets are as distinct, or pure, as possible; in ML terms, a node that contains instances of a single class is called a pure leaf node. When divisions are not pure, we can improve performance by increasing generalization, for example by pruning the tree.

Partitioning of data for splitting a node depends on the attributes used to split it. In order to construct the tree, we need to select the best splitting attribute. There are various algorithms used to build a decision tree. At a high level, these algorithms try to solve the following challenges in their own optimal way:

  1. Select the best attributes for splitting and determine the split values.
  2. Ascertain the number of splits at each node.
  3. Ascertain the order of the attribute that has to be considered for splitting. Can the attributes be considered only once or many times?
  4. Choose the pruning method for the tree: pre-pruning or post-pruning.
  5. Choose the growth and stopping criteria for the tree.

In order to perform an optimal split and evaluate the goodness of the split, there are various measures (a short sketch of two of them follows this list):

  • Gini index
  • Information gain
  • Entropy
  • Split info
  • Gain ratio
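
The following is a minimal sketch (not taken from the book's code) of how two of these measures, the Gini index and entropy, can be computed for the class labels that fall into a candidate node; the example labels are made up.

    # Impurity measures for a vector of class labels at a node
    gini.index <- function(labels) {
      p <- table(labels) / length(labels)
      1 - sum(p^2)
    }

    entropy <- function(labels) {
      p <- table(labels) / length(labels)
      -sum(p * log2(p))
    }

    node.labels <- c("obama", "obama", "obama", "romney")   # hypothetical node contents
    gini.index(node.labels)   # 0.375
    entropy(node.labels)      # ~0.811

The lower the impurity after a split, the better the splitting attribute; information gain and gain ratio are derived by comparing the impurity before and after the split.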

Some of the well-known algorithms used in building decision trees are:

  • ID3
  • C4.5
  • C5
  • CART
  • CHAID
  • SPRINT

There are various pros and cons of decision trees. The pros are:

  • They are easy to understand and visualize. The interpretability makes them a good choice in scenarios where simplicity of the model is preferred over accuracy.
  • They can be used in case of both categorical data and numerical data.
  • The performance of the model can be assessed using various statistical methods. Huge datasets can be analyzed in a reasonable amount of time since the algorithms are less complex and easy to compute.
  • Decision trees perform feature selections implicitly.
  • Non-linear relationships between variables do not cause issues in performance.

Some cons are that the decision trees generated may become very complex and do not generalize well. However, pruning techniques can be applied to resolve this.

In the following code, we detect whether a speech was given by Obama or Romney based on the input document. We will use the rpart library to build the decision tree:

rpart code for decision trees:

library(tm)
library(rpart)
  1. Load the files into the corpus:
    obamaCorpus <- Corpus(DirSource(directory = "D:/R/Chap 6/Speeches/obama" , encoding="UTF-8"))
    
    romneyCorpus <- Corpus(DirSource(directory = "D:/R/Chap 6/Speeches/romney" , encoding="UTF-8"))
  2. Merge both the corpus to one big corpus:
    fullCorpus <- c(obamaCorpus,romneyCorpus)#1-22 (obama), 23-44(romney)
  3. Do basic processing on the loaded corpus.
  4. Now we will perform basic cleansing on the data. That includes removing punctuation marks, stripping whitespaces, converting the text to lower case, and removing stop words from the text data:
    fullCorpus.cleansed <- tm_map(fullCorpus, removePunctuation)
    
    fullCorpus.cleansed <- tm_map(fullCorpus.cleansed, stripWhitespace)
    
    fullCorpus.cleansed <- tm_map(fullCorpus.cleansed, content_transformer(tolower))
    
    fullCorpus.cleansed <- tm_map(fullCorpus.cleansed, removeWords, stopwords("english"))
  5. Create the Document Term Matrix for analysis:
    full.dtm <- DocumentTermMatrix(fullCorpus.cleansed)
  6. Remove the sparse terms:
    full.dtm.spars <- removeSparseTerms(full.dtm , 0.6)
  7. Convert the DocumentTermMatrix to a data frame for easy manipulation:
    full.matrix <- data.matrix(full.dtm.spars)
    full.df <- as.data.frame(full.matrix)
  8. Add the speaker's name to the data frame (documents 1-22 are Obama's, 23-44 are Romney's):
    full.df[,"SpeakerName"] <- "obama"
    full.df$SpeakerName[23:44] <- "romney"
  9. Create the training and test indexes and the corresponding data frames:
    train.idx <- sample(nrow(full.df) , ceiling(nrow(full.df)* 0.6))
    test.idx <- (1:nrow(full.df))[-train.idx]
    
    full.df.train <- full.df[train.idx, ]
    full.df.test <- full.df[test.idx, ]
  10. Select the terms that occur at least 70 times in the corpus:
    freqterms70 <- findFreqTerms( full.dtm.spars, 70)
  11. Create the formulas for input to the rpart function:
    outcome <- "SpeakerName"
    formula_str <- paste(outcome, paste(freqterms70, collapse=" + "), sep=" ~ ")
    
    
    formula <- as.formula(formula_str)
    
    fit <- rpart(formula, method="class", data=full.df.train,control=rpart.control(minsplit=5, cp=0.001));
    
    print(fit) 
  12. Print the fitted tree. The output is as follows:
    n= 27 

     1) root 27 13 romney (0.4814815 0.5185185)  
       2) health>=2.5 8  0 obama (1.0000000 0.0000000) *
       3) health< 2.5 19  5 romney (0.2631579 0.7368421)  
         6) america< 3 3  0 obama (1.0000000 0.0000000) *
         7) america>=3 16  2 romney (0.1250000 0.8750000)  
          14) one>=10 2  0 obama (1.0000000 0.0000000) *
          15) one< 10 14  0 romney (0.0000000 1.0000000) *
  13. Display the cp table:
    printcp(fit)
  14. The output is as follows:
    Classification tree:
    rpart(formula = formula, data = full.df.train, method = "class", 
        control = rpart.control(minsplit = 5, cp = 0.001))

    Variables actually used in tree construction:
    [1] america health one

    Root node error: 13/27 = 0.48148

    n= 27 

           CP nsplit rel error  xerror    xstd
    1 0.61538      0   1.00000 1.53846 0.17516
    2 0.23077      1   0.38462 0.53846 0.17516
    3 0.15385      2   0.15385 0.76923 0.19302
    4 0.00100      3   0.00000 0.69231 0.18842
  15. Plot the cross-validation results:
    plotcp(fit)
  16. Let's plot the tree:
    par(mfrow = c(1,2), xpd = NA)
    plot(fit, uniform = TRUE)
    text(fit, use.n = TRUE)
(Image: the decision tree produced by rpart)

The preceding image depicts the decision tree we just created. At each node, the tree tests the count of a term in the document and branches accordingly, until it reaches a leaf that labels the document as an Obama speech or a Romney speech.
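
As a follow-up, the cp table printed by printcp() above can be used to post-prune the tree, as mentioned earlier in the pros and cons. The following is a minimal sketch, assuming the fit object from the previous steps; choosing the cp value with the lowest cross-validated error (xerror) is a common heuristic rather than a method prescribed in this section.

    # Pick the complexity parameter with the lowest cross-validated error
    best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

    # Prune the tree back to that complexity and plot the result
    fit.pruned <- prune(fit, cp = best.cp)
    plot(fit.pruned, uniform = TRUE)
    text(fit.pruned, use.n = TRUE)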

Bayesian classifiers: Naive Bayes classification

Naive Bayes classifiers are probabilistic classifiers built using the Bayes theorem. Naive Bayes is also known as a prior probability and class conditional probability classifier, since it combines the prior probability of each class with the class-conditional probabilities of the features to generate a posterior probability distribution over the classes.

This classifier makes the following assumptions about the data:

  • It assumes all the features in the dataset are independent of each other
  • It assumes all the features are important
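
Under these assumptions, the posterior probability of a class c given the features x1, ..., xn factorizes into the class prior multiplied by the per-feature class-conditional probabilities (stated here in its standard form rather than quoted from the book):

$$P(c \mid x_1, \ldots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

The classifier simply assigns the class with the highest posterior probability.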

Though these assumptions may not hold true in real-world scenarios, Naive Bayes is still used in many applications for text classification, such as:

  • Spam filtering for e-mail applications
  • Social media mining, such as finding the sentiments in a given text
  • Computer network security applications

This is because the classifier has its own strengths, such as:

  • Naive Bayes classifiers are highly scalable and need less computational cycles when compared with other advanced and sophisticated classifiers
  • A huge number of features can be taken into consideration
  • They work well even when there is missing data and the dimensionality of the inputs is high
  • They need only small amounts of training sets

Let's do a hands-on exercise on the Naive Bayes classifier.

E-mail is one of the most widely used applications for communication. The following is a screenshot of an inbox:

(Image: an e-mail inbox)

Our inbox is cluttered with a lot of e-mails every day; some of them are important, others are promotional e-mails, phishing e-mails, spam e-mails, and so on.

Let's build a spam classifier that can reduce the clutter by segregating the spam from the real important ones:

  1. We will set up the environment by loading all the required libraries.
  2. The dataset used in this code can be downloaded from http://www.aueb.gr/users/ion/data/enron-spam/
  3. The data is Enron-Spam in pre-processed form: Enron1
    install.packages("caret")
    require(caret) 
    
    install.packages("kernlab") 
    require(kernlab)
    
    install.packages("e1071")
    require(e1071)
    
    install.packages("tm")
    require(tm) 
    
    install.packages("plyr")
    require(plyr)
  4. Extract the Enron-Spam dataset to your local filesystem. In my case it's in the following directory:
    pathName <- "D:/R/Chap 6/enron1"
  5. The dataset has two sub-folders: the spam folder contains all the mails that are spam, and the ham folder contains all the mails that are legitimate:
    emailSubDir <- c("ham","spam")

Let's write a function that can build the Term Document Matrix from the text input.

  1. Build a Term Document Matrix. Here we are converting the text to a quantitative format for analysis:
    GenerateTDMForEMailCorpus <- function(subDir , path){
  2. Concatenate the variable, that is, the path and the sub-folder name to create the complete path to the mail corpus directory:
    #mailDir <- sprintf("%s/%s", path, subDir)
    
    mailDir <-paste(path, subDir, sep="/")
  3. Create a corpus using the preceding computed directory path. We will use DirSource since we are dealing with directories, with encoding UTF-8:
    mailCorpus <- Corpus(DirSource(directory = mailDir , encoding="UTF-8"))
  4. Create the Term Document Matrix:
    mail.tdm <- TermDocumentMatrix(mailCorpus)
  5. Remove sparse terms from TDM for better analysis:
    mail.tdm <- removeSparseTerms(mail.tdm,0.7)
  6. Return the result: a list containing the folder name and its TDM:
    result <- list(name = subDir , tdm = mail.tdm)
    }

Let's write a function that can convert the Term Document Matrix to a data frame.

  1. We will convert the TDM to a data frame and append the type of mail if it's a spam or ham in the data frame:
    BindMailTypeToTDM <- function(individualTDM){
  2. Create a numeric matrix, get its transpose so that the column contains the words and the row contains the number of word occurrences in the mail:
    mailMatrix <- t(data.matrix(individualTDM[["tdm"]]))
  3. Convert this matrix into a data frame since it's easy to work with data frames:
    mailDataFrame <- as.data.frame(mailMatrix , stringsAsFactors = FALSE)
  4. Add the type of mail to each row in the data frame:
    mailDataFrame <- cbind(mailDataFrame , rep(individualTDM[["name"]] , nrow(mailDataFrame)))
  5. Give a proper name to the last column of the data frame:
    colnames(mailDataFrame)[ncol(mailDataFrame)] <- "MailType"
    return (mailDataFrame)
    
    }
    
    tdmList <- lapply(emailSubDir , GenerateTDMForEMailCorpus , path = pathName)
    
    mailDataFrame <- lapply(tdmList, BindMailTypeToTDM)
  6. Join both the data frames for spam and ham:
    allMailDataFrame <- do.call(rbind.fill , mailDataFrame)
  7. Fill the empty columns with 0:
    allMailDataFrame[is.na(allMailDataFrame)] <- 0
  8. Reorder the column for readability:
    allMailDataFrame_ordered <- allMailDataFrame[ ,c(1:18,20:23,19)]
  9. Prepare a training set. We are getting about 60% of the rows to train the model:
    train.idx <- sample(nrow(allMailDataFrame_ordered) , ceiling(nrow(allMailDataFrame_ordered)* 0.6))
  10. Prepare the test set, with the remaining rows that are not part of training sample:
    test.idx <- (1:nrow(allMailDataFrame_ordered))[-train.idx]
    
    allMailDataFrame.train <- allMailDataFrame_ordered[train.idx,]
    
    allMailDataFrame.test <- allMailDataFrame_ordered[test.idx,]
    
    trainedModel <- naiveBayes(allMailDataFrame.train[, 1:22], as.factor(allMailDataFrame.train[, 23]))
    
    prediction <- predict(trainedModel, allMailDataFrame.test)
    
    confusionMatrix <- confusionMatrix(prediction, as.factor(allMailDataFrame.test[, 23]))
    
    confusionMatrix
  11. The output is as follows:
    Confusion Matrix and Statistics
              Reference
    Prediction ham spam
          ham  855    1
          spam 626  586
                                              
                   Accuracy : 0.6968          
                     95% CI : (0.6765, 0.7166)
        No Information Rate : 0.7162          
        P-Value [Acc > NIR] : 0.9753          
    
  12. Plot the confusion matrix using the table from the confusionMatrix object:
    ctable <- confusionMatrix$table
    
    
    fourfoldplot(ctable, color = c("#CC6666", "#99CC99"), conf.level = 0, margin = 1, main = "Confusion Matrix")

Here is the depiction of the Confusion Matrix:

(Image: fourfold plot of the confusion matrix)

K-Nearest neighbors

The K-Nearest neighbors algorithm (k-NN) works on the principle of distance functions for a given pair of points. It is very easy to implement and is non-parametric in nature. In the K-Nearest neighbor classifier, k is an integer greater than zero. This is a simple classification technique used to find the k nearest data points in a dataset to a given data point. The biggest challenge with this classifier is to find the optimal value for k, which depends on the data. k-NN uses all the features for computing the distance, and because of this the complexity of searching for the nearest neighbors increases, which is one of its major drawbacks, since all the attributes or features in the dataset may not be very significant. Thus, providing certain weights to them based on significance may increase the classifier's accuracy.

As the size of the training set grows, the error rate of the k-NN classifier approaches the Bayes error rate, where:

  • N is the set of training points
  • M is the number of classes
  • P* is the Bayes error rate

In the case where k = 1, it is called the nearest neighbor classifier. Its asymptotic error rate P satisfies the Cover and Hart bound:

$$P^* \le P \le P^*\left(2 - \frac{M\,P^*}{M-1}\right)$$

Let x = (x1, x2, ..., xn) be the point for which we want a prediction. Given a training point a = (a1, a2, ..., an), we measure how close a is to x, and we identify the k observations in the training dataset that are closest to x. Neighbors are defined by a distance that we calculate between observations based on the independent variables. There are various ways we can calculate the distance between points; one of them is the Euclidean distance.

The Euclidean distance between the points x = (x1, x2, ..., xn) and a = (a1, a2, ..., an) is defined as:

$$d(x, a) = \sqrt{\sum_{i=1}^{n} (x_i - a_i)^2}$$

For each n-dimensional object, the Euclidean distances between the specified object and all the training data objects are calculated, and the specified object is assigned the class label that the majority of its k closest training objects carry. The curse of dimensionality and large feature sets are a problem for k-NN.
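
To make the distance computation concrete, here is a minimal hand-rolled sketch that classifies a query point by Euclidean distance and majority vote; the training points, labels, and query point are made up for illustration.

    # Made-up training points (rows) with two attributes, and their class labels
    train.points <- matrix(c(1, 2,
                             2, 1,
                             8, 9,
                             9, 8), ncol = 2, byrow = TRUE)
    train.labels <- c("a", "a", "b", "b")

    query <- c(1.5, 1.5)
    k <- 3

    # Euclidean distance from the query point to every training point
    dists <- sqrt(rowSums((train.points -
                           matrix(query, nrow(train.points), 2, byrow = TRUE))^2))

    # Assign the majority label among the k nearest training points
    nearest <- order(dists)[1:k]
    names(which.max(table(train.labels[nearest])))   # "a"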

Let us classify the speeches of US presidential candidates using k-nn:

  1. Load library using the following code:
    install.packages("class")
    require(class)
    
    install.packages("tm")
    require(tm) 
    
    install.packages("plyr")
    require(plyr) 
    
    install.packages("caret")
    require(caret) 
  2. To make sure strings are not converted to nominal or categorical variables, we set the following option:
    options(stringsAsFactors = FALSE)
  3. Extract the Speech dataset to your local filesystem. In my case it's in the following directory:
    pathToSpeeches <- "D:/R/Chap 6/Speeches"
  4. The dataset has two sub-folders: the obama folder contains all the speeches from Obama, and the romney folder contains all the speeches from Romney:
    speechDir <- c("romney","obama")
  5. Data cleaning is an essential step when we do analysis on text data, such as removing numbers, stripping whitespaces, removing punctuation, removing stop words, and, changing to lowercase:
    CleanSpeechText <- function(speechText){
  6. We will remove all punctuation characters from the text:
    speechText.cleansed <- tm_map(speechText, removePunctuation)
  7. We will strip extra whitespace from the text:
    speechText.cleansed <- tm_map(speechText.cleansed, stripWhitespace)
  8. We will convert all the words to lowercase:
    speechText.cleansed <- tm_map(speechText.cleansed, content_transformer(tolower))
  9. We will remove all stop words related to English:
    speechText.cleansed <- tm_map(speechText.cleansed, removeWords, 
    stopwords("english"))
  10. Return the cleansed text:
    return (speechText.cleansed)
    }
  11. We will build a term document matrix. Here we are converting the text to quantitative format for analysis:
    produceTDM <- function(speechFolder,path){
  12. Concatenate the strings to get the full path to the respective speeches:
    speechDirPath <-paste(path, speechFolder, sep="/")
  13. Since it's a directory use DirSource to create the corpus:
    speechCorpus <- Corpus(DirSource(directory = speechDirPath , encoding="UTF-8"))
  14. Clean this corpus to remove unwanted noise in the text to make our analysis better:
    speechCorpus.cleansed <- CleanSpeechText(speechCorpus)
  15. Build the term document matrix for this cleansed corpus:
    speech.tdm <- TermDocumentMatrix(speechCorpus.cleansed)
  16. Remove the sparse terms to improve the prediction:
    speech.tdm <- removeSparseTerms(speech.tdm,0.6)
  17. Returns the result of both the speeches as a list of tdm:
    resultTdmList <- list(name = speechFolder , tdm = speech.tdm)
    }
  18. We will add the speaker's name to the TDM for training and testing:
    addSpeakerName <- function(individualTDM){
  19. Create a numeric matrix, get its transpose so that the column contains the words and the row contains the number of word occurrences in the speech:
    speech.matrix <- t(data.matrix(individualTDM[["tdm"]]))
  20. Convert this matrix into a data frame since it's easy to work with data frames:
    speech.df <- as.data.frame(speech.matrix)
  21. Add the speaker's name to each row in the data frame:
    speech.df <- cbind(speech.df , rep(individualTDM[["name"]] , nrow(speech.df)))
  22. Give a proper name to the last column of the data frame:
    colnames(speech.df)[ncol(speech.df)] <- "SpeakerName"
    
    return (speech.df)
    
    }
    
    tdmList <- lapply(speechDir , produceTDM , path = pathToSpeeches)
    speechDfList <- lapply(tdmList, addSpeakerName)
  23. Join both the data frames for Obama and Romney:
    combinedSpeechDf <- do.call(rbind.fill , speechDfList)
  24. Fill the empty columns with 0:
    combinedSpeechDf[is.na(combinedSpeechDf)] <- 0
  25. Prepare a training set. We are getting about 60% of the rows to train the model:
    train.idx <- sample(nrow(combinedSpeechDf) , ceiling(nrow(combinedSpeechDf)* 0.6))
  26. Prepare the test set with the remaining rows that are not part of training sample:
    test.idx <- (1:nrow(combinedSpeechDf))[-train.idx]
  27. Let's create a vector that holds the speaker names (the outcome variable) for all documents:
    combinedSpeechDf.speakers <- combinedSpeechDf[,"SpeakerName"]
  28. Let's create a data frame that contains all the attributes except the speaker name:
    combinedSpeechDf.allAttr <- combinedSpeechDf[,!colnames(combinedSpeechDf) %in% "SpeakerName"]
  29. Let's use the preceding training set and test set to create inputs to our classifier:
    combinedSpeechDf.train <- combinedSpeechDf.allAttr[train.idx,]
    
    combinedSpeechDf.test <- combinedSpeechDf.allAttr[test.idx,]
    
    
    
    combinedSpeechDf.trainOutcome <- combinedSpeechDf.speakers[train.idx]
    
    combinedSpeechDf.testOutcome <- combinedSpeechDf.speakers[test.idx]
    
    prediction <- knn(combinedSpeechDf.train ,combinedSpeechDf.test ,combinedSpeechDf.trainOutcome)
  30. Let's check out the Confusion matrix:
    confusionMatrix <- confusionMatrix(prediction, combinedSpeechDf.testOutcome)
    
    confusionMatrix
    
    The output is as follows:
    Confusion Matrix and Statistics
              Reference
    Prediction romney obama
        romney      5     0
        obama       4     8
                                             
                   Accuracy : 0.7647         
                     95% CI : (0.501, 0.9319)
        No Information Rate : 0.5294   