Kernel methods

Kernel methods exploit the similarity between documents, measured by attributes such as length, topic, and language, to extract patterns from them. Inner products between data items can reveal a lot of latent information; in fact, many standard algorithms can be expressed purely in terms of inner products between data items in a potentially complex feature space. Kernel methods suit high-dimensional data because their complexity depends on the choice of kernel rather than on the number of features in use. Kernels sidestep the computational cost by implicitly transforming the data into a richer, non-linear feature space and then applying a linear classifier to the transformed data, as shown in the following diagram:

(Figure: data mapped into a richer feature space where a linear classifier can separate the classes)

Some of the kernel methods available are:

  • Linear kernel
  • Polynomial kernel
  • Radial basis function (RBF) kernel
  • Sigmoid kernel

Support vector machines

Support vector machines (SVM) are a kernel method for classification that has gained a lot of traction because of its ability to work well on high-dimensional data. Unlike other classifiers, its complexity does not depend on the number of features, but on the margin with which it separates the instances into different classes. Thus, an increase in the number of dimensions does not blow up the computational cost.

SVM is a non-probabilistic classifier that, in its basic configuration, learns a linear threshold function. With suitable kernels, it can also learn non-linear decision functions such as radial basis function (RBF) networks, polynomial functions, and sigmoid neural networks.

SVM attempts to find a linear hyperplane that separates the data. It is also called a large-margin classifier, since it tries to find the hyperplane with the largest margin. Any linear learner would otherwise have to estimate an explicit mapping in order to learn a non-linear function or handle a dataset with a non-linear decision boundary, as shown in the following diagram:

(Figure: a hyperplane separating two classes with the maximum margin)

How do we choose the plane that is the best linear separator? We join the points belonging to each class separately, so that we obtain the most compact convex hull on each side. Once we have the two convex hulls, we draw the line segment connecting their closest points, which gives the minimum margin between the classes. The perpendicular bisector of this segment is the optimal hyperplane, the best linear separator. How do we find the closest points in the two convex hulls? This is done using an optimization algorithm, as shown in the following diagram:

(Figure: convex hulls of the two classes and the line joining their closest points)

If A and B are the two convex hulls, and the supporting hyperplane touching convex hull A is x'w = k while the one touching convex hull B is x'w = n, then the linear separator is x'w = (k + n)/2.

The optimization problem to find the closest points in the two convex hulls is:

minimize over u, v:   (1/2) * || A'u − B'v ||^2
subject to:           e'u = 1,  e'v = 1,  u >= 0,  v >= 0   (e is a vector of ones)

where c = A'u and d = B'v are the closest points in the two convex hulls.

Once the classifier is trained, the predicted label for an unseen input x' is:

f(x') = sign( sum over i of ( alpha_i * y_i * K(x_i, x') ) + b )

where f(x') is the predicted label for the unseen input x', K(x_i, x') is the kernel function that estimates the similarity between the input data points, the alpha_i are the weights, and the sign function determines whether the predicted outcome is negative or positive.
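
To make the decision function concrete, the following minimal sketch uses the kernlab package (introduced later in this section) on a toy dataset; the data and parameter values are illustrative assumptions, not part of the speech example:

library(kernlab)

# Toy two-class data: 40 points in 2 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
y <- factor(rep(c("neg", "pos"), each = 20))

# Train a large-margin classifier with an RBF kernel
model <- ksvm(x, y, kernel = "rbfdot", C = 1)

# The fitted object exposes the pieces of the decision function:
# alpha(model) holds the weights, b(model) the offset term, and
# predict() evaluates the sign of the kernel expansion for new inputs
alpha(model)
b(model)
predict(model, matrix(c(0.5, 0.5), ncol = 2))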

Kernel Trick

If we look at the following diagram, we can clearly see that the data in the first figure is not linearly separable. But if we map this data to a three-dimensional space, we can now separate it linearly, using a hyperplane as the decision boundary. By transforming the instance space with a non-linear mapping, support vector machines are able to implement non-linear class boundaries. This phenomenon is called the Kernel Trick. A representation follows:

(Figure: data that is not linearly separable in two dimensions becomes separable by a hyperplane after mapping to three dimensions)

The Kernel Trick helps in using the dot product based methods in a possibly infinite dimensional feature space, without actually performing the computationally expensive process of projecting the features into such a high dimensional space explicitly:

(Figure: implicit mapping of the input space into a high-dimensional feature space)

The preceding graph shows a case where a linear function cannot draw a separating hyperplane:

(Figure: a non-linear decision boundary obtained through the kernel trick)
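
The effect can be checked numerically. In the following sketch (the mapping and the sample points are illustrative), the polynomial kernel (x'z)^2 evaluated on two-dimensional inputs equals an ordinary dot product taken after mapping each point to the three features (x1^2, sqrt(2)*x1*x2, x2^2), so the higher-dimensional projection never has to be formed explicitly:

# Two points in the original two-dimensional space
x <- c(1, 2)
z <- c(3, 4)

# Kernel evaluation in the original space: (x . z)^2
kernel_value <- (sum(x * z))^2

# Explicit mapping to three dimensions followed by a plain dot product
phi <- function(v) c(v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)
explicit_value <- sum(phi(x) * phi(z))

kernel_value    # 121
explicit_value  # 121 as well; the kernel gives the same result without forming phi()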

We can find the best kernel function for our domain by running the classifier with each of the kernel functions and analyzing the results. In the last two sections, we showed you the basic steps of creating the DTM from text files. We will now use the speech TDM to run our SVM classifier and predict who gave a speech:

library(e1071)

# Train an SVM (from the e1071 package) on the training speeches and their labels
sv <- svm(combinedSpeechDf.train, combinedSpeechDf.trainOutcome)
# Predict the speaker for the held-out test documents
pred <- predict(sv, combinedSpeechDf.test)
table(pred, combinedSpeechDf.testOutcome)

  pred     romney obama
  romney     10     0
  obama       0     7

R has a package kernlab that supports various kernel methods that can be applied to SVM.

kernlab provides implementations of the most commonly used kernel functions, such as:

  • rbfdot: Radial Basis kernel "Gaussian"
  • polydot: Polynomial kernel
  • vanilladot: Linear kernel
  • tanhdot: Hyperbolic tangent kernel
  • laplacedot: Laplacian kernel
  • besseldot: Bessel kernel
  • anovadot: ANOVA RBF kernel
  • splinedot: Spline kernel
  • stringdot: String kernel

For more details, visit the kernlab documentation at http://www.inside-r.org/node/63499

The kernlab package in R ships with these kernels, and users can also write their own kernel functions.

There are four basic kernel functions:

  • Linear: K(xi, xj) = xi^T xj
  • Polynomial: K(xi, xj) = (γ xi^T xj + r)^d, γ > 0
  • Radial basis function (RBF): K(xi, xj) = exp(−γ ||xi − xj||^2), γ > 0
  • Sigmoid: K(xi, xj) = tanh(γ xi^T xj + r)

Here, γ, r, and d are kernel parameters.
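
In kernlab, these kernels are available as constructor functions that return kernel objects; a brief sketch follows (the parameter values are arbitrary choices for illustration):

library(kernlab)

rbf    <- rbfdot(sigma = 0.1)              # Radial basis (Gaussian) kernel
poly   <- polydot(degree = 2)              # Polynomial kernel
lin    <- vanilladot()                     # Linear kernel
tanh_k <- tanhdot(scale = 1, offset = 1)   # Hyperbolic tangent kernel

# A kernel object can be evaluated directly on two vectors
rbf(c(1, 2), c(2, 3))

# or passed to ksvm() when training (x and y are assumed to exist)
# model <- ksvm(x, y, kernel = poly, C = 1)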

How do we apply SVM to a real-world example?

  • Perform the data transformations required by the SVM package:
    • Convert categorical variables into numeric data by using binary dummies.
    • For example, if there is a variable gender, it has two possible values for this categorical attribute. We can create two binary dummies, gender_male and gender_female, replacing the variable gender. For every instance where the gender was male, we set gender_male to 1 and gender_female to 0 (a short sketch follows after this procedure).
  • Scale the data appropriately:
    • It is important to scale or normalize the data before applying SVM to avoid numerical difficulties.
    • Without scaling, features with a much larger variance can dominate the others and distort the results.
  • Select the kernel function.

    You can start with any kernel function and estimate the best parameters for it, test the performance, and then tune further or try a different kernel function.

    Normally, the radial basis function (RBF) kernel is the first choice. It performs reasonably well when the decision boundary is non-linear, because it can effectively map the samples into a high-dimensional space, which a linear classifier cannot do.

    Also, the number of hyperparameters we need to optimize for RBF is smaller than for the polynomial kernels. For RBF, we only need to choose the penalty parameter C (> 0) and the kernel width γ (> 0).

It is also important to note that when the number of dimensions is very high compared to the number of instances available for training, a linear kernel tends to work better than RBF.
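
The following is a minimal sketch of the preparation steps listed above, using the svm() function from the e1071 package; the data frame and its columns are hypothetical:

library(e1071)

# Hypothetical data with a categorical gender column and a numeric feature
df <- data.frame(gender = c("male", "female", "female", "male"),
                 amount = c(120, 15000, 30, 7800),
                 label  = factor(c("yes", "no", "no", "yes")))

# Step 1: replace the categorical variable with binary dummies
df$gender_male   <- as.numeric(df$gender == "male")
df$gender_female <- as.numeric(df$gender == "female")

# Step 2: scale the features so that no single variable dominates
features <- scale(df[, c("gender_male", "gender_female", "amount")])

# Step 3: choose a kernel; RBF is a common first choice,
# a linear kernel is preferable for very wide, sparse data
model <- svm(features, df$label, kernel = "radial", cost = 1, gamma = 0.5)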

Note

When to apply a linear kernel?

When the number of dimensions is significantly larger than the number of instances: in such a case, an explicit mapping to a high-dimensional space does not add any value, so even a linear kernel performs well.

When the dimensions as well as the instances are large in number.

When the number of instances is significantly larger than the number of dimensions.

Maximum entropy classifier

Maximum entropy classifiers belong to the exponential family of models. They classify instances based on the least biased estimate available from the given information or the applicable constraints. For example, consider a three-way classification task where the prior information is that, on average, 50% of the documents containing the word equity belong to the class investment. Based on this information, for any document in which we find the word equity, we assume there is a 50% probability of it being classified as investment. What about the documents where we do not find the word equity? In that case, we assume a uniform class distribution of 33.33% each. Such a model, which complies with the constraints, is known as a maximum entropy model. Maximum entropy models provide us with the least biased estimate that complies with the constraints on the conditional distribution, and remain maximally noncommittal towards the missing information.

In the simplest terms, when we use statistical models to categorize instances associated with an unknown event, we should categorize them with the model whose entropy estimate is maximal. The principle of maximum entropy says that, of all the models that fit our training set, the one with the maximum entropy should be chosen. The core idea is to learn the probability distribution from the given dataset without assuming any prior distribution other than the observed one, and to select the distribution with the maximum entropy subject to the implied constraints. Maximum entropy implies the fewest assumptions and the most uniform distribution. Maxent models differ from Naïve Bayes classifiers in their basic assumption about feature independence: Naïve Bayes assumes the features are conditionally independent of each other, while Maxent does not. Dropping this assumption makes Maxent classifiers suitable for scenarios where there is little information about the prior probability distribution, or where it is considered unsafe to assume conditional independence of the attributes.

The objective is to utilize context predicates, that is, information such as unigrams, bigrams, and other text characteristics, to build a stochastic model that assigns a class to each context or instance. Assume the training data is T = {(t1, c1), (t2, c2), ..., (tN, cN)}, where t1, t2, t3, ... are the contexts and ci is the class assigned to the respective context. The training data is a set of context predicates, each represented by a vector of words. We would like to estimate a probability distribution over the classes for a context. Each context must be assigned to one of the classes, so P(c1) + P(c2) + P(c3) + ... + P(cN) = 1.

As discussed at the onset of this section, if no prior information is provided, we assume a uniform distribution, the one that makes the fewest assumptions: P(c1) = P(c2) = P(c3) = ... = P(cN) = 1/N.

Let's add some prior information to the scenario and observe how the distribution changes accordingly. Suppose we are told that if the word equity is present in a context, there is a 50% probability of that context being classified as c1. How does this information affect the distribution we just came up with?

P(c1 | equity) = 0.50

P(c2 | equity) = P(c3 | equity) = ... = P(cN | equity) = (1 − 0.50)/(N − 1)
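
A small numeric illustration of how the constraint reshapes the distribution (the three class labels are assumptions made for this example):

# Three candidate classes, matching the three-way task described earlier
classes <- c("investment", "sports", "politics")
N <- length(classes)

# No prior information: uniform, least-biased distribution
p_no_info <- rep(1 / N, N)                                  # 0.333 each

# Constraint: P(investment | "equity" present) = 0.50, with the remaining
# probability mass spread uniformly over the other N - 1 classes
p_with_equity <- c(0.50, rep((1 - 0.50) / (N - 1), N - 1))  # 0.50, 0.25, 0.25

rbind(no_info = p_no_info, with_equity = p_with_equity)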

Maxent implementation in R

The maxent package provides a low-memory implementation of multinomial logistic regression, also known as the maximum entropy model. The package leverages a C++ library for a memory-efficient implementation of the maximum entropy algorithm, which can otherwise consume a lot of memory when the corpus is large. The parameter estimation process is streamlined, and the reduced number of parameters helps minimize memory consumption. L-BFGS, OWL-QN, and stochastic gradient descent are the optimization techniques used for parameter estimation.

library(maxent)
data <- read.csv(system.file("data/USCongress.csv.gz",package = "maxent"))

We will use the tm package to build the corpus from the data that we loaded, then convert it to a TermDocumentMatrix or DocumentTermMatrix. We will use the as.compressed.matrix() function from the maxent package to convert the term-document matrix or document-term matrix into the compressed matrix format, matrix.csr.

library(tm)
corpus <- Corpus(VectorSource(data$text))
dtm <- TermDocumentMatrix(corpus,
                             control=list(weighting = weightTfIdf,
                                          language = "english",
                                          tolower = TRUE,
                                          stopwords = TRUE,
                                          removeNumbers = TRUE,
                                          removePunctuation = TRUE,
                                          stripWhitespace = TRUE))
# This step is important: maxent does not accept tdm or dtm objects directly, so we convert the term-document matrix to a compressed matrix

matrix_sp <- as.compressed.matrix(dtm)

Now, we will train our maxent model on the training data, specifying the independent and dependent variable.

# Not run: the maxent() signature with its default arguments
maxent(feature_matrix, code_vector, l1_regularizer = 0, l2_regularizer = 0, use_sgd = FALSE, set_heldout = 0, verbose = FALSE)

Tip

If the training sample is huge, it is best to set the use_sgd argument to TRUE so that stochastic gradient descent is used.

l1_regularizer and l2_regularizer are set to 0 by default. In the event of overfitting, the l1_regularizer, l2_regularizer, and set_heldout parameters can be tuned to overcome it.

The number of iterations for SGD is set at 30 by default, and the learning rate alpha is set at 0.85 by default. L1 and L2 regularization cannot be used together, thus l1_regularizer and l2_regularizer should not be set together.

Stochastic gradient descent does not support L2 regularization, thus if use_sgd is set to be TRUE, the l2_regularization parameter should be left as the default value of 0.

max_model <- maxent(matrix_sp[,1:2000],data$major[1:2000],use_sgd = TRUE,
                set_heldout = 200)

We can also save the model, to save ourselves from training again, and be able to directly load the saved model and use it for predictions:

save.model(max_model, "Model")
max_model <- load.model("Model")

We will use the trained model to predict on the test data:

results <- predict(max_model, matrix_sp[,2001:2400])

The maxent package provides a function, tune.maxent, to tune the maxent model. The parameters that are varied are l1_regularizer, l2_regularizer, use_sgd, and set_heldout. l1_regularizer and l2_regularizer are varied between 0 and 1 in steps of 0.2. set_heldout is the number of samples held out for validation. K-fold cross-validation is used to validate the model and avoid overfitting.

model_tune <- tune.maxent(matrix_sp[,1:5000], data$major[1:5000], nfold=3, showall=TRUE)


model_tune
      l1_regularizer l2_regularizer use_sgd set_heldout  accuracy pct_best_fit
 [1,]            0.0            0.0       0           0 0.7215367    0.9460567
 [2,]            0.2            0.0       0           0 0.7416078    0.9723734
 [3,]            0.4            0.0       0           0 0.7412365    0.9718866
 [4,]            0.6            0.0       0           0 0.7364983    0.9656740
 [5,]            0.8            0.0       0           0 0.7291518    0.9560415
 [6,]            1.0            0.0       0           0 0.7211886    0.9456004
 [7,]            0.0            0.0       0         742 0.7215367    0.9460567
 [8,]            0.0            0.2       0           0 0.7626780    1.0000000
 [9,]            0.0            0.4       0           0 0.7540851    0.9887333
[10,]            0.0            0.6       0           0 0.7479785    0.9807265
[11,]            0.0            0.8       0           0 0.7407543    0.9712542
[12,]            0.0            1.0       0           0 0.7371181    0.9664866
[13,]            0.0            0.0       1           0 0.7598019    0.9962289
[14,]            0.2            0.0       1           0 0.7416078    0.9723734
[15,]            0.4            0.0       1           0 0.7412365    0.9718866
[16,]            0.6            0.0       1           0 0.7364983    0.9656740
[17,]            0.8            0.0       1           0 0.7291518    0.9560415
[18,]            1.0            0.0       1           0 0.7211886    0.9456004

optimal_model <- maxent(matrix_sp[,1:2000],data$major[1:2000],l2_regularizer= 0.2, use_sgd = FALSE)


results <- predict(optimal_model, matrix_sp[,2001:2400]) 

RTextTools: a text classification framework

Until now, we have seen how to run individual classifiers to classify text data. Various R packages support numerous classification methods. To begin analyzing data with several classifiers at once, you can use a powerful yet simple R package called RTextTools. This package supports the most widely used classifiers and also provides a tuning mechanism so that expert users can experiment with different algorithm settings. RTextTools builds on a variety of existing R packages for text pre-processing and machine learning.

The following are the basic steps to run various classifiers using RTextTools:

  1. Load the data files. They can be CSV, Excel, and so on.
  2. Create a matrix object. This is an object of the class, DocumentTermMatrix. We use the create_matrix() method to get this object. Various pre-processing actions can be applied in this method such as removeNumbers, removePunctuation, removeSparseTerms, removeStopwords, stemWords, stripWhitespace, toLower, and weighting=weightTf.
  3. Create a container object. This object contains train and test sets of matrices which will be used as inputs to the machine learning algorithms with the labels. This object will be used in the subsequent steps of analyzing data. For this we use the create_container() function.
  4. Train the models. We train specific models using the container and a specific algorithm or list of supported algorithms as inputs. RTextTools provides two convenient methods for this purpose: train_model() and train_models(). The former method models only one algorithm at a time, whereas the latter method models a list of algorithms. To get a list of algorithms supported, use the print_algorithms() function.
  5. Classify the data. In this step, we use the trained model to classify the data in the test sets. For this we use a function, classify_model() or classify_models(), based on the number of algorithms we have used to train our models.
  6. Find the analytics. This is one of the most important steps. Here we interpret the results, and users can get information broken down by label, by algorithm, and by document, along with an ensemble summary. For this we have the create_analytics() function. The amount of information provided by this method depends on the virgin flag set on the container: if virgin is FALSE, the test set has known labels and performance can be measured against them; if it is TRUE, the test set is treated as unclassified (virgin) data. The summary provides a complete view of each algorithm's performance for each unique label in the classified data, including precision, recall, F-scores, and the accuracy of each algorithm's results compared to the actual data.

We can also examine other important measures such as algorithm accuracy and ensemble agreement, that is, how often the predictions are the same across different algorithms; the create_ensembleSummary() function serves this purpose. We can also use n-fold cross-validation to estimate the accuracy of each algorithm and export the labeled data.
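
RTextTools exposes this through the cross_validate() function. A brief sketch follows, assuming a container object such as the speech_container built in the example below:

# Estimate SVM accuracy with 3-fold cross-validation on a container
svm_cv <- cross_validate(speech_container, nfold = 3, algorithm = "SVM")
svm_cv   # accuracy estimates across the folds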

To get hands-on with this package, let us take our previous example of the speech datasets. For simplicity, we read the text files into a corpus, convert the corpus into a data frame, and add the labels as the last column of the data frame.

It is important to note that, when you are doing text analysis, if your categories/labels are in text format they should be converted to numeric; otherwise the create_analytics() call will fail.

# Load the tm and RTextTools packages, then the Obama speeches:
library(tm)
library(RTextTools)
obamaCorpus <- Corpus(DirSource(directory = "D:/R/Chap 6/Speeches/obama" , encoding="UTF-8"))

obamaDataFrame<-data.frame(text=unlist(sapply(obamaCorpus, `[`, "content")),stringsAsFactors=F)

obama.df <- cbind(obamaDataFrame , rep("obama" , nrow(obamaDataFrame)))
colnames(obama.df)[ncol(obama.df)] <- "name"

# Load the Romney speeches:
romneyCorpus <- Corpus(DirSource(directory = "D:/R/Chap 6/Speeches/romney" , encoding="UTF-8"))

romneyDataFrame<-data.frame(text=unlist(sapply(romneyCorpus, `[`, "content")),stringsAsFactors=F)

romney.df <- cbind(romneyDataFrame , rep("romney" , nrow(romneyDataFrame)))
colnames(romney.df)[ncol(romney.df)] <- "name"

# Combine both the speeches into one big data frame:
speech.df <- rbind(obama.df, romney.df)

speech_matrix <- create_matrix(speech.df["text"], language="english", weighting=tm::weightTfIdf)

speech_container <- create_container(speech_matrix,as.numeric(factor(speech.df$name)),trainSize=1:2000, testSize=2001:3857, virgin=FALSE)
speech_model <- train_model(speech_container,"SVM")

speech_results <- classify_model(speech_container,speech_model)

speech_analytics <- create_analytics(speech_container, speech_results)

speech_score_summary <- create_scoreSummary(speech_container, speech_results)

summary(speech_results)
SVM_LABEL    SVM_PROB     
 1:1554    Min.   :0.5000  
 2: 303    1st Qu.:0.7556  
           Median :0.8715  
           Mean   :0.8118  
           3rd Qu.:0.8715  
           Max.   :1.0000  
  summary(speech_score_summary)
SVM_LABEL   BEST_LABEL    BEST_PROB   NUM_AGREE
 1:1554    Min.   :1.000   1:1554    Min.   :1  
 2: 303    1st Qu.:1.000   2: 303    1st Qu.:1  
           Median :1.000             Median :1  
           Mean   :1.163             Mean   :1  
           3rd Qu.:1.000             3rd Qu.:1  
           Max.   :2.000             Max.   :1  
summary(speech_analytics)
ENSEMBLE SUMMARY
       n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
n >= 1                   1              0.16
ALGORITHM PERFORMANCE
SVM_PRECISION    SVM_RECALL    SVM_FSCORE 
            1             1             1 

#Let's try out multiple algorithms on the same data frame:
speech_multi_models <- train_models(speech_container, algorithms=c("MAXENT","SVM"))

speech_multi_results <- classify_models(speech_container,speech_multi_models)

speech_multi_analytics <- create_analytics(speech_container, speech_multi_results)

ensemble_summary <- create_ensembleSummary(speech_multi_analytics@document_summary)

precisionRecallSummary <- create_precisionRecallSummary(speech_container, speech_multi_results, b_value = 1)

scoreSummary <- create_scoreSummary(speech_container, speech_multi_results)

recall_acc <- recall_accuracy(speech_multi_analytics@document_summary$MANUAL_CODE, speech_multi_analytics@document_summary$MAXENTROPY_LABEL)

summary(speech_multi_results)
MAXENTROPY_LABEL MAXENTROPY_PROB  SVM_LABEL    SVM_PROB     
 1:1578           Min.   :0.5000   1:1558    Min.   :0.5000  
 2: 279           1st Qu.:0.5000   2: 299    1st Qu.:0.7537  
                  Median :0.5084             Median :0.8689  
                  Mean   :0.6933             Mean   :0.8092  
                  3rd Qu.:0.9561             3rd Qu.:0.8689  
                  Max.   :1.0000             Max.   :1.0000  
summary(recall_acc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1502  0.1502  0.1502  0.1502  0.1502  0.1502 
summary(scoreSummary)
MAXENTROPY_LABEL SVM_LABEL   BEST_LABEL    BEST_PROB   NUM_AGREE    
 1:1578           1:1558    Min.   :1.000   1:1584    Min.   :1.000  
 2: 279           2: 299    1st Qu.:1.000   2: 273    1st Qu.:2.000  
                            Median :1.000             Median :2.000  
                            Mean   :1.147             Mean   :1.933  
                            3rd Qu.:1.000             3rd Qu.:2.000  
                            Max.   :2.000             Max.   :2.000  
summary(precisionRecallSummary)
SVM_PRECISION   SVM_RECALL   SVM_FSCORE MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
 Min.   :1     Min.   :1    Min.   :1    Min.   :1            Min.   :1         Min.   :1        
 1st Qu.:1     1st Qu.:1    1st Qu.:1    1st Qu.:1            1st Qu.:1         1st Qu.:1        
 Median :1     Median :1    Median :1    Median :1            Median :1         Median :1        
 Mean   :1     Mean   :1    Mean   :1    Mean   :1            Mean   :1         Mean   :1        
 3rd Qu.:1     3rd Qu.:1    3rd Qu.:1    3rd Qu.:1            3rd Qu.:1         3rd Qu.:1        
 Max.   :1     Max.   :1    Max.   :1    Max.   :1            Max.   :1         Max.   :1        
summary(speech_multi_analytics)
ENSEMBLE SUMMARY
       n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
n >= 1                1.00              0.15
n >= 2                0.93              0.13

ALGORITHM PERFORMANCE
       SVM_PRECISION           SVM_RECALL           SVM_FSCORE MAXENTROPY_PRECISION    MAXENTROPY_RECALL    MAXENTROPY_FSCORE 
                   1                    1                    1                    1                    1                    1

Model evaluation

There are multiple metrics for evaluating binary classification models in machine learning. These metrics help us assess the performance of the model and also guide the parameter tuning process.

Confusion matrix

How can we describe the performance of a classifier? That is, once we have trained a model and have test data with known labels, how can we assess the classifier's performance on that test data? The confusion matrix comes to our rescue. In machine learning, the confusion matrix is used to assess the performance of a classifier; it is also called an error matrix or contingency table. The confusion matrix is a simple table structure that helps the user visualize the performance of an algorithm and is very easy to understand. This type of analysis is generally used in supervised learning.

Let me take the confusion matrix from my spam classifier. The following output was produced with:

confusionMatrix(prediction, testOutcome)

This function is available in the R package caret.

The following confusion matrix shows the classifier's performance; it tells us whether the classifier correctly labeled spam as spam and ham as ham.

N = 2068                    Predicted Class
                            Ham         Spam
Actual Class    Ham         855         1
                Spam        626         586

Let's dive deeper into the terminology of the confusion matrix. In the table, the rows specify the actual class and the columns specify the predicted class.

  • True Positives (TP): These are the mails that were ham and were detected as ham
  • True Negatives (TN): These are the mails that were spam and were detected as spam
  • False Positives (FP): These are the mails that were ham but were detected as spam
  • False Negatives (FN): These are the mails that were spam but were detected as ham

We can visualize the preceding pointers in a table format for better understanding as follows:

N = 2068                    Predicted Class
                            Ham          Spam
Actual Class    Ham         855 (TP)     1 (FP)
                Spam        626 (FN)     586 (TN)

A lot of important information can be derived from the confusion matrix.

The True Positive Rate, also called sensitivity, can be derived using the following formula:

True Positive Rate = TP / (TP + FN)

The True Negative Rate is also called specificity:

True Negative Rate = TN / (TN + FP)

Precision can be calculated using:

Precision = TP / (TP + FP)

Negative predictive value = TN / (TN + FN)

Fallout Rate (False Positive Rate) = FP / (FP + TN)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The F1 score, the harmonic mean of sensitivity and precision, can be calculated as:

F1 = 2TP / (2TP + FP + FN)
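
Applying these formulas to the spam example above gives a quick sanity check:

# Counts taken from the confusion matrix above
TP <- 855; FP <- 1; FN <- 626; TN <- 586

sensitivity <- TP / (TP + FN)                    # True Positive Rate, ~0.577
specificity <- TN / (TN + FP)                    # True Negative Rate, ~0.998
precision   <- TP / (TP + FP)                    # ~0.999
npv         <- TN / (TN + FN)                    # Negative predictive value, ~0.483
fallout     <- FP / (FP + TN)                    # ~0.0017
accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # ~0.697
f1          <- 2 * TP / (2 * TP + FP + FN)       # ~0.732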

ROC curve

The Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate against the False Positive Rate of a classifier in a binary class problem, for instance opinion mining, where the classes are positive and negative sentiment. This curve depicts the performance of a classifier without taking the class distribution into account.

install.packages('ROCR')


library(ROCR)
data(ROCR.simple)


pred <- prediction( ROCR.simple$predictions, ROCR.simple$labels)
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE)
lines(x=c(0, 1), y=c(0, 1), col="black")
(Figure: ROC curve plotted with ROCR)

Precision-recall

Let's say we have a collection of documents, from which we have to retrieve the documents that match a certain criterion. We query the collection based on that criterion and get a list of matching documents. Retrieval mechanism A returned 200 documents, out of which 60 were relevant, while another retrieval mechanism, B, returned 100 documents, out of which 30 were relevant. We also know that there are 400 relevant documents in the collection overall. How do we decide which mechanism worked better? We need to look at the false positives and false negatives here. Recall in this context is the ratio of the number of relevant documents retrieved to the overall number of relevant documents in the collection.

Precision is the ratio of the number of relevant documents retrieved to the overall number of documents retrieved. So, for this example, mechanism A has a recall of 60/400 = 0.15 and a precision of 60/200 = 0.30, while B has a recall of 30/400 = 0.075 and a precision of 30/100 = 0.30. Clearly, mechanism A worked better, as it has a higher recall while the precision is the same for both.
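
The same arithmetic in R, as a quick check of the numbers above:

# Mechanism A: 200 retrieved, 60 relevant; Mechanism B: 100 retrieved, 30 relevant
# The collection holds 400 relevant documents in total
recall_A    <- 60 / 400    # 0.15
precision_A <- 60 / 200    # 0.30
recall_B    <- 30 / 400    # 0.075
precision_B <- 30 / 100    # 0.30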

perf1 <- performance(pred, "prec", "rec")
plot(perf1,colorize=TRUE)
(Figure: precision-recall curve plotted with ROCR)
perf1 <- performance(pred, "sens", "spec")
plot(perf1,colorize=TRUE)
(Figure: sensitivity-specificity curve plotted with ROCR)