CHAPTER 5

Data Collection and Evaluation

In the previous chapters, we discussed various concepts and research opportunities in the blogosphere. This chapter focuses on the equally important aspects of data collection and evaluation. Some of these concepts are difficult to evaluate using conventional evaluation methods and require avant-garde evaluation strategies. We will also discuss some data preprocessing techniques commonly used in blog mining.

5.1 DATA COLLECTION

Data is an essential asset for performing any sort of analysis. In this section, we will discuss ways to collect data using Application Programming Interfaces (APIs) and crawlers. We will also point out some available benchmark datasets. Due to the noisy nature of the blogosphere, a brief discussion of data preprocessing will also be presented.

5.1.1 API

Several social media sites like BlogCatalog, Technorati, del.icio.us, Digg, and Twitter provide APIs that can be used to download part or all of their real-world data. REST (REpresentational State Transfer) is the most widely adopted protocol for these APIs; it differs from SOAP (Simple Object Access Protocol) in that it does not require session tracking or an additional messaging layer. APIs built this way are often termed RESTful APIs. Each API request is handled as a remote procedure call on the server and returns the requested portion of the data in the desired output format, which is specified as one of the arguments of the query. The most widely used formats include XML, JSON (JavaScript Object Notation), JavaScript, and PHP. Usually, there are limits on the number of times one can make an API request (examples are covered later in the chapter). These limits are enforced through an API key, which is uniquely generated for each developer. The developer obtains the API key from the website and supplies it as an argument in every query. Under these constraints, it is practically impossible to collect all the data stored by these sites. The best possible solution is to treat the collected data as representative of the population and study the properties of the blogosphere mentioned in previous chapters. Next, we study some specific API examples from popular social media sites and examine the returned data.
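As a minimal sketch of such a request/response cycle, the following Python snippet issues an HTTP GET against a hypothetical RESTful endpoint; the base URL, parameter names, and key below are placeholders rather than an actual BlogCatalog endpoint.

import urllib.parse
import urllib.request

# Hypothetical RESTful endpoint and parameters -- substitute the values
# documented by the site whose API you are using.
BASE_URL = "http://api.example.com/getinfo"
params = {
    "apikey": "YOUR_API_KEY",   # uniquely generated for each developer
    "username": "johndoe",      # entity whose data is requested
    "format": "xml",            # desired output format (xml, json, ...)
}

# Each request is a simple HTTP GET; the server answers with the requested
# portion of the data in the requested format.
url = BASE_URL + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as response:
    payload = response.read().decode("utf-8")

print(payload[:200])   # first few hundred characters of the response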

BlogCatalog (http://www.blogcatalog.com/api/) is a social community for bloggers and one of the largest blog directories on the internet. BlogCatalog helps in searching blogs, connecting with other bloggers, and learning about and promoting your blog. Bloggers can submit their blog(s) to BlogCatalog. They can specify details such as the blog URL, blog language, author name, blog country, and the date when the blog was submitted to BlogCatalog. They can also specify metadata for their blog, such as the category labels they think are most appropriate for their blog, user-defined tags, and a description of their blog. BlogCatalog automatically crawls the 5 most recent blog post snippets. Bloggers can provide tags for these blog post snippets. Members of BlogCatalog can rate the blogs submitted by fellow bloggers on a scale of 5, with 5 being the most popular. They can also comment on the blog or blog posts submitted by the blogger. Bloggers have the option to develop a social network with other members and can specify who their friends are. They can also specify the communities they are a part of and their favorite blogs. BlogCatalog also displays a list of the 10 members who most recently visited a blog. BlogCatalog thus provides a richly annotated source of blog data.

BlogCatalog provides two basic functions in its API: getinfo and bloginfo. The getinfo call provides information about a BlogCatalog user. It requires two input arguments: an API key (bcwsid) and the username of the blogger whose information is to be downloaded. The returned information includes the blogger’s real name, the blogger’s details, the last login attempt, the name(s) of his or her blog(s), their URL(s), their BlogCatalog URL(s), hits, views, rank, language, country, the blogger’s friends, and related or neighboring blogs.

Example 5.1. The sample request for getinfo looks like:

 

http://api.blogcatalog.com/getinfo?bcwsid=[apikey]&username=johndoe

and the sample response looks like:

 

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE bcapi PUBLIC "-//BlogCatalog//DTD BCAPI 0.01//EN"
  "http://api.blogcatalog.com/dtd/bcapi-001.xml">
<bcapi version="1.0">
 <result>
  <user id="56848">johndoe</user>
  <realname>John Doe</realname>
  .
  .
  .
  <weblogs>
   <weblog id="4286052">
    <name>The JohnDoe Blog</name>
    <url>http://blog.johndoe.com</url>
    <bcurl>http://www.blogcatalog.com/blogs/the-johndoe-blog.html</bcurl>
    .
    .
    .
   </weblog>
  </weblogs>
  <friends>
   <user id="56401">AngeldaVinci</user>
   <user id="13309">SiteProPlus</user>
   <user id="50995">thegoodknife</user>
   .
   .
   .
  </friends>
  <neighborhoods>
   .
   .
   .
  </neighborhoods>
 </result>
</bcapi>
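A minimal sketch of how such a response might be parsed with Python’s standard XML library, assuming the tag names shown in the sample above and that the raw response text is stored in payload:

import xml.etree.ElementTree as ET

# Re-encode to bytes since the response carries an XML encoding declaration.
root = ET.fromstring(payload.encode("utf-8"))   # the <bcapi> element
result = root.find("result")

username = result.findtext("user")      # text of the <user id="..."> element
realname = result.findtext("realname")

# Collect the blogger's weblogs and friends.
weblogs = [(w.findtext("name"), w.findtext("url"))
           for w in result.find("weblogs").findall("weblog")]
friends = [u.text for u in result.find("friends").findall("user")]

print(username, realname, weblogs, friends)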

The bloginfo query provides information about a blog. It requires two input arguments: an API key (bcwsid) and the BlogCatalog URL of the blog you are interested in. This request returns the name of the blog, the actual URL, the BlogCatalog URL, the categories that this blog is listed under, the user-defined tags of the blog, hits, views, rank, language, country, the number of reviews, review details (including the reviewer name, rating, review date, and review text), similar or neighborhood blogs, and recent viewers. Note that the actual URL is different from the BlogCatalog URL. For instance, an actual URL looks like http://blog.johndoe.com/, whereas a BlogCatalog URL looks like http://www.blogcatalog.com/blogs/the-johndoe-blog.html.

Example 5.2. The sample request for bloginfo looks like:

 

http://api.blogcatalog.com/bloginfo?bcwsid=[apikey]&url=http://www.blogcatalog.com/blogs/the-johndoe-blog.html

and the sample response looks like:

 

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE bcapi PUBLIC "-//BlogCatalog//DTD BCAPI 0.01//EN"
  "http://api.blogcatalog.com/dtd/bcapi-001.xml">
<bcapi version="1.0">
 <result>
  <weblog id="4286052">
   <name>The JohnDoe Blog</name>
   <url>http://blog.johndoe.com</url>
   <user id="56848">JohnDoe</user>
   <bcurl>http://www.blogcatalog.com/blogs/the-johndoe-blog.html</bcurl>
   <categories>
    <category>Blog Resources</category>
    <category>Blogging</category>
   </categories>
   <tags>
    <tag>annoucements</tag>
    <tag>blogcatalog</tag>
    <tag>news</tag>
   </tags>
   <review-count>4</review-count>
   <reviews>
    <review id="78005">
     <rating>10</rating>
     <realname>Jonathan-C. Phillips</realname>
     <text-excerpt>this is so cool. . .</text-excerpt>
     .
     .
     .
    </review>
    .
    .
    .
   </reviews>
   <neighborhood>
    .
    .
    .
   </neighborhood>
   <recent-viewers>
    .
    .
    .
   </recent-viewers>
  </weblog>
 </result>
</bcapi>

More examples of various APIs available at different blog sites such as Technorati, Digg, del.icio.us, and Twitter are discussed in Appendix B.

5.1.2 WEB CRAWLER

Some blogs do not provide APIs to download data, and many are not indexed by social media sites, including the ones mentioned above. In such cases, one can write crawlers: programs that automatically download webpages, including blogs. A crawler can visit many webpages by following the trail of hyperlinks, virtually moving from one webpage to another. The most important difference between conventional web crawlers and blog crawlers is that blog crawlers must continuously download the most recent information as new content is added to the blogs. Conventional web crawlers can be modified to work in an incremental fashion. First, we briefly discuss how conventional web crawlers work and then explain the modifications that make them incremental.

A conventional web crawler works in a sequential manner. It starts from a seed set of webpages (or URLs) and then uses the links within them to fetch other pages. This process continues until a stopping criterion is satisfied. Each time the crawler encounters a new URL, it adds the URL to a queue-based (FIFO) data structure known as the “frontier”. Each time the crawler finishes downloading a webpage, it extracts a URL from the frontier and starts crawling the corresponding webpage. This ensures a breadth-first traversal of the web. The downloaded webpage is parsed and stored in a repository or a database, and the new URLs are added to the frontier. The crawler stops once the frontier is empty or some other criterion is satisfied. Some applications require the crawler to download only webpages that belong to a certain type or topic. Such crawlers are called preferential crawlers. These crawlers add to the frontier only those URLs that belong to the particular topic or type. Sometimes the frontier is implemented as a priority queue rather than a first-in-first-out (FIFO) queue to implement preferential crawlers.
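The following Python sketch illustrates the frontier-based, breadth-first logic described above; it is deliberately minimal (no politeness delays, robots.txt handling, or duplicate-content detection), and the seed URL is a placeholder.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl starting from a seed set of URLs."""
    frontier = deque(seed_urls)          # FIFO queue of URLs to visit
    visited = set(seed_urls)
    repository = {}                      # url -> raw HTML

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # skip pages that fail to download
        repository[url] = html           # store the downloaded page

        # Extract outgoing links and add unseen ones to the frontier.
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in visited:
                visited.add(absolute)
                frontier.append(absolute)
    return repository

# repository = crawl(["http://blog.johndoe.com/"])   # placeholder seed URL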

Blog crawling is very similar to webpage crawling except that blogs are updated very frequently. This requires the crawler to continuously revisit these blogs and look for updates. Such crawlers, also known as incremental crawlers, look for updated information on the blogs. They are implemented in the same way as conventional crawlers, but they check whether a blog post has already been crawled before crawling it. Blog posts are arranged in reverse chronological order, so once the crawler reaches a blog post that has already been crawled, there is no point in going beyond that post. The crawler downloads the blog post pages, parses them, and stores them in a repository or database. These parsers can be written using regular expressions that check for a particular HTML pattern. Once an occurrence of such a pattern is found, the HTML code can be stripped, and the data element is stored in the database. Let’s consider an example to illustrate this.

 

Example 5.3. The HTML code snippet for a blog post title looks as follows:

<div class="post">
<p class="filed-under">Filed under:
<a href="/category/iphone/">iPhone</a>,
<a href="/category/app-store/">App Store</a>,
<a href="/category/first-look/">First Look</a>,
<a href="/category/app-review/">App Review</a></p>
<h2 class="posttitle">
<a href="http://www.tuaw.com/2009/06/11/first-look-get-home-for-iphone/"
   rel="bookmark">First Look: Get Home for iPhone</a>
</h2>
<p class="byline">by
<strong><a href="/bloggers/cory-bohon/">Cory Bohon</a></strong>
on Jun 11th, 2009</p>

Now to extract the title, we can use the pattern <h2 class="posttitle">[\w\W]*?</h2>. This would extract everything enclosed within the h2 tag, which also includes some HTML code. This code can be trimmed using the regular expression <[\w\W]*?> to obtain the title of the blog post, i.e., “First Look: Get Home for iPhone”.
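A minimal sketch of this two-step extraction in Python, assuming the snippet above is stored in a string named html:

import re

# Step 1: grab everything enclosed in the <h2 class="posttitle"> element.
fragment = re.search(r'<h2 class="posttitle">[\w\W]*?</h2>', html).group(0)

# Step 2: strip the remaining HTML tags to obtain the plain title.
title = re.sub(r'<[\w\W]*?>', '', fragment).strip()

print(title)   # -> First Look: Get Home for iPhone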

 

Another way to implement an incremental crawler is to parse RSS (Really Simple Syndication) feeds. RSS feeds are well-formed XML files available at blogs. RSS feeds are automatically updated with the most recent entries, or blog posts, published at the blog. The number of recent blog posts listed in an RSS feed varies across blogs, but usually RSS feeds contain the 15 most recent blog posts. After a blog is crawled using a conventional web crawler, one can subscribe to its RSS feed and update the blog data as soon as new entries appear in the feed.
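As an illustration, the third-party feedparser library (or any XML parser) can be used to poll a feed and keep only the entries that have not yet been stored; the feed URL and the seen set below are placeholders.

import feedparser   # third-party: pip install feedparser

FEED_URL = "http://blog.johndoe.com/rss"   # placeholder feed address
seen = set()                               # permalinks already in the repository

def poll_feed():
    """Fetch the feed and return only the posts not crawled before."""
    feed = feedparser.parse(FEED_URL)
    new_posts = []
    for entry in feed.entries:             # usually the ~15 most recent posts
        if entry.link in seen:
            break                          # entries are reverse chronological,
                                           # so everything older is already stored
        seen.add(entry.link)
        new_posts.append((entry.title, entry.link))
    return new_posts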

5.1.3 AVAILABLE DATASETS

Besides downloading datasets using APIs and crawling, there are datasets available for research and public usage. Some examples are:

Social Computing Data Repository hosts data from a collection of over 20 different social media sites that have blogging capabilities. The data can be accessed from http://socialcomputing.asu.edu/. Some of the prominent social media sites included in this repository are BlogCatalog, Twitter, MyBlogLog, Digg, StumbleUpon, del.icio.us, MySpace, LiveJournal, The Unofficial Apple Weblog (TUAW), Reddit, etc. The data is crawled daily at pre-specified times using APIs, crawlers, and RSS feeds. At the time of writing, the total repository size was estimated to be around 15 GB. The repository contains various facets of blog data, including blog site metadata such as user-defined tags, predefined categories, and blog site description; blog post level metadata such as user-defined tags and the date and time of posting; blog posts; blog post mood (defined as the blogger’s emotions when he/she wrote the blog post); blogger name; blog post comments; and the blogger social network.

Spinn3r: The dataset, provided by Spinn3r.com [67], is a set of 44 million blog posts made between August 1st and October 1st, 2008. A post includes the text as syndicated, as well as metadata such as the blog’s homepage, timestamps, etc. The data is formatted in XML and further arranged into tiers that approximate search engine ranking to some degree. The total size of the dataset is 142 GB uncompressed (27 GB compressed). This dataset spans a number of big news events (the 2008 Olympics, both US presidential nominating conventions, the beginnings of the financial crisis, etc.) as well as many others you might expect to find in blog posts.

Nielsen BuzzMetrics Dataset: This dataset (http://www.icwsm.org/format.txt) contains roughly 14 million blog posts from 3 million blog sites collected by Nielsen BuzzMetrics in May 2006. The dataset contains about 1.7 million blog-blog links. The blog posts are written in several languages, with the majority (51%) being in English. The complete dataset is over 10 GB. The marked-up fields include: date of posting, time of posting, author name, title of the post, weblog URL, permalink, tags/categories, and outlinks classified by type. However, half of the blog outlinks are missing. The data is in XML format.

TREC Blog Dataset: The TREC blog dataset (http://trec.nist.gov/data/blog.html) contains a crawl of 100,649 feeds collected during late 2007 and early 2008. The feeds were polled once a week for 11 weeks. In total, 753,681 feeds were collected, with an average of 10,615 feeds collected daily. The size of the dataset is 38.6 GB uncompressed and 8 GB compressed. It also contains a reasonable amount of spam.

5.1.4 DATA PREPROCESSING

Given a dataset, an essential step before any data mining task is to preprocess the data and remove the irrelevant part (e.g., noise) as much as we can. Here we will present some basic data preprocessing approaches that help improve quality.

Stopwords are commonly occurring words in a given language, or tokens that carry little linguistic meaning of their own. For example, in the English language, words such as “a”, “and”, “is”, “the”, etc. are termed stopwords. These words do not contribute significant information to data mining or information retrieval tasks. There is no standard list of stopwords; the choice is often controlled by human input, although some lists can be found at http://www.textfixer.com/resources/common-english-words.txt or http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/. Stopword elimination is one of the essential steps in data preprocessing and helps in drastically reducing the noise in the dataset.
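A minimal sketch of stopword elimination; the tiny stopword list below is purely illustrative, and in practice one of the lists linked above (or a library-provided list) would be loaded instead.

# A tiny illustrative stopword list; real lists contain several hundred entries.
STOPWORDS = {"a", "an", "and", "is", "the", "of", "to", "in"}

def remove_stopwords(text):
    """Return the tokens of `text` with stopwords removed."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("The iPhone is one of the most blogged gadgets"))
# -> ['iphone', 'one', 'most', 'blogged', 'gadgets']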

Stemming is the process of reducing words to their stem or root form. The process of stemming is also known as conflation. For example, a stemming algorithm would reduce words like “fisher”, “fishing”, “fish”, and “fished” to the word “fish”. It is not essential for a stemming algorithm to reduce a word to its morphological root; however, it is important that all inflected or derived forms of a word map to the same stem. The Porter stemmer is one of the most widely used stemming algorithms (http://tartarus.org/~martin/PorterStemmer/). Stemming helps greatly in reducing the dimensionality of the text contained in blog posts.
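For instance, the Porter stemmer implementation shipped with the NLTK library can serve as a sketch of this step (the word list is illustrative; note that the produced stem need not be the morphological root):

from nltk.stem import PorterStemmer   # third-party: pip install nltk

stemmer = PorterStemmer()
words = ["fishing", "fished", "fish"]
print([stemmer.stem(w) for w in words])
# -> ['fish', 'fish', 'fish']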

Colloquial Language Usage - Due to the casual nature of the blogosphere, bloggers often use colloquial forms of language. Features that appear noisy might actually be informative. Services like UrbanDictionary (http://www.urbandictionary.com) can be used to unravel the informative content in the slang, abbreviations, and colloquial forms of language used by bloggers.

Feature Ranking - Considering the highly textual nature of the blogosphere, it is often desirable to reduce the dimensionality. The commonly used English vocabulary contains roughly 30,000 words, but a single blog uses only a small fraction of them, so the term-vector representation of blogs can be extremely sparse, which leads to the curse of dimensionality. Researchers often employ term ranking approaches like tf-idf [68], feature selection [69], and feature reduction approaches such as latent semantic indexing (LSI) [70], probabilistic latent semantic indexing (PLSI) [71], and latent dirichlet allocation (LDA) [72]. Such techniques help in alleviating the effect of the curse of dimensionality, enhancing generalization capability, speeding up the learning process, and improving model interpretability. Feature selection approaches pick the features that are most informative using measures like information gain, minimum description length, etc. Feature reduction approaches transform the feature space from term space to concept space, hence reducing the dimensionality.
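As an illustration of term ranking, the tf-idf weighting mentioned above can be computed with scikit-learn; the three toy blog posts below are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer  # pip install scikit-learn

posts = [
    "first look at the new iphone app store",       # toy blog posts
    "iphone app review get home for iphone",
    "financial crisis hits the news this week",
]

vectorizer = TfidfVectorizer(stop_words="english")   # also drops English stopwords
tfidf = vectorizer.fit_transform(posts)              # sparse (posts x terms) matrix

# Rank the terms of the first post by their tf-idf weight.
terms = vectorizer.get_feature_names_out()
weights = tfidf[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda x: -x[1])[:5])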

In a casual environment like the blogosphere that nurtures sentiments, expressions, and emotions through writing, it is quite common to observe intentionally modified spellings such as “what’s uppppp?” and “this is so cooooool. .”. These instances demonstrate examples of intonation1 in written text. Such deliberate misspellings emphasize the emotions and convey more information than regular text. It would be undesirable to treat them as sheer misspellings and replace them with the correct spellings. To the best of our knowledge, up to the writing of this book none of the text analysis approaches consider this phenomenon. But it clearly has tremendous potential and presents great promise for further exploration.
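Although, as noted above, existing text analysis approaches do not yet exploit this phenomenon, one possible sketch of detecting elongated words while preserving the emphasis signal might look as follows (the repetition threshold is an assumption):

import re

# Three or more repetitions of the same character signal intonation,
# e.g. "what's uppppp?" or "this is so cooooool".
ELONGATION = re.compile(r"(\w)\1{2,}")

def normalize_with_emphasis(token):
    """Collapse elongated characters to at most two, and flag the emphasis."""
    emphasized = bool(ELONGATION.search(token))
    normalized = ELONGATION.sub(r"\1\1", token)    # "cooooool" -> "cool"
    return normalized, emphasized

print(normalize_with_emphasis("cooooool"))   # -> ('cool', True)
print(normalize_with_emphasis("cool"))       # -> ('cool', False)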

5.2 EVALUATION

In order to measure the difference an algorithm makes over its existing counterparts, it is necessary to systematically evaluate the merit, worthiness, and significance of the algorithm. This process is called evaluation. The algorithms or proposed techniques are evaluated using a set of standard criteria. However, sometimes the existing criteria cannot be used to evaluate algorithms or techniques in the context of the blogosphere. For instance, evaluating concepts like influence presents a big challenge due to the absence of ground truth. Evaluation models based on training and testing data fail in such situations. Performing human evaluation through surveys may seem like the only solution to this challenge; however, human evaluation presents its own challenges, such as funding and recruiting unbiased yet representative users. High cost and long turnaround time are additional constraints.

In this section, we will discuss some standard criteria that can be used to evaluate approaches or techniques in the blogosphere research. We will also discuss some novel and avant-garde evaluation strategies for evaluating concepts like influence, where conventional evaluation criteria fall short.

5.2.1 BLOG MODELING

A predominant way to evaluate a proposed model for a blog dataset is to compare the aggregate statistics of the proposed model with those of the actual dataset. Some of these statistics, also described in Chapter 1, include average degree, indegree and outdegree distribution, degree correlation, diameter, largest weakly connected component size, largest strongly connected component size, clustering coefficient, average path length, reciprocity, etc. The data can be divided into training and testing parts. Using the training dataset, we can learn the model parameters through maximum likelihood estimation and observe the goodness of fit of the model on the test dataset in terms of the statistics mentioned above.

 

Example 5.4. Using the Barabási-Albert scale-free network model, we generated a 25,000-node network. The model started with 2 nodes and a minimum node degree of 60. New nodes were added until the network had 25,000 nodes. To test whether it simulates a blog network, we compare the aggregate statistics in Table 5.1. The statistics for the blog dataset were derived from the blog network crawled from BlogCatalog.com. The crawled network comprised 23,566 nodes.

Table 5.1: Comparison of aggregate statistics for a blog network crawled from BlogCatalog.com and a scale-free network generated using the Barabási-Albert model.

Images
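A sketch of how such a comparison could be scripted with the networkx library; the generator parameters mirror Example 5.4, and the statistics computed below are only a subset of those listed in Table 5.1.

import networkx as nx   # third-party: pip install networkx

# Generate a scale-free (Barabasi-Albert) network: 25,000 nodes,
# each new node attaching to 60 existing nodes.
ba = nx.barabasi_albert_graph(n=25000, m=60, seed=42)

def aggregate_statistics(g):
    """A few of the aggregate statistics used to compare a model with real data."""
    degrees = [d for _, d in g.degree()]
    return {
        "average degree": sum(degrees) / g.number_of_nodes(),
        "clustering coefficient": nx.average_clustering(g),
        "largest connected component size":
            len(max(nx.connected_components(g), key=len)),
    }

print(aggregate_statistics(ba))
# The same statistics would be computed on the crawled BlogCatalog network
# (e.g., loaded with nx.read_edgelist) and compared side by side as in Table 5.1.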

5.2.2 BLOG CLUSTERING AND COMMUNITY DISCOVERY

Since clustering and community discovery are quite close to each other in terms of the homophily assumption, their evaluation is also done in a similar way. The underlying assumption for both clustering and community discovery is that similar or like-minded individuals interact more frequently than others. This phenomenon is used to evaluate the performance of clustering or community extraction algorithms. Essentially, we look at within-cluster and between-cluster distances. An algorithm that gives the smallest within-cluster distance and the largest between-cluster distance is considered the best. A small within-cluster distance ensures cohesive clusters, and a large between-cluster distance ensures that dissimilar clusters are well separated. For multiple clusters, we compute the average within-cluster distance (Wd) and the average between-cluster distance (Bd). The ratio σ = Wd/Bd is computed for different algorithms, and the algorithm whose ratio is minimum is chosen as the best. σ can also be computed for the communities extracted by community discovery algorithms, so the same measure can be used to evaluate them. σ is defined as:

Images

In the above formula, ci, cj represent two different clusters i and j; vm, vn are two different vectors representing two different bloggers; k, the number of clusters, varies from 2 to the total number of blogs, i.e., ||D||; and d(vm, vn) gives the distance between two bloggers m and n, represented by the corresponding vectors vm and vn. Distance between two vectors can be computed as the inverse of the cosine similarity between the two vectors.
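A minimal sketch of this evaluation, assuming each blogger is represented as a term vector and each cluster is a list of such vectors; a cosine-based distance (one minus cosine similarity, one reading of the inverse of cosine similarity) and centroid-to-centroid between-cluster distances (one of the options discussed below) are used purely for illustration.

import numpy as np

def cosine_distance(u, v):
    """Distance derived from cosine similarity (1 - similarity)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def sigma(clusters):
    """Ratio of average within-cluster to average between-cluster distance."""
    # Average pairwise distance inside each cluster (Wd).
    within = []
    for c in clusters:
        pairs = [cosine_distance(c[m], c[n])
                 for m in range(len(c)) for n in range(m + 1, len(c))]
        if pairs:
            within.append(np.mean(pairs))
    w_d = np.mean(within)

    # Distance between cluster centroids (the "cluster mean" option) (Bd).
    centroids = [np.mean(c, axis=0) for c in clusters]
    between = [cosine_distance(centroids[i], centroids[j])
               for i in range(len(centroids))
               for j in range(i + 1, len(centroids))]
    b_d = np.mean(between)
    return w_d / b_d   # the smaller the ratio, the better the clustering

# clusters = [[vec, vec, ...], [vec, ...]]   # output of a clustering algorithm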

An important point is the between-cluster distance calculation. There are several ways of distance computation:

• Cluster Mean/Median Distance: Distance between different clusters could be computed using the cluster centroids or the cluster median. For clusters, ci and cj, their cluster mean or median could be computed as Images and Images. The distance between these cluster means or medians is considered as the distance between two clusters. The distance between cluster medians is resistant to noise or cluster outliers whereas the distance between cluster means or centroids is not. However, this approach to compute distance requires computing the mean or median of the clusters.

• Single Linkage: The distance between two clusters, ci and cj, is taken as the shortest pairwise distance between the cluster members of ci and cj. Single linkage is highly sensitive to noise or outliers. This scheme suffers from the chaining phenomenon, which means that clusters may be considered close to each other due to single elements being close to each other, even though many of the elements in each cluster may be very distant from each other.

• Complete Linkage: The distance between two clusters, ci and cj, is taken as the largest pairwise distance between the cluster members of ci and cj. Complete linkage is highly sensitive to noise or outliers. Two clusters may be considered distant from each other due to single elements being far from each other, even though many of the elements in each cluster may be very close to each other.

Images

Figure 5.1: An example of cluster evolution. A cluster r consisting of 100 members at time t splits into 2 clusters, i = 1 and i = 2, consisting of 40 and 60 members, respectively, at time t + 1.

• Average Linkage: Distance between two clusters ci and cj is considered as the average pairwise distance between the cluster members of ci and cj. Average linkage is resistant to noise or cluster outliers.

Some clustering algorithms study the evolution of clusters in the blogosphere. As mentioned in Chapter 2, clusters are dynamic: bloggers’ interests drift over time, which changes their cluster membership. To measure this drift of bloggers’ interests, the authors in [74] propose a measure called blogger entropy. It measures the dispersion of the bloggers in one cluster across the clusters of the next time window. For a fixed value of k, i.e., the number of clusters, if many bloggers in a cluster at time t still remain in the same cluster at time t + 1, then the blogger entropy approaches 0. Conversely, if the bloggers in a cluster at time t are spread equally among many clusters at time t + 1, the blogger entropy approaches 1. The blogger entropy Ur for a cluster r can be computed as:

Ur = − (1 / log q) Σ_{i=1..q} (n_r^i / n_r) log(n_r^i / n_r)

where n_r is the number of bloggers that are in the dataset at time t + 1 and were members of the cluster r at time t, n_r^i is the number of bloggers from cluster r at time t contained in cluster i at time t + 1, and q is the number of clusters at time t + 1 among which the bloggers of cluster r at time t are distributed.

 

Example 5.5. To illustrate the above measure for evolutionary clustering, we study an example of a cluster r with 100 members at time t splitting into two clusters, i = 1 and i = 2, of 40 and 60 members, respectively, at time t + 1. For the sake of simplicity, we assume that no member leaves and no new member joins. This is illustrated in Figure 5.1. Here n_r = 100, n_r^1 = 40, and n_r^2 = 60. So,

Ur = − (1 / log 2) [ (40/100) log(40/100) + (60/100) log(60/100) ] ≈ 0.97

A high value of Ur (close to 1) indicates a big change in the cluster from time t to t + 1.
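A small sketch of this computation, assuming the normalized-entropy form of blogger entropy given above:

import math

def blogger_entropy(member_counts):
    """Normalized entropy of how cluster r's members spread over clusters at t+1.

    member_counts[i] is the number of bloggers from cluster r (at time t)
    that end up in cluster i at time t + 1.
    """
    n_r = sum(member_counts)
    q = len(member_counts)
    if q < 2:
        return 0.0                      # everyone stayed together: no drift
    entropy = -sum((n / n_r) * math.log(n / n_r) for n in member_counts if n > 0)
    return entropy / math.log(q)

print(round(blogger_entropy([40, 60]), 2))   # Example 5.5 -> 0.97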

5.2.3 INFLUENCE AND TRUST

As we mentioned earlier in Chapter 3, both influence and trust are subjective concepts, which presents unique challenges in evaluation due to the absence of ground truth. In these cases, researchers often resort to the time-consuming choice of human evaluation (e.g., focus groups). Acquiring unbiased human evaluations is not only time consuming but also usually expensive. An interesting alternative evaluation strategy is presented in [44]. It proposes to leverage the power of social media in evaluation. Specifically, Digg was used to evaluate how influential blog posts are. Every story that appears on Digg is voted on by the community. These votes are denoted by “diggs”. The more diggs a story gets, the more popular it is. Not all posts/stories get dugg; only those that leave an impression receive diggs. Hence, the diggs of a post can be treated as an approximate measure of influence. The diggs assigned to these stories are purely community contributed. These scores can be used to rank the stories based on their influence on the members of the community. This can also be treated as a survey in which the human subjects voluntarily rank the stories. Using the scores on Digg, we can generate the ground truth or gold standard against which various models can be evaluated.

Once a ground truth is obtained, either through human evaluation or social media, comparing two models reduces to computing rank correlations between two ordered lists: the one obtained from the ground truth and the one generated by the model. Note that models that compute trust scores for blog posts or blogs can also be evaluated using the measures mentioned below. Various measures can be used, such as Normalized Discounted Cumulative Gain (NDCG) and correlation measures like Spearman’s rank correlation and the Kendall tau distance. Next, we discuss these in detail.

NDCG measures the relevance of the rankings of entities such as blog posts or blogs. It allows multiple levels of relevance (r) for the entities such as irrelevant (0), borderline (1), and relevant (2). These relevance scores are defined a priori. We assume three levels of relevance for the sake of simplicity. The levels of relevance depend on how precisely one wants to compare the results of two algorithms.

Table 5.2: Two ranking algorithms RF1 and RF2 evaluated against the ground truth GT using NDCG measure as presented in Example 5.6.

Images

Each relevant blog or blog post contributes some gain, which is accumulated as the Cumulative Gain (CG) and is computed as:

CG = Σ_{i=1..n} ri

Here d1, . . . , dn denote n blogs or blog posts, and ri denotes the relevance of the blog post or blog ranked at the i-th position. Gains from lower-ranked documents are discounted, giving the Discounted Cumulative Gain (DCG). This ensures that ranking algorithms which place more relevant blogs or blog posts higher receive higher scores.

DCG = r1 + Σ_{i=2..n} ri / log2 i

The final score is normalized by the maximum score possible (MaxDCG), giving the Normalized Discounted Cumulative Gain (NDCG). Note that R denotes the ranking of the blog posts or blogs obtained through the ground truth; MaxDCG is the DCG obtained from R.

NDCG = DCG / MaxDCG

Example 5.6. Consider the example in Table 5.2; there are 4 documents or blog posts d1, d2, d3, and d4. The ideal ranking is shown in the first column “Ground Truth”. Two other ranking algorithms are denoted by RF1 and RF2 in the second and third columns, respectively.

MaxDCG = DCG of GT = 2 + 2/log2 2 + 1/log2 3 + 0/log2 4 = 4.63
DCGRF1 = 2 + 2/log2 2 + 1/log2 3 + 0/log2 4 = 4.63
DCGRF2 = 2 + 1/log2 2 + 2/log2 3 + 0/log2 4 = 4.26

NDCGRF1 = 4.63 / 4.63 = 1.00 and NDCGRF2 = 4.26 / 4.63 ≈ 0.9203

Hence, NDCGRF1 = 1.00 and NDCGRF2 = 0.9203. This exemplifies that NDCG decreases if a more relevant document is pushed down in the rank. NDCG can capture the change in ranking order.
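A short sketch of the NDCG computation, using the discounting scheme above; the relevance lists correspond to the values implied by Example 5.6 (the two top-ranked documents share relevance 2).

import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance levels."""
    return relevances[0] + sum(r / math.log2(i)
                               for i, r in enumerate(relevances[1:], start=2))

def ndcg(relevances, ideal_relevances):
    """Normalize by the DCG of the ground-truth (ideal) ranking."""
    return dcg(relevances) / dcg(ideal_relevances)

ideal = [2, 2, 1, 0]     # relevances in ground-truth order
rf1   = [2, 2, 1, 0]     # RF1 swaps two equally relevant documents
rf2   = [2, 1, 2, 0]     # RF2 pushes a relevant document down
print(round(ndcg(rf1, ideal), 4), round(ndcg(rf2, ideal), 4))   # 1.0 0.9203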

Correlation Measures in statistics describe the relationship between two different rankings on the same set of items. A rank correlation coefficient measures the correspondence between two rankings. We discuss two most commonly used measures: Spearman’s correlation coefficient (ρ) and Kendall tau distance (τ).

Spearman’s correlation coefficient computes the similarity between the rankings of the corresponding items of a set. It is computed as:

ρ = 1 − (6 Σ_{i=1..n} δi²) / (n(n² − 1))

where δi denotes the difference between the ranks of the corresponding items of the two ordered lists, and n represents the total number of items in the ordered lists. Note that Spearman’s correlation coefficient compares two ranked lists of the same size with the same items. ρ lies in the interval [−1, 1]. ρ = 1 implies a perfect agreement while ρ = −1 implies a perfect disagreement. We illustrate the computation of Spearman’s correlation coefficient with an example below.
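A small sketch of the computation; the two rank dictionaries below are hypothetical and simply map each item to its rank (1 = highest).

def spearman(rank_a, rank_b):
    """Spearman's correlation between two rankings of the same items."""
    n = len(rank_a)
    sum_sq = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - (6 * sum_sq) / (n * (n ** 2 - 1))

ground_truth = {"d1": 1, "d2": 2, "d3": 3, "d4": 4}   # hypothetical ranks
model        = {"d1": 2, "d2": 1, "d3": 3, "d4": 4}   # adjacent swap of d1 and d2
print(spearman(ground_truth, model))                  # -> 0.8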

Table 5.3: Conversion of documents to their rankings based on ground truth GT and ranking algorithms RF1 and RF2.

Images

Example 5.7. Continuing with the same example of the four documents d1, d2, d3, and d4 in Table 5.2, we compute the Spearman’s correlation coefficient for ranking algorithms RF1 and RF2 with respect to the ground truth GT. First we convert the document lists to their rankings. The converted rankings are displayed in Table 5.3, which shows the rankings for the ground truth (GT) and the ranking algorithms RF1 and RF2. It also shows the differences (δi(RF1) and δi(RF2)) in the ranks of corresponding items for RF1 and RF2 with respect to GT, respectively. Hence,

Table 5.4: Computing the discordant pairs between GT and RF1; and GT and RF2 for Example 5.8.

Images

Images

This shows that RF1 is a better ranking algorithm than RF2.

Kendall tau measures the dissimilarity between two ranked lists in terms of the ranks of their corresponding items. It is computed as

τ = P / (n(n − 1)/2) = 2P / (n(n − 1))

where n is the total number of items in the ranked list, and P is the total number of discordant pairs between the two ranked lists. In order to calculate the number of discordant pairs, we pair each item with every other item and count the number of times the values in the first list are in the opposite order of the values in the second list. This is exemplified in Table 5.4. For GT and RF1, the document pair (d4, d3) is the only discordant pair, as evident from the disagreement between the signs of the rank differences (−1 and 1). The Kendall tau distance is normalized by the total number of possible pairs, n(n − 1)/2. The value of τ lies in the interval [0, 1]. τ = 0 implies perfect agreement while τ = 1 implies perfect disagreement. Let’s illustrate the computation of the Kendall tau distance with the help of an example.

Example 5.8. Continuing with the example of the four documents d1, d2, d3, and d4, we look at the first three rows of Table 5.3. Table 5.4 illustrates the computation of discordant pairs between GT and RF1 and between GT and RF2. The number of discordant pairs between the ranked lists generated by GT and RF1 is 1 (counted by the number of sign disagreements between the rank differences of the document pairs in Table 5.4). Hence PRF1 = 1. Similarly, the number of discordant pairs between the ranked lists generated by GT and RF2 is 2. Hence PRF2 = 2. So,

τRF1 = (2 × 1) / (4 × 3) ≈ 0.17 and τRF2 = (2 × 2) / (4 × 3) ≈ 0.33

This also shows that RF1 is a better ranking algorithm than RF2 since the shorter the distance the better.
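A sketch of the discordant-pair count and the resulting normalized distance, reusing the hypothetical rank dictionaries from the Spearman sketch above:

from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Fraction of item pairs ordered oppositely in the two rankings."""
    items = list(rank_a)
    discordant = sum(
        1 for x, y in combinations(items, 2)
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0
    )
    n = len(items)
    return discordant / (n * (n - 1) / 2)

ground_truth = {"d1": 1, "d2": 2, "d3": 3, "d4": 4}   # hypothetical ranks
model        = {"d1": 2, "d2": 1, "d3": 3, "d4": 4}   # one discordant pair
print(round(kendall_tau_distance(ground_truth, model), 2))   # -> 0.17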

Examples 5.6, 5.7, and 5.8 demonstrate that NDCG emphasizes the relevance of the documents at each position, whereas the rank correlation measures look only at the positions themselves. Both Spearman’s correlation coefficient and the Kendall tau distance indicated that RF1 is not as good as the ground truth GT, even though documents d4 and d3 have the same relevance (2). Example 5.6 shows that interchanging the positions of d4 and d3 in the ranking makes no difference to NDCG.

5.2.4 SPAM

As we mentioned earlier in Chapter 4, spam (splog) filtering is a typical classification task, and we evaluate the various approaches based on the metrics used to evaluate classification algorithms, such as precision, recall, and f-measure. The evaluation of classification algorithms often involves a training and test model. A dataset is split into training and testing portions. An algorithm learns on the training portion of the dataset and is evaluated on the testing portion.

For the task of filtering spam, each instance of the dataset is pre-annotated as “spam” or “non-spam”, i.e., “positive” or “negative”, respectively. The classification algorithm treats the “spam” instances as the positive class and the “non-spam” instances as the negative class. A classifier is learned using the training data and predicts the class labels of the test dataset. We can then compute statistics such as true positives, true negatives, false positives, and false negatives.

True Positives (TP): Those instances that are classified as positive and actually belong to the positive class are called true positives. In terms of spam filtering, true positives are the splogs that are correctly classified as splogs by the classifier. These come under the correct classifications.

True Negatives (TN): Those instances that are classified as negative and actually belong to the negative class are called true negatives. In terms of spam filtering, true negatives are the nonspam blogs that are correctly classified as non-spam blogs by the classifier. These come under the correct classifications.

False Positives (FP): Those instances that are classified as positive but actually belong to the negative class are called false positives. In terms of spam filtering, false positives are the non-spam blogs that are misclassified as splogs by the classifier. They are a part of the misclassifications.

Images

Figure 5.2: Classification of blogs as spam (splogs) and non-spam. The blue ellipse represents the set of actual splogs and the green sphere represents the predicted splogs. Everything except the blue represents the actual non-spam blogs.

False Negatives (FN): Those instances that are classified as negative but actually belong to the positive class are called false negatives. In terms of spam filtering, false negatives are the splogs that are misclassified as non-spam blogs by the classifier. They are a part of the misclassifications.

The above statistics are illustrated in a Venn diagram shown in Figure 5.2. Based on these four statistics, every classifier can be evaluated using precision and recall. Precision is defined as:

Precision = TP / (TP + FP)

and recall is defined as:

Recall = TP / (TP + FN)

If a classification algorithm makes a single prediction which is correct, then the precision is 100% but recall could be low. On the contrary, if the classification algorithm predicts all the instances as splogs, then the recall is 100% but precision could be low. So it is required to compare classification algorithms in terms of both precision and recall. One way is to compute the harmonic mean of precision and recall, which is commonly known as f-measure.

F-measure = (2 × Precision × Recall) / (Precision + Recall)

Let’s look at an example to compute the above-mentioned statistics and the precision, recall, and f-measure of a classification algorithm.

Table 5.5: Confusion matrix for the Example 5.9.

                     Predicted splog   Predicted non-spam   Total
Actual splog                7                   3             10
Actual non-spam             4                   6             10
Total                      11                   9             20

Example 5.9. Refer to the confusion matrix illustrated in Table 5.5. There are actually 10 splog instances and 10 non-spam blog instances, but the classification algorithm predicts 11 splog instances and 9 non-spam blog instances. Let’s compute the precision, recall, and f-measure of this classifier. Here TP=7; TN=6; FP=4; FN=3. Hence,

Precision = 7 / (7 + 4) = 0.64
Recall = 7 / (7 + 3) = 0.70
F-measure = (2 × 0.64 × 0.70) / (0.64 + 0.70) = 0.67
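The same computation expressed as a small sketch, using the counts from Example 5.9:

def precision_recall_f1(tp, tn, fp, fn):
    """Precision, recall, and f-measure from the four classification counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Counts from the confusion matrix in Table 5.5.
p, r, f = precision_recall_f1(tp=7, tn=6, fp=4, fn=3)
print(round(p, 2), round(r, 2), round(f, 2))   # -> 0.64 0.7 0.67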

1Intonation is a linguistic concept that refers to the different meanings conveyed by the different ways of pronunciation of a word [73]. The listener could interpret different meanings based on the prosodic utterance. Intonation is as common in written texts as it is in spoken language.
