CHAPTER 4

Spam Filtering in Blogosphere

Open standards and low barriers to publication have made social media, especially the blogosphere, a perfect “playground” for spammers. Spammers post nonsensical or gibberish text on many blog sites, which not only degrades the quality of search results but also consumes valuable network resources. Such blogs are called spam blogs, or splogs. More formally, a splog can be defined as [58] “a blog created for any deliberate action that is meant to trigger an unjustifiably favorable relevance or importance, considering the blogs’ true value”. As the blogosphere continues to grow, the utility of blog search engines becomes more and more critical. They need to understand the structure of blog sites to identify the relevant content for crawling and indexing and, more importantly, to tackle the challenging issue of splogs.

Spammers take advantage of anonymous or guest access to post content on blog sites. Often, they create fake blogs containing either gibberish text or content hijacked from other blogs or news sources. One main objective behind this type of splog is to host content-based advertisements, which generate revenue whenever visitors accidentally click on them. Spammers also create fake blogs to host link farms, with the purpose of boosting the rank of the sites participating in the farm. Another popular way of spamming blogs is through comments. Spammers use the comment feature of blogs to publicize their own blogs or websites, which are otherwise irrelevant to the original blog post; these are known as spam comments. Comments have made the life of spammers much easier: instead of setting up a complex set of webpages that foster a link farm, spammers write a simple agent that visits random blogs and wikis and leaves comments containing links to the spam blog. Filtering splogs and spam comments would improve search quality; reduce wasted storage space by discarding splogs and spam comments from search engine indexes and crawlers; and reduce wasted network resources.

Often blogs are set up to relay a short message, or ping, to servers whenever they are updated with new content. These servers could be search engines that index the blogs for the most up-to-date results. Spinn3r (http://www.spinn3r.com) is one such search engine that receives these pings from blogs to update its search index. In April 2009, Spinn3r (http://spinn3r.com/spam-prevention) reported receiving nearly 100,000 pings per second from spammers, which formed 93% of the total pings received. This indicates the phenomenal growth of splogs. In a similar report by Akismet (http://akismet.com/stats/), the blogosphere was reported to have 10,085,056,032 total comments by March 2009, out of which only 2,005,536,845 were legitimate, i.e., 80.115% of all comments were spam. The phenomenal growth of spam comments is presented in Figures 4.1(a) and (b). Figure 4.1(a) depicts the daily count of spam and legitimate comments as recorded by Akismet’s servers. Figure 4.1(b) depicts the cumulative, or total, number of spam comments received till March 2009. This clearly shows an exponential growth in the number of spam comments in the blogosphere.


Figure 4.1: Numbers of spam comments tracked by Akismet (http://akismet.com/stats/) till March 2009 (a) daily count, and (b) total or cumulative. Copyright © Automattic, Inc. Used with permission.

Such a high volume of splogs and spam comments presents tremendous challenges for finding legitimate blog posts and for blog search. Blog search engines are faced with overwhelming amounts of blog data, most of which is spam, which exacerbates the problem of efficiently finding relevant and accurate search results. Furthermore, splogs often contain legitimate text scraped from normal blogs or news sources. Such splogs evade most existing spam filtering techniques, which look out for gibberish or random sequences of characters.

A spam filter is a program that keeps the legitimate content in and filters out the illegitimate or spurious content; in this case, the illegitimate content corresponds to splogs. Spam filtering can be considered a classic supervised learning problem with binary class labels. Splogs are treated as the positive class and non-spam blogs as the negative class. A spam filter is thus a classifier trained to predict the class label of a blog as positive (splog) or negative (non-spam blog). A correct classification belongs to one of two categories: true positive (a splog correctly classified as a splog) and true negative (a non-spam blog correctly classified as a non-spam blog). A misclassification likewise belongs to one of two categories: false positive (a non-spam blog classified as a splog) and false negative (a splog classified as a non-spam blog). This is discussed in more detail in Chapter 5. Classification errors can be disastrous: while false positives can seriously affect a blogger’s recognition and reputation, false negatives hurt the search engine’s efficiency and accuracy. So there is a pressing need to reduce and balance the classification errors.
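The four outcomes above can be tallied directly from a classifier's predictions. A minimal sketch in Python (the function name and list encoding are illustrative, not from the source):

```python
# A minimal sketch of the four classification outcomes for a splog filter.
# Labels: True = splog (positive class), False = non-spam blog (negative class).

def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for a batch of blogs."""
    tp = sum(a and p for a, p in zip(actual, predicted))          # splog caught
    tn = sum(not a and not p for a, p in zip(actual, predicted))  # normal blog passed
    fp = sum(not a and p for a, p in zip(actual, predicted))      # normal blog flagged
    fn = sum(a and not p for a, p in zip(actual, predicted))      # splog missed
    return tp, tn, fp, fn

actual    = [True, True, False, False, False]
predicted = [True, False, False, True, False]
print(confusion_counts(actual, predicted))  # (1, 2, 1, 1): one splog missed, one blog wrongly flagged
```

Balancing the errors then amounts to trading fp against fn, e.g., by tuning the classifier's decision threshold.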

There has been a considerable amount of work dealing with web spam [59]. Let’s look at the differences between splogs and typical web spam, and examine why the approaches widely studied and used in that domain fall short in the blogosphere. First and foremost, the ease of use and open standards of the blogosphere make it much easier for spammers to generate splogs and spam comments than web spam. Spammers write agents that create splogs or visit popular blogs randomly and leave spam comments, whereas for web spam they have to create a complex set of webpages that link to the spam webpage. Ordinary webpages do not offer an option for leaving comments, so web spam does not involve spam comments. Second, the dynamic environment and phenomenal growth of the blogosphere defeat spam filters that look out for specific keywords, since splogs keep evolving and often contain seemingly legitimate but irrelevant text. Webpages are relatively static compared to blogs, so their content does not evolve as much. Third, due to the casual environment of the blogosphere, bloggers often use colloquial language, slang, misspellings, and/or acronyms in their blogs. Traditional web spam filters might mistake such “ham” for spam, leading to many false positives. These differences warrant a special treatment for splogs and spam comments, which we discuss next. In this chapter, we look at some specialized spam filtering techniques that remove splogs and spam comments using network (link) and/or content information.

4.1 GRAPH BASED APPROACH

The blogosphere can be represented as a network or graph with blogs as nodes and hyperlinks as edges. We looked at an example of such a blog network in Chapter 1 in Figure 1.1(b). Given the blog network, we can study various statistics like degree distribution, clustering coefficient, and many others, as mentioned in Chapter 1. It has been shown that these statistics are considerably different for splogs and legitimate blogs [60]. These differences can be leveraged to identify splogs. For instance, the indegree of a splog does not follow the densification law, i.e., over time the growth in inlinks to a splog shows a sudden decrease [61]. Such indicators can empirically help identify splogs.

A more sophisticated measure to identify splogs leverages the correlation between the increase in a blog’s indegree and the blog’s popularity on a search engine. The assumption here is that if a blog returned among the top search results by a search engine is legitimate, then it will likely attract more inlinks. If the blog is a splog that somehow managed to get into the top results through link farming or other tactics, there will be little or no increase in its inlinks.

The above-mentioned statistics can be studied at a finer granularity by considering the blog post network, as shown in Figure 1.1(a) of Chapter 1. Each blog post is assigned a spam score based on the obtained statistics, and the blog is then treated as the sequence of spam scores of its constituent posts. These spam scores are aggregated using any of the available merge functions, such as average, max, or min. To counter the evolving tactics of spammers, we can retain, or place higher weights on, the spam scores of the more recent blog posts. The aggregated post scores are combined with the blog-level statistics above for a more reliable splog filtering algorithm. However, blogs are usually sparsely linked, which presents challenges to these degree-based approaches. Some approaches attempt to densify the link structure by inferring implicit links based on content similarity, but more work is needed to address the information quality challenges created by the blogosphere’s casual environment.
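The recency-weighted aggregation of post spam scores can be sketched as follows. This is one illustrative merge function (exponential decay toward older posts); the source does not prescribe a specific weighting scheme, and the function name and decay parameter are assumptions:

```python
def blog_spam_score(post_scores, decay=0.8):
    """Combine per-post spam scores into one blog-level score,
    weighting recent posts more heavily via exponential decay.

    post_scores: list ordered oldest -> newest, each score in [0, 1].
    decay:       weight multiplier applied per step back in time.
    """
    n = len(post_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # newest post gets weight 1.0
    return sum(w * s for w, s in zip(weights, post_scores)) / sum(weights)

# A blog whose recent posts look spammy scores higher than the plain average:
print(blog_spam_score([0.1, 0.2, 0.9]))  # weighted toward the recent 0.9
```

Setting decay=1.0 recovers the plain average, while smaller values make the filter react faster to a spammer's change in tactics.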


Figure 4.2: Some common network motifs of a blogger triad that accounted for spam comments.

As mentioned before, the blogosphere suffers not only from splogs but also from spam comments. Spammers exploit the unique feature of blogs that allows comments to be left anonymously to gain unfair advantages. They can post links in their comments pointing to their own blogs, irrelevant to the discussion, creating link spam. Their intent is to advertise or to create link farms pointing to splogs, thereby promoting their blogs or webpages on search engines. Since comments usually do not contain much text, content-based spam filtering approaches are likely to perform poorly at spam comment identification, producing both false positives and false negatives. A promising approach to identifying spam comments, proposed in [62], generates a blogger network based on the bloggers’ commenting behavior. Network motifs that characterize spam commenting behavior are then identified and used to distinguish spam bloggers from regular bloggers.

A blogger network based on commenting behavior can be generated using the following procedure: consider two bloggers bi and bj. If bj posts a comment on bi’s blog post that contains a link pointing to bj’s own blog post or blog, then an edge from bi to bj is constructed. This network can be a weighted graph, where the weight on an edge denotes the number of such comments. The study was performed on the Persian blog portal PersianBlog (http://www.persianblog.com). The blogs, along with their blog posts and comments, were crawled, and 700 of the comments were annotated manually by 4 human evaluators, with an inter-annotator agreement of 0.96. The comments were annotated as “positive”, “negative”, “neutral”, or “spam”. Positive comments encourage or support the blogger’s views in the blog post. Negative comments oppose the views expressed in the blog post. Neutral comments have no special sentiment towards the blog post. Spam comments are irrelevant comments with an invitation to the commenter’s blog, blog post, or webpage. The first three classes of comments correspond to non-spam comments.
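The network construction procedure above can be sketched in a few lines. The input representation (tuples of post author, commenter, and a flag for whether the comment links back to the commenter's own blog) is an assumption for illustration:

```python
from collections import defaultdict

def build_comment_graph(comments):
    """Build the weighted blogger network described in the text.

    comments: iterable of (post_author, commenter, has_self_link) tuples.
    An edge post_author -> commenter is added (weight incremented) when
    the comment carries a link back to the commenter's own blog or post.
    """
    graph = defaultdict(int)
    for post_author, commenter, has_self_link in comments:
        if has_self_link and post_author != commenter:
            graph[(post_author, commenter)] += 1  # weight = number of such comments
    return dict(graph)

# bj left two self-linking comments on bi's posts; the third comment has no link:
print(build_comment_graph([("bi", "bj", True), ("bi", "bj", True), ("bi", "bk", False)]))
```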

After constructing the blogger network as described above, triads of bloggers and their commenting behavior were observed. The most common triads that accounted for spam comments are depicted in Figure 4.2. The dark edges denote spam comments and the lighter edges denote non-spam or regular comments. Based on this motif, a blogger bi is highly likely (in nearly 74% of cases) to be a spammer if:

1. bi places comments on two other bloggers, bj and bk,

2. bj and bk never place comments on each other, and

3. bj and bk never place comments on bi.

By studying these triads, we can determine if bi with the relationships to bj and bk is a spammer or not. The computation of triads can be done very efficiently.
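The three conditions above can be checked directly on the comment network. A minimal sketch in Python, assuming the network is stored as a set of directed edges (x, y) with the direction defined earlier, i.e., (x, y) means blogger y commented on x's post with a link back to y's own blog:

```python
def is_suspected_spammer(edges, bi, bj, bk):
    """Test the spam motif on a blogger comment network.

    edges: set of (post_author, commenter) pairs; (x, y) means y
    commented on x's post with a self-link.
    Returns True when bi comments on both bj and bk while bj and bk
    neither comment on each other nor back on bi.
    """
    commented = lambda on, by: (on, by) in edges
    return (commented(bj, bi) and commented(bk, bi)             # 1. bi comments on bj and bk
            and not commented(bj, bk) and not commented(bk, bj)  # 2. no comments between bj and bk
            and not commented(bi, bj) and not commented(bi, bk)) # 3. neither comments back on bi

edges = {("bj", "bi"), ("bk", "bi")}
print(is_suspected_spammer(edges, "bi", "bj", "bk"))  # True: motif matched
```

Since each check is a constant number of set lookups, enumerating candidate triads around each blogger is cheap, consistent with the efficiency claim above.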

4.2 CONTENT BASED APPROACH

Content-based splog filtering approaches treat the identification of spam blogs as a binary classification task. Each blog is assigned one of two labels: spam or ham. Given an annotated dataset, a classifier is learned using a portion of the dataset for training; the remaining portion is used as the test set to evaluate the efficiency and accuracy of the splog filtering algorithm. A blog post can be broken into various blocks: user-assigned tags, post content, comments, post title, and post hyperlinks with anchor text. A separate classifier can be trained for each block. However, this is very similar to the web spam filtering approach proposed in [63], and due to the differences between web spam and blog spam mentioned earlier in this chapter, existing web spam filtering techniques do not perform well. We show how to exploit these differences and the unique features of splogs to efficiently distinguish splogs from regular blogs.

Usually splogs are machine-generated pieces of text, either random sequences of characters that look like gibberish or blocks of legitimate text scraped from regular blogs or news sources. Content-based splog filtering approaches like [58] leverage this observation to distinguish splogs from regular blogs. It has been observed that blog posts in spam blogs exhibit self-similarity over time in terms of content, links, and/or posting times: they contain repetitive patterns in post content, affiliated links, and posting time. This peculiarity of splogs is exploited to train a splog classifier.

Self-similarity between posts i and j of a blog is defined as Sα(i, j), where α is one of three attributes: content, link, and time. Each blog post i is represented as a tf-idf vector vi containing the top-M terms. The content self-similarity between i and j is the cosine similarity of the two vectors:

Sc(i, j) = (vi · vj) / (‖vi‖ ‖vj‖)

The link self-similarity Sl(i, j) is computed in the same way, except that the tf-idf vector is constructed not from terms but from the domain names of the hyperlinks in the posts. The time self-similarity St(i, j) is computed from the difference in posting times as follows:

St(i, j) = 1 − ((ti − tj) mod δday) / δday

where ti and tj are the posting times of blog posts i and j, and δday denotes the number of time units in a day. The unit is determined by the granularity of the post timestamps; if timestamps are recorded down to the second, the unit is seconds.
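The content and time self-similarity measures described in the text can be sketched as follows. The tf-idf vectors are represented as term-to-weight dicts, and the time measure uses the mod-day difference of timestamps in seconds; both representations are illustrative choices, not prescribed by the source:

```python
import math

def content_self_similarity(v_i, v_j):
    """Cosine similarity between the tf-idf vectors (dicts term -> weight)
    of two posts; values near 1.0 indicate near-duplicate content."""
    dot = sum(w * v_j.get(t, 0.0) for t, w in v_i.items())
    norm = (math.sqrt(sum(w * w for w in v_i.values()))
            * math.sqrt(sum(w * w for w in v_j.values())))
    return dot / norm if norm else 0.0

def time_self_similarity(t_i, t_j, delta_day=86400):
    """Similarity of posting times modulo one day (units: seconds here).
    Posts published at the same time of day, even days apart, score 1.0."""
    return 1.0 - ((t_i - t_j) % delta_day) / delta_day

# Two posts at 1:00 a.m. on consecutive days are maximally time-similar:
print(time_self_similarity(3600, 90000))  # 1.0
```

The link self-similarity is the same cosine computation applied to vectors built from hyperlink domain names instead of terms.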

For a blog with N posts, the self-similarity matrix is of size N × N. The blog posts are arranged in increasing order of time, so post i is published before post i + 1. The self-similarity matrices thus obtained reveal interesting characteristics of splogs:

Content: Splogs often show unusual topic stability, while normal bloggers tend to drift across topics over time. This is because posts in splogs often contain repetitive blocks of text copied from other blogs or news sources.

Publication Frequencies: Normal bloggers publish posts during a few regular time periods, such as morning or night; they do not post throughout the day. Splogs, in contrast, are generated by bots and are published either all at the same time or uniformly distributed throughout the day.

Link Patterns: Normal bloggers use different links as they blog on various topics, whereas posts in spam blogs often reuse the same link or set of links throughout.

These observations could be used to differentiate between splogs and regular blogs.

Another observation helpful for filtering spam comments is language model disagreement. Spam comments usually do not link to semantically related blog posts or webpages. This divergence in language models is exploited to effectively separate spam comments from non-spam comments [64]. Given a blog post and its comments, a language model is created for each. The language model of each comment (Θ1) is compared with the language model of the blog post (Θ2) using the KL-divergence:

KL(Θ1 ‖ Θ2) = Σw p(w | Θ1) log ( p(w | Θ1) / p(w | Θ2) )

The KL-divergence measures the difference between the two distributions, here the language model of the comment and that of the blog post. If the KL-divergence is below a threshold, the comment is considered a regular, non-spam comment; otherwise, it is treated as a spam comment.
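The comparison can be sketched with smoothed unigram language models. The additive smoothing, whitespace tokenization, and epsilon value below are illustrative choices; the source does not specify the estimation details:

```python
import math
from collections import Counter

def kl_divergence(comment_text, post_text, eps=1e-6):
    """KL(comment || post) between smoothed unigram language models.
    A large value suggests the comment diverges from the post's topic,
    i.e., it is likely spam. Additive smoothing avoids zero probabilities."""
    c = Counter(comment_text.lower().split())
    p = Counter(post_text.lower().split())
    vocab = set(c) | set(p)

    def prob(counts, w):
        return (counts[w] + eps) / (sum(counts.values()) + eps * len(vocab))

    return sum(prob(c, w) * math.log(prob(c, w) / prob(p, w)) for w in vocab)
```

An off-topic comment yields a much larger divergence than an on-topic one, so a single threshold on this value implements the classification rule described above.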

Blog comments are often very short, and the same can be true of some blog posts. This results in very sparse language models with few words. To achieve a more accurate estimation of the language models of both the blog post and the comment, the links appearing in them can be followed and the content found there added to the existing models. These links can be followed to a certain depth, adding more content and further enriching the language models; however, this also risks topic drift and hence language model drift. For blog posts, inlinks to the post can also be followed to enrich the post’s language model.

Table 4.1: Specialized features derived from the content appearing on the blogs.

 No.  Feature Description
  1.  Location Entity Ratio
  2.  Person Entity Ratio
  3.  Organization Entity Ratio
  4.  Male Pronoun Entity Ratio
  5.  Female Pronoun Entity Ratio
  6.  Text Compression Ratio
  7.  URL Compression Ratio
  8.  Anchor Compression Ratio
  9.  All URLs Character Size Ratio
 10.  All Anchors Character Size Ratio
 11.  Hyphens Compared with Number of URLs
 12.  Unique URLs by All URLs
 13.  Unique Anchors by All Anchors

A significant advantage of content-based splog filtering approaches is that they do not suffer from the sparse link structure of the blogosphere. This is also why existing link-based web spam filtering approaches [59, 65] do not perform well in the blogosphere.

4.3 HYBRID APPROACH

Having looked at the qualities of the individual content-based and link-based approaches, we now study a hybrid approach that combines both, as proposed in [66]. First, a seed set of blogs is classified into splogs and regular blogs using the content-based approach; then the link-based approach is used to expand the seed set.

The content-based approach constructs a feature vector from the text of the blog using a bag-of-words representation. Other options for constructing the feature vector are tf-idf encoding and n-grams. Additional features are the tokenized anchors and tokenized URLs appearing in the blog posts. Further specialized features derived from the content of blogs are presented in Table 4.1. Features 1-5 leverage the fact that splogs mention many named entities, such as names of places, people, and organizations. Features 6-8, 12, and 13 leverage the observation that splogs contain many repetitive blocks of text and repeated URLs and anchor text. Features 9-10 take advantage of the observation that splogs contain an unusually high number of links and anchors. Feature 11 exploits the observation that splogs contain many URLs with hyphens. These features are used in conjunction with the other previously mentioned features to train a support vector machine (SVM) classifier with a linear kernel, which performs better than polynomial and RBF kernels. Also, bag of words outperforms tf-idf encoding and n-grams in terms of classification accuracy.
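The compression-ratio features (nos. 6-8 in Table 4.1) have a simple intuition: repetitive, machine-generated text compresses far better than varied human writing. A plausible implementation of the Text Compression Ratio using Python's standard zlib (the exact computation in [66] may differ):

```python
import zlib

def compression_ratio(text):
    """Text Compression Ratio feature: compressed size / raw size.
    Repetitive splog text compresses well, giving a low ratio, which
    serves as a spam signal; varied human writing stays close to 1.0."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw) if raw else 1.0

# A splog-like post repeating the same phrase compresses dramatically:
splog_like = "buy cheap pills online now " * 50
human_like = "This chapter surveys graph and content based splog filtering techniques."
print(compression_ratio(splog_like) < compression_ratio(human_like))  # True
```

The URL and Anchor Compression Ratios apply the same computation to the concatenated URLs and anchor texts of a blog, respectively.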

Once the seed set of blogs is classified, the graph-based approach is used to expand it. Looking at the hyperlink graph of the blogs and assuming that “regular blogs do not link to splogs”, all blogs linked to by regular blogs are classified as regular. After a blog is classified as regular, it is added to the seed set, and the process continues until no more unclassified blogs are left. The hybrid approach shows how the content-based and graph-based approaches can be tied together to achieve splog identification.
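The expansion step is a standard label propagation over the hyperlink graph and can be sketched as a breadth-first traversal (the adjacency-dict representation is an assumption for illustration):

```python
from collections import deque

def expand_regular_set(out_links, seed_regular):
    """Propagate 'regular' labels over the hyperlink graph under the
    assumption that regular blogs do not link to splogs: any blog
    linked to by a regular blog is itself labeled regular.

    out_links:    dict mapping each blog to the list of blogs it links to.
    seed_regular: blogs classified as regular by the content-based step.
    """
    regular = set(seed_regular)
    queue = deque(seed_regular)
    while queue:
        blog = queue.popleft()
        for target in out_links.get(blog, ()):
            if target not in regular:
                regular.add(target)
                queue.append(target)
    return regular

# Blog "a" was classified regular by the content-based step; its links spread the label:
print(expand_regular_set({"a": ["b"], "b": ["c"], "x": ["y"]}, {"a"}))
```

Blogs never reached by the propagation (like "x" and "y" here) remain unclassified and would need the content-based classifier, or a further round of seeding, to be labeled.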
