CHAPTER 2

Blog Clustering and Community Discovery

Chapter 1 discussed the phenomenal growth of the blogosphere. According to the yearly reports published by Technorati, Blogosphere has a growth rate of 100% for every 5 months consistently for the last 4 years. With overwhelming amount of data, it is really difficult to keep track of the state of the blogosphere. Tools are needed to help users organize and make sense of the blogs and blog posts in the blogosphere. Clusters of similar blogs help identify the topics to gain a high-level understanding of the blogosphere. The reader can focus on the cluster he is interested in. The automatic organization of the blogs can also benefit search engines. For example, organized content in the blogosphere can improve indexing mechanisms of the search engines. Clustering can also help improve the delivery of the search results by organizing similar results together. Instead of laying out all the results as a list, similar results can be clustered together and the most dominant topic among the cluster members can be assigned as the label for the cluster. This greatly enhances the navigability of search results. This concept has been studied in webpages, e.g., Clusty (http://www.clusty.com), but remains to be explored in the blogosphere.

Clustering can be used to suggest related content to other cluster members, hence improving recommendation engines. A blogger would be interested to find other bloggers who share similar interests. Clustering of blogs would put all such bloggers in one group who could share, collaborate, and discuss their views, opinions, or perspectives over products, policies, events, or government actions. These blog clusters can also be treated as communities or focussed groups. However, this approach of community discovery is purely based on the content similarity. Nevertheless, bloggers can explicitly specify connections to other fellow bloggers via hyperlinks that connect their blogs or blog posts (as also shown in Figure 1.1), or specify their friends using social network capabilities available at some blog sites. This creates a blog or a blogger network with nodes depicting the blogs or the bloggers and edges indicating the connections among them. One way to find communities is to use the principle of homophily, which means that two people tend to communicate more often if they share similar views. Using this phenomenon, those bloggers who have more edges within themselves can be considered as a community. Identifying a set of bloggers that communicate more often among them implies that they share similar views, opinions, or interests; hence they form a community. Obviously, this approach to community discovery is purely based on the network information. It can be extended to a tool for advertisers and marketeers, for whom a global view of likes, dislikes, and interests of groups of bloggers matter.

The Web and the blogosphere are often compared, and the existing approaches in webpage clustering or web community discovery are explored for their use in similar tasks in the blogosphere. But there is a key difference between the blogosphere and the Web in terms of the lifetime of their contents (pages and links) posted on both media. Blogs are mainly used as a tool for timely communication. So the entries in the blogs are often short-lived. Most of them become obsolete and are never referred later. Thus the links in the blogs have significant temporal locality. However, in the Web, new pages may refer to very old pages (e.g., an authoritative webpage) creating a longer lifetime of content in the Web. We may also aggregate the links in the blogs over time, but miss the key temporal dynamics observed in the blogosphere. Bloggers’ interests shift over time. For instance, a blogger is initially interested in “politics” so he interacts more often with fellow bloggers who are also interested in “politics”, forming a community. Later his (her) interest shifts to “economics” so he/she shifts the interactions with the bloggers who are also interested in “economics”. Now if all the interactions are aggregated over time, we would lose this temporal dynamics in the interaction patterns and community evolutions. Based on this key difference between the Web and the blogosphere, conventional webpage clustering or community discovery algorithms do not work well in the blogosphere domain. In this chapter, we will explore some of the techniques that focus on the dynamics of interactions or the evolution of blog communities.

Since blogs are a collection of blog posts, it is often a moot point whether to cluster blog posts and extract blog clusters. However, this approach suffers from the following key challenges:

1. Content perspective: Often bloggers discuss various things in different blog posts ranging from daily routines to their opinions on a matter. It is extremely hard to identify the main theme of the blog through a single blog post. However, considering a collection of blog posts from the same blog lets us glean the blog’s main theme or the blogger’s main interests.

2. Network perspective: Due to the casual nature of the blogosphere not many bloggers cite the original source that inspired them to write their blog post, or if they reused content from some other blog or news source. This leads to an extremely sparse blog post network. However, considering a collection of blog posts from the same blog forms a critical mass of such links and the network becomes less sparse.

As mentioned earlier, blogs are associated with content and network information, both of which are extremely useful for clustering and community discovery. Although both content and network information can be used to identify blog clusters or blog communities, it is more prevalent to leverage content information to identify blog clusters and network information to identify blog communities. We illustrate this observation in Figure 2.1. However, this does not stop content information to be used in conjunction with the network information to identify blog communities or network information in conjunction with content information to identify blog clusters. In this chapter, we will first discuss network based and content based approaches to perform clustering and community discovery, and then we will discuss hybrid approaches leveraging both content as well as network information.

Images

Figure 2.1: Difference and similarities between blog clustering and blog community discovery tasks.

2.1 GRAPH BASED APPROACH

Given a blog network it is natural to represent it as a graph with blogs as nodes and hyperlinks between the blogs as edges. An illustration of the blog network is shown in Figure 1.1(b). The nodes of this graph are the blogs and the edges are the hyperlinks. As mentioned earlier, blogs interact more often with other blogs that share similar opinions or ideas. This leads to more hyperlinks or edges within few blogs. A graph based approach tries to identify such sets of blogs with more links within. These sets of blogs are referred as communities. However, given the sparse nature of hyperlink graph in the blogosphere, often it is challenging to identify communities through the hyperlink graph. Later in this section, we will discuss ways to infer blog network.

Let us assume that the blog graph is represented by G = (V, E), where V is the set of vertices or the blogs in the blog network and E is the set of edges between the blogs. These edges denote the hyperlinks between the blogs. An adjacency matrix W can be used to represent the edges. Note that W can be either a binary matrix denoting the presence or absence of an edge between two blogs or W could be a weighted adjacency matrix denoting the number of links between two blogs as shown in Figure 1.1(b). The weight of the edge between blog vi and vj is given by W(i, j). In case of a weighted adjacency matrix, the edge weights are normalized between 0 and 1.

Given the adjacency matrix W, we compute the graph Laplacian (L) of W as the basic step for spectral clustering. The graph Laplacian is computed as:

Images

where D is the diagonal matrix consisting of entries d1, d2, . . . , dn. Each di denotes the degree of the node vi and is computed as the row-sum of i-th row of W,

Images

Here n is the number of blogs in the blog graph. Note that this definition of degree is very similar to the definition of degree for undirected graph presented in Chapter 1 in Eq 1.1.

Next, we compute the eigenvalue decomposition of the matrix L and pick the first k eigenvectors e1, e2, . . . , ek such that the corresponding eigenvalues are in increasing order. Each eigenvector is n × 1. The k eigenvectors are juxtaposed to construct a matrix of n × k dimension. Each row of this matrix represents a blog in k-dimensional space. We can run k-means clustering algorithm to compute k clusters. This procedure of graph partitioning through spectral clustering is shown in Algorithm 1. Note that L is not normalized in this algorithm. However, L can be normalized to norm 1 before computing the clusters [13, 14].

Input   : Adjacency matrix: W,

Number of clusters: k

Output: k clusters of n nodes/blogs in the blog graph

1 Compute the diagonal matrix, D;

2 Compute the graph laplacian, L = DW;

3 Compute the first k eigenvectors, e1, e2, . . . , ek of L;

4 Juxtapose these eigenvectors to construct a n × k matrix;

5 Compute k clusters using k-means algorithm on this matrix;

Algorithm 1: Algorithm for spectral clustering.

Example 2.1. Lets consider an example to illustrate the functioning of spectral clustering algorithm presented in Algorithm 1. Consider a blog network presented in Figure 2.2(a). For simplicity, we consider an unweighted blog network. Each edge in the network assumes an equal weight of 1. Based on this blog network, we compute a binary adjacency matrix, W, as shown in Figure 2.2(b). Now using Eq. 2.2, we compute the matrix D, as shown in Figure 2.2(c). Graph Laplacian L of matrix W is computed using Eq. 2.1, as shown in Figure 2.2(d). Eigenvalue decomposition of L is performed and first two eigenvectors are shown in Figure 2.2(e). Each blog in the blog network is represented in terms of the first two eigenvectors and visualized in Figure 2.2(f). Based on this visualization, it can be observed that blogs {B1, B2, B3, B4} and {B5, B6, B7, B8, B9} form two separate communities. Similar communities were obtained using k-means algorithm.

Spectral clustering on graphs generate clusters such that the edges between nodes belonging to different clusters have low weight (or, more dissimilar) and edges between nodes belonging to the same cluster have high weight (or, more similar). This is also the optimization criterion for graph-cut based clustering approaches. RatioCut [15] and Ncut [14] are two variations for graph-cut approach that circumvents the problem of single-member clusters, which the conventional graph-cut approach suffers from. They circumvent the problem by considering the size of the clusters (in terms of number of cluster members-RatioCut, or total weight of edges in the cluster-Ncut) in the objective function. It can be easily shown that spectral clustering based approach can be derived as an approximation to graph-cut approaches such as RatioCut and Ncut [16]. The groups of nodes that are obtained using any of these techniques correspond to the communities. the derived communities of blogs could also be treated as blog clusters.

Images

Images

Figure 2.2: A running example of spectral clustering algorithm on a toy dataset.

As mentioned earlier, a blog graph could be extremely sparse due to the casual nature of bloggers. In such a scenario edges between blogs could be derived using content similarity. Given the similarity or distance matrix Q between blogs, there are several ways to construct W.

Images-neighborhood graph: All those nodes whose pairwise distances are smaller than Images are connected in this scheme. Specifically, if Q is the pairwise distance matrix between blogs, then two blogs vi and vj are connected by an edge if Q(i, j) < Images. The weight of the edge is same as the distance from the matrix Q. Note that Images-neighborhood graph can be easily generated for similarity matrix as well. If Qs is the similarity matrix between blogs then, we make an edge between two blogs vi and vj if Qs (i, j) > Images.

k-nearest neighbor graph: k blogs that are nearest to vi are connected to vi. Specifically, if a blog vj is among the k nearest neighbors of vi, then vj is connected to vi. This means that there will be an edge between vi and vj with the weight W(i, j) equal to the similarity between vi and vj. Note that this is an asymmetric graph because the “nearest” relationship is asymmetric. vj might be nearest to vi but vi might not be nearest to vj. There could be a node k, which is nearest to j. This leads to a directed graph. One way to make this graph undirected is to drop the directions and treat each edge as bi-directional. Another way to make it undirected is to consider only those edges that connect mutually nearest neighbor nodes/blogs. This means that two blogs, vi and vj, will be connected with an edge if both of them are among the k-nearest neighbors of each other.

Fully connected graph: All the blogs are connected with each other with the edge weights equal to the similarity between the blogs. Specifically, blogs i and j are connected with an edge of weight W(i, j) = Q(i, j).

Blogosphere is highly dynamic in terms of the content and interests of the bloggers. Aggregating the interactions of the bloggers over time would miss the dynamics in the interaction patterns and community evolution. Authors in [17] explore the temporal aspects of spectral clustering. The objective function encodes a regularization parameter (also known as the temporal cost) that enforces temporal smoothness. Authors propose two cost functions. One cost function aims to preserve cluster quality by applying the current partition to historical data and the resulting cluster quality determines the temporal cost. The second cost function aims to preserve cluster membership by comparing the current partition with the historical partition and the resulting differences determines the temporal cost. Associating the temporal cost in the objective function ensures that the spectral clustering approach is stable and less sensitive to short-term variations or noise, while at the same time being adaptive to long-term interest drift or community evolution.

The authors in [18] present a novel incremental spectral clustering framework for studying community evolution. Whenever there is a change in the matrix, W, L and D are incrementally updated. Similarly, the eigenvectors and eigenvalues are incrementally updated. First, k smallest updated eigenvectors are picked and k clusters are computed using the k-means algorithm. Another approach to handle the dynamics of blog communication while extracting communities involves a tensor based approach [19]. The blog network is represented as a link matrix Wi at timestamp i. Various such link matrices at different timestamps are stacked together to construct a tensor. The nonnegative matrix factorization of the data tensor results in the community discovery in the dynamic blog network.

2.2 CONTENT BASED APPROACH

Blogosphere has an overwhelming amount of content besides the network or link information. Bloggers express their opinions, share their views, discuss various products, policies, etc. This creates a humongous pool of intelligently crafted content, presenting ample opportunities to explore content based clustering approaches. Content based clustering on Blogosphere has also been studied from text or webpage clustering perspective. However, the inherent differences between webpages and blogs demand novel approaches geared specifically towards blogs. suffer from the sparse link structure prevalent in the blogosphere.

There is a constant stream of data in blogs that creates a highly dynamic environment. Each blog post is timestamped and appears in reverse-chronological order. Blog posts that are significantly distant in time could be very different in terms of content or the central theme. Precisely, this is because blogs are highly motivated by the events happening at that time. This is very different from the webpages. Most webpages are static in terms of content and act as a source of information. They could be updated at times but the theme remains pretty much the same. This difference makes huge impact in clustering approaches. Because of such dynamics, blog posts from a blog cannot be aggregated over time and treated as a big text document or a single webpage for clustering. Doing so will risk the loss of temporal information in the content of blogs and eventually fail to discover community evolution.

Another significant challenge in clustering blogs has to do with the colloquial usage of the language. Given the casual nature of the blogosphere, it is quite natural to observe a lot of misspellings, acronyms, and slangs in the blog content. This presents a huge information quality barrier. Under normal circumstances, such words would be considered as noise and jettisoned. However, in blogs, these words could be as informative as any other words. So one has to be conservative and retain such words. Nevertheless, content based blog clustering suffers from typical text clustering problems, i.e., high-dimensionality and sparsity. We need to pre-process the data in order to remove the noise yet retain informative but seemingly noisy words. Next, we present some pre-processing steps that can specifically help improve information quality:

• Resolve acronyms and casual usage of language using the Urban Dictionary (http://www.urbandictionary.com/). It is a slang dictionary edited by the people and contains 3,873,472 definitions (by the time of writing this book) of the slangs since 1999.

• Remove all the non-content bearing stopwords like “a”, “an”, “the”, etc. A list of stopwords can be found at
http://www.dcs.gla.ac.uk/idom/ir_resources
/linguistic_utils/stop_words

• Stem the words to retain the roots and discard common endings. This can be achieved by using porter stemmer tool (http://tartarus.org/$sim$martin/PorterStemmer/).

• Rank the words based on their tfidf scores. Pick top-m words where m could be determined experimentally. This would essentially select distinctive words, i.e., those words that are neither too frequent in all the blog posts nor too infrequent. Note that a tfidf score consists of two parts: tf and idf,

Images

where di is the frequency of term i in blog post d and the denominator is the number of occurrences of all terms.

Images

where d is the total number of blog posts and ti is the number of blog posts that contain term i.

Images

A tfidf score is normalized between “0” and “1”.

• Represent each blog post by a document-term vector also known as the vector-space model. Each row is a blog post and each column in this vector represents a term and the value is the tfidf score for that particular blog post.

Such a document-term vector representation is extremely sparse, and it gets even sparser since the individual blog posts are considered independent entities. Sparsity and high-dimensionality leads into various problems like computing the distance between different vectors efficiently. So a common solution to this problem is to transform the document-term vector representation to a concept space vector model using Latent Semantic Analysis (LSA).

Latent semantic analysis computes the singular value decomposition (SVD) of the document-term vector. SVD decomposes the document-term matrix into U, Σ, V matrices as follows:

Images

where AT is the document-term matrix and V is the document-concept matrix. Matrix V is the reduced concept space representation. This solves the problems that arise due to high-dimensionality and sparsity.

Example 2.2. To give a flavor of how latent semantic analysis creates concept vectors, we consider a hypothetical scenario where blog posts primarily discuss about stock, investing, cars, trucks, books, and harry potter. Performing LSA would result in linear combination of certain terms mentioned here into a single dimension reflecting something known as a concept. For example, the resulting dimensions might look like,

{(stock), (investing), (cars), (trucks), (books), (harry potter)} → {(1.1476 × stock + 0.3498 × investing), (1.3452 × cars + 0.2828 × trucks), (1.2974 × books + 0.7391 × harry potter)}.

In this hypothetical scenario, (1.1476 × stock + 0.3498 × investing) component could be interpreted as “stock market”, (1.3452 × car + 0.2828 × truck) component could be interpreted as “vehicle”, and (1.2974 × books + 0.7391 × harry potter) component could be interpreted as “story books”. Note that these weights are hypothetical.

 

Each row in V represents a blog post in the reduced concept space transformation. To compute the distance/similarity between any two pair of blogs, their respective vectors are considered and either cosine similarity is computed or euclidean distance is computed. This similarity or distance matrix is fed into various clustering algorithms like k-means or hierarchical to obtain clusters of blog posts. Most recent blog posts are considered for computing blog post clusters. Clusters of blogs are derived by observing the cluster membership of the blog posts belonging to that blog. This ensures that the dynamics and community evolution are preserved. It has been shown that cosine similarity gives better results for clustering as compared to euclidean distance when dealing with sparse vectors. This version of k-means clustering that uses cosine similarity instead of Euclidean distance to compute the similarity between two data instances, (blog posts in this case) is known as “spherical k-means”. have applied spherical k-means to blog clustering problem. Another work by [20] segments the

Blog posts can be segmented into various entities such as title, body or blog post content, and comments [20]. These entities are represented in separate document-term vectors. A single document-term vector is constructed from different entities using a weighted linear combination. Placing higher weights for comments gives best clustering results. Various clustering algorithms have been compared in context of blog clustering and hierarchical clustering has been reported as the best algorithm for content based clustering of blogs [21].

Another way to tap the dynamism of the blogosphere for clustering the blogs is by leveraging the collective wisdom. Collective wisdom, a concept studied by psychologists, sociologists and in philosophy, refers to the shared knowledge arrived at by individuals and groups that is used to solve problems. Here the collective wisdom of bloggers is used in clustering the blogs. Bloggers provide annotations or tags for their blogs, assuming that they annotate their blogs with closest possible or most relevant tags. In this work, authors leverage the bloggers’ wisdom collectively to develop a knowledge source, which is also represented as a label relation graph [22]. The nodes in this graph are the tags or the labels used by the bloggers to annotate their blogs. The edges between two tags (ti and tj) denote if they were used simultaneously, and the weights on the edges of the label relation graph denote the number of bloggers that labeled their blogs with both the tags ti and tj. This helps in discovering many interesting, complex, and latent relations between tags. A snippet of the label relation graph is shown in Figure 2.3. Note that the relations in the label relation graph are highly dynamic since they change with the change in usage of these tags by the bloggers. This collective wisdom based knowledge provides a naïve similarity matrix between blogs based on the tags. Such a similarity matrix is augmented by using the content of the blogs and a more robust and stable clustering approach is developed using the techniques mentioned earlier.

Images

Figure 2.3: An instance of Label Relation Graph.

2.3 HYBRID APPROACH

In the previouse sections, we looked at approaches that either focused on content or the graph structure to compute blog clusters and/or communities. However, clustering purely based on content misses out a whole lot of information that is encoded in the blog graph or blog network. Similarly, simply using network information for identifying blog communities misses out the valuable content information. In this section, we look at an approach that leverages both content as well as network information of the blogosphere to compute clusters and/or communities.

The essential assumption behind using content as well as network information for blog clustering or community discovery is that: a set of blogs that link more often with the blogs in this set than with those that are outside and they share similar content, reflect tighter communities or blog clusters [23]. As discussed in Section 2.1, the network information of the blogs can be represented by W. Similarly, the blogs can also be represented as a document-term matrix A as discussed in Section 2.2. A can also be the blog-tag matrix where the individual instances or the rows of the matrix A are the blogs and the features or the columns are the user-defined tags, as in [23]. A can be decomposed into reduced concept space using the SVD transformation as shown in Eq. 2.6. Then both the network and content information can be represented as:

Images

The parameter β (0 ≤ β ≤ 1) controls the contribution of the content matrix (A) and the link or network matrix (W) in computing the blog communities or clusters. β = 1 means equal contribution of both content as well as network information for clustering or community discovery. β = 0 means only content information is used. This yields a bipartite graph between blogs and the terms, tags, or concepts. The eigenvector decomposition of W′ gives the partition of the nodes in the blog network or the communities. Such a partition minimizes the number of edges that are cut resulting in clusters that have more links within the set than outside. Moreover, such nodes share similar content. Additional constraints can be placed such as must-link or cannot-link pairs that are derived through either domain knowledge or through the data. Such constraints have been well studied in [24] under the notions of constrained spectral clustering.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset