CHAPTER 1

Modeling Blogosphere

The advent of participatory Web applications (or Web 2.0 [1]) has created online media that has turned the former mass information consumers to the present information producers [2]. Examples include blogs, wikis, social annotation and tagging, media sharing, and various other such services. A blog site or simply blog (short for web log) is a collection of entries by individuals displayed in reverse chronological order. These entries, known as the blog posts, can typically combine text, images, and links to other blogs, blog posts, and/or to Web pages. Blogging is becoming a popular means for mass Web users to express, communicate, share, collaborate, debate, and reflect. Blogosphere is a virtual universe that contains all blogs. Blogosphere also represents the network of the blogs where each node could be either a blog or a blog post and the edges depict a hyperlink between two nodes in this blog network. Bloggers, the blog writers, loosely form their special interest communities where they share thoughts, express opinions, debate ideas, and offer suggestions interactively. Blogosphere provides a conducive platform to build the virtual communities of special interests. It reshapes business models [3], assists viral marketing [4], provides trend analysis and sales prediction [5, 6], aids counter-terrorism efforts [7], and acts as grassroot information sources [8].

Past few years have observed a phenomenal growth in the blogosphere. Technorati (http://technorati.com/blogging/state-of-the-blogosphere/) published a report on the growth of the blogosphere. The report mentioned that the blogosphere is consistently doubling every 5 months for the last 4 years and the size was estimated to be approximately 133 million blogs by December 2008. Furthermore, 2 new blogs or roughly 18.6 new blog posts are added to the blogosphere every second. Given the prominent and continued growth of the blogosphere, it is natural to ask whether it is possible to model the growth of the blogosphere and derive some macro-level statistics that characterizes the blog network. To study the complex network such as blogosphere, researchers can develop blog models and generate data through these models while continuously collecting blog data.

A unique characteristic of blogs is the evolution of its content over time. Unlike traditional web pages that are more or less static (with a few additions, deletions, substitutions, or changes in layout) over time, blogs accumulate content in a prescribed fashion. While the content on web pages is hard to track over time, content is appended to blogs in the form of blog posts and comments which are timestamped in a reverse-chronological order. Another key characteristic of the blogs is the ability of the readers to express their opinions about a blog post interactively by leaving comments. These comments become an integral part of the blog post. Some comments may trigger more comments or a completely new blog post. Moreover, these interactions could be observed within the same blog or across different blogs through references and citations. These timestamped, interaction patterns are highly dynamic, presenting challenges observing the evolution of the blogosphere. Analyzing complex and temporal phenomena often requires the assistance of synthetic data that matches observed characteristics. Generating synthetic data requires developing and fitting a model on the observed data. Such a model can give insights to the interaction patterns in the blogosphere, help in evaluating hypotheses and concepts, study characteristic properties of the network, and predict trends, among others.

1.1 MODELING ESSENTIALS

The blogosphere consists of two main graph structures - a blog network and a post network. A post network is formed by considering the links between blog posts, and ignoring the blogs to which they belong. In a post network, the nodes represent individual blog posts, and edges represent the links between them. A post network gives a microscopic view of the blogosphere and helps in discerning “high-resolution” details like blog post level interactions, communication patterns in blog post interactions, authoritative blog post based on links, etc. A blog network is formed by collapsing those individual nodes in the post network that belong to a single blog, to a single node. By doing so links between the blog posts that belong to a single blog disappear and links between blog posts of different blogs are agglomerated and weighted accordingly. A blog network gives a macroscopic view of the blogosphere and helps in observing “low-resolution” details like blog level interactions, communication patterns in blog-blog interactions, authoritative blogs based on links, etc. Both post and blog networks are directed graph.

Example 1.1. An example of a post network and a corresponding blog network is displayed in Figure 1.1. A node (orange rectangle) in Figure 1.1(a) represents a blog post and the edges between the nodes are the links. An edge pointing to P12 from P24 means that blog post P24 has an outlink to blog post P12. Each blog post belongs to a blog. For instance, blog posts P11, P12, and P13 belong to the blog B1. Figure 1.1(b) shows the corresponding blog network obtained by collapsing the post network. Each blog represents the node and the links between the blog posts are collapsed to form edges between the blogs. The weight on the edges of a blog graph represents the number of links in the original post network that were collapsed. For instance, the edges from P21 to P12, P24 to P12, and P23 to P13 are collapsed to form a single edge from blog B2 to B1. Since three links were collapsed to form a single edge, the weight on this edge is 3. Note that the edges between the posts of a blog are not retained in the blog network. For instance, edge from P12 to P11 does not exist in the blog network.

A third type of network structure in the blogosphere, a blogger network, could be derived from the post network. The nodes in the post network could be replaced by the authors of these posts or the bloggers. Nodes that correspond to the same blogger are collapsed. The links between the blog posts form the edges between the corresponding bloggers. A blogger network gives a different interpretation of the blogosphere. It generates a type of social network among bloggers. Note that a blogger network is also a directed graph. However, in this chapter we will restrict our focus to the study of post and blog networks. Most of the models and network characteristics presented here are equally applicable to blogger networks.

Images

Figure 1.1: A post network and the corresponding blog network.

Since the primary difference between a blog network and a post network is the granularity of the information, models developed to simulate either of the two network structures work equally well for the other. Structures of both types of network exhibit the same network characteristics. In the remaining of this chapter, we describe the models and the network characteristics without differentiating the two network structures.

Several characteristic statistics of the blogosphere can be studied and modeled. Observing these statistics helps identifying any unusual behavior in a network and evaluating the proposed models because any model that attempts to simulate the network must be able to reproduce these statistics. Network statistics can be broadly categorized into stable statistics (those stabilize over time) or evolutionary statistics (those evolve or change dynamically).

Stable Statistics: Stable statistics are time invariant and present a macroscopic information about the network. Examples of stable statistics are degree distribution, clustering coefficient, diameter of the network, average path length of the network etc.

Evolutionary Statistics: Evolutionary statistics are time variant properties of the network and tell us how the network evolves over time. Examples of evolutionary statistics include average lifespan of communities, growth rate of degree of a node etc.

Stable statistics can be further categorized into individual statistics, relational statistics, and global statistics. Each of these categories is explained as follows:

Individual Statistics: Individual statistics provide details about individual nodes of the network, e.g., degree distribution.

Images

Figure 1.2: Classification of network statistics.

Relational Statistics: Relational statistics provide details about the edges between the nodes or relations, hence the name relational statistics, e.g., clustering coefficient.

Global Statistics: Global statistics provide global information about the network, e.g., diameter of the network, and average path length of the network.

A hierarchical organization of various categories of statistics is depicted in Figure 1.2.

Recently, some additional statistics are proposed to characterize a network such as degree correlation, reciprocity, average degree, total density, community size and density distribution, connectivity, blogger growth and attrition, blogger retention, edge persistence, etc. In this chapter, we discuss the most prominent and widely-used statistics. Specifically, we focus on degree distribution, clustering coefficient, diameter of the network, and average path length of the network.

Degree distribution: Degree of a node in a network refers to the number of edges that are incident upon the node or the number of connections a node has to other nodes. This definition of the degree of a node refers to the case of undirected network. In a directed network, degree of a node is further specified according to edge directions - indegree and outdegree. Indegree refers to the number of incoming edges of a node, while outdegree refers to the number of outgoing edges from the node.

Formally, for a network G = (V, E) with V being a set of vertices and E a set of edges between the vertices, an edge eij connects vertices vi and vj. For an undirected network, degree d(vi) of a vertex vi is defined as:

Images

Similarly, for a directed network indegree (din (vi)) can be defined as:

Images

Table 1.1: Indegree and outdegree values for the nodes of the post network in Figure 1.1(a).

Images

and outdegree (dout (vi)) can be defined as:

Images

Degree distribution is the probability distribution of these degree values over the whole network. Specifically, degree distribution P(k) is defined as the fraction of nodes that have the degree (indegree or outdegree) k. For instance, if there are n nodes in a network (n = |V|) and nk nodes have degree k (nk = |v| s.t. ∀vV, d(v) = k), then P(k) = nk/n. In a directed network, we can observe indegree (Pin ((k)) and outdegree (Pout(k)) distributions separately. Degree distribution gives macroscopic information about the network as compared to the degree values of each node of the network.

Example 1.2. Node P13 in Figure 1.1(a) has an indegree of 3 and an outdegree of 1. Table 1.1 lists the indegree and outdegree values of all the nodes in the post network in Figure 1.1(a). Referring to Figure 1.1(a), Pin(1) = 6/13 and Pout (1) = 5/13. Table 1.2 presents the complete indegree (Pin(k)) and outdegree (Pout(k)) distribution for all possible values of k for Figure 1.1(a).

Clustering coefficient: It quantifies the cliquishness of a vertex with its adjacent vertices or neighbors. Adjacent vertices of a vertex v are those vertices that are directly connected to v. Clustering coefficient of a vertex v takes real values between “0” and “1”. A “0” clustering coefficient of v means that there is no edge between any of the adjacent nodes of v. Whereas, a “1” clustering coefficient of v means that all the adjacent vertices of v are directly connected among themselves, hence they form a clique (complete graph). Clustering coefficient of a vertex v is defined as the proportion of edges between the vertices that are adjacent to v divided by the total number of edges that could possibly exist between the adjacent vertices. Clustering coefficient of a network is then defined as the average of the clustering coefficient values of all the vertices in the network.

Table 1.2: Indegree and outdegree distribution of the post network in Figure 1.1(a).

Images

Let Nv be the set of adjacent vertices of v and kv be the number of adjacent vertices to node v, where kv = |Nv|. For a directed network, there are kv(kv − 1) possible edges that could exist between kv nodes adjacent to v. This is due to the fact that in a directed network edge eij is different from eji. Thus the clustering coefficient Cv for a node v for a directed network can be defined as:

Images

For an undirected network, there are kv(kv − 1)/2 possible edges between kv nodes adjacent to v. This is because in an undirected network edge eij and eji are the same. Thus the clustering coefficient Cv for a node v in an undirected network can be defined as:

Images

The clustering coefficient for the whole network (C) is defined as the average of clustering coefficients of all the vertices in the network:

Images

Example 1.3. Lets look at the clustering coefficient of the vertex B4 in the blog network in Figure 1.1(b). B4 has two adjacent vertices, B1 and B3. Hence kB4 = 2. Since the blog network is a directed network, the total possible edges that could exist between B1 and B3 is given by kB4(kB4 − 1) = 2. From the network we can see that there is only one edge between B1 and B3, so the clustering coefficient of B4, i.e., CB4 = 1/2 = 0.5. The clustering coefficient values of all the vertices in the blog network in Figure 1.1(b) is computed and shown in Table 1.3. Then the clustering coefficient for the whole blog network, C = 3/4 = 0.75. Note that if the blog network in Figure 1.1(b) is converted to an undirected network by removing the directions of the edges, the clustering coefficient of all the vertices becomes 1 and so is the clustering coefficient of the whole network.

Table 1.3: Clustering Coefficient of the vertices in the blog network in Figure 1.1(b).

 

   v

Cv

   B1

1/2

   B2

1

   B3

1

   B4

1/2

Diameter: Diameter of a network is defined as the shortest path between the farthest pair of vertices. This essentially tells us that between any pair of vertices the number of edges to be traversed is always less than or equal to the value computed as the diameter of the network. To compute the diameter we first find the shortest path between all the pairs of vertices and select the path with the maximum length as the diameter of the network. Mathematically, diameter of a network G (directed or undirected) can be defined as:

Images

where ps(vi, vj) denotes the length of the shortest path between vertices vi and vj. It is assumed that the network is connected, meaning that one can reach any vertex starting from any other vertex. However, if the network is not connected then one can identify sub-networks that are connected, also known as connected components of the network. Then pick the largest connected component (connected component with largest number of vertices) and compute its diameter.

Example 1.4. Consider the blog network shown in Figure 1.1(b). This network is not completely connected. So we can identify the largest connected component, which in this case consists of the nodes B1 and B4. The diameter of this component is 1. However, if we consider the undirected version of this network by omitting the directions of the edges then the diameter of the network would be 2, i.e., the path B2-B1-B4.

Average path length: It is defined as the average number of edges that lie along the shortest paths for all possible pairs of vertices in a network. From the perspective of social networks, average path length tells the average number of individuals one has to communicate through to reach a complete stranger. Shorter average path length is always more favorable. Mathematically, the average path length pavg(G) of a network G (directed or undirected) can be defined as:

Images

where ps(vi, vj) denotes the length of the shortest path between vertices vi and vj. If there is no path between vi and vj then ps(vi, vj) = 0. Smaller pavg does not necessarily mean that the information will diffuse faster. This is due to the fact that only reachable pair of nodes are considered while computing pavg. A network that is largely disconnected could still have a short average path length. Therefore, only connectedness and average path length together can indicate the speed of information diffusion (later in Chapter 3).

Example 1.5. Consider the blog network shown in Figure 1.1(b). We compute the shortest paths between all possible pairs of vertices as shown below. The average path length pavg for this network is 9/12 = 0.75.

Images

1.2 PREFERENTIAL ATTACHMENT BLOG MODELS

Now we look at the preferential attachment model that closely simulates the statistics observed in real blogosphere dataset. According to preferential attachment model, a node of higher degree has higher chance to get new edges. Mathematically, the probability P(eij) of an edge between vertices vi and vj is proportional to the number of edges incident upon vi, as shown below:

Images

An undirected network following the preferential attachment model can be generated as follows: a new vertex (vj) to be added to the network creates an edge with a node vi with the probability proportional to the number of edges incident upon vi, as shown in Eq 1.9.

In a directed network, the probability (P(eij)) with which a node vj will create an outlink to a node vi is proportional to the number of links already pointing to vi or the number of inlinks of vi. Mathematically,

Images

Images

Figure 1.3: Basic form of the Power Law function with (a) variable a = {1, 2, 3, 4, 5}, and (b) variable β = {2.1, 2.9}.

Similarly, the probability (P(eij)) with which a node vj will acquire an inlink from a node vi is proportional to the number of links coming out of vi or the number of outlinks of vi. Mathematically,

Images

A directed network following the preferential attachment model can be generated as follows: an edge from a new vertex vj to the existing vertex vi is created with the probability proportional to the number of inlinks of vi, as shown in Eq 1.10. Here the node vj acts as a source vertex and this models the outlink generation for a new vertex to be added to the network. Similarly, inlink generation for a new vertex vj can be modeled using Eq 1.11. Any existing node vi points to a new node vj with a probability proportional to the number of outlinks of vi. vj acts as a sink vertex.

Example 1.6. Consider the post network shown in Figure 1.1(a). A new vertex in the network would most likely create a outlink to vertices P13 or P31 since they have the highest indegree (3). Similarly, the new vertex would most likely acquire an inlink from vertex P43 since P43 has the highest outdegree (3).

According to the preferential attachment model, vertices with larger degree tend to attract more edges. This creates a “rich get richer” phenomenon. This phenomenon has also been widely studied in complex networks, scientific productivity, journal use, “cumulative advantage” in citation learning, wealth distribution domain, etc. Preferential attachment model generates a power law distribution,

Images

where a and β are constants, x is the characteristic that is simulated using the preferential attachment model and f models the distribution of x. If x is the degree of the network, then f(x) models the degree distribution. Figure 1.3(a) shows the basic form of power law with parameters a = {1, 2, 3, 4, 5} and β = 2.1. This shows that the lower the value of a the steeper the fall in the power law distribution. Figure 1.3(b) shows the basic form of power law with parameters a = {3} and β = {2.1, 2.9}. This shows that the higher the value of β the steeper the fall in the power law distribution. Power law generates a heavy tailed distribution. β in Eq 1.12 is also called the scaling exponent. This means that f(cx) ∝ f(x), where c is a constant. Thus if the function’s argument is rescaled the basic form still remains intact except changing the constant of proportionality, as shown below after substituting x with cx in Eq 1.12,

Images

The above derivation shows that scaling the argument of the function from x to cx does not affect the basic form of the power law function. Notice that the function has a linear relationship with slope β. This linear form is extremely helpful in developing a power law model for degree distribution of the blogosphere. Figure 1.4 shows the outdegree distribution of a set of blogs collected from Blogcatalog.com. The x-axis shows the log of the outdegree value and y-axis shows the log of the frequency of the outdegree values. As shown in Figure 1.4 the outdegree distribution of the blogs follow a power law distribution with a linear fit, the slope of which is equal to the scaling exponent β = 1.1693. We also study the indegree (follower) and outdegree (friend) distribution of a popular microblogging site, Twitter. Both indegree and outdegree distributions follow power law as depicted in Figure 1.5(a) and (b), respectively. Based on the linear fit on the log-log scale of the power law distributions, we can estimate the scaling exponent for indegree and outdegree distributions as 1.1517 and 1.3836, respectively.

An interesting property of power law distribution is its scale invariance or the ability to generate a scale free network. Scale invariance also refers to the self-similarity property where the sub-network is similar to the whole network in terms of network statistics. One way to identify the scale invariance property is to extract a subnetwork by performing a E-R (Erdos-Renyi) random walk on the whole network and then observe the degree distribution of the subnetwork. Blogosphere has been found to exhibit the scale invariant property as also simulated by power law models.

Other characteristics of the blogosphere such as high clustering coefficient, short average path length, and small network diameter can be simulated using the preferential attachment model. High clustering coefficient indicates that often bloggers communicate within a small focused group as reflected by their link interactions, hence forming a sub-community structure. Preferential attachment model generates a network with high clustering coefficient. Blogosphere is typically characterized by short average path length, which indicates that, on average, one can reach a blog from another blog by following the links in small number of hops. Preferential attachment model generates networks with short average path length. Blogosphere is also characterized by small network diameter, i.e., even the furthest pair of blogs are not very far. The network thus generated through a preferential attachment model exhibits small network diameter. These characteristics are explained by the small-world phenomena and such networks are also known as small-world networks. Here most vertices are not neighbors of one another but most vertices can be reached from every other vertex by a small number of steps or hops. This is because the shortest path between vertices passes through highly connected vertices also the hubs. These hubs ensure that long range vertices could be reached within small hops.

Images

Figure 1.4: Outdegree distribution of Blogcatalog [9].

Preferential Attachment Models - Hybrid variant: A preferential attachment model always favors the vertex with higher degree, hence making the rich even richer. The hybrid variant tries to give a lucky poor vertex a chance by randomly selecting a vertex to link to. The probability of forming an edge between a new vertex vj and existing vertex vi is proportional to a linear sum of the degree of vertex vi and a uniform random probability Images [10], as shown below:

Images

Here γ ≤ 1 tunes the contribution of the preferential attachment model and the uniform random model. This also attempts to generate a reducible network by preventing formation of isolated subnetworks

Preferential Attachment Models - α variant: The α variant, also known as α-scheme, is another modification of the preferential attachment model [11]. In a directed network, according to α-scheme, a new vertex vj acting as a source vertex connects to a vertex vi proportional to the in-weight of vi. In-weight of vi is defined as the sum of the indegree of vi and α, where α is a user-specified model parameter. Similarly, vj acting as a sink vertex is linked by a vertex vi with a probability proportional to the out-weight of vi. Out-weight of vi is defined as the sum of the outdegree of vi and α. Mathematically, Eqs 1.10 and 1.11 can be rewritten according to α-scheme as:

Images

Figure 1.5: Degree distribution of Twitter (a) Indegree and (b) Outdegree.

Images

In an undirected network, α-scheme is similarly defined as shown below:

Images

Note that the α-preferential attachment model can be viewed as adding α edges in the network. This could be done to reduce the sparsity in the network. α = 1 is a special case and often referred as 1-preferential attachment model.

1.2.1 LOG-NORMAL DISTRIBUTION MODELS

Recently, more sophisticated models like Log-Normal distribution models and Double Pareto Log-Normal distribution models have been studied to simulate the social networks such as the ones emerging from discussion forums and mobile call graphs, respectively. However, their application to the blogosphere still remains an area of research. Here we briefly introduce these models.

The log-normal distribution is a single tailed probability distribution of a random variable whose logarithm is normally distributed, hence the name log-normal. The probability density function of the log-normal distribution has the following form:

Images

Figure 1.6: A typical form of the Log-Normal distribution function with σ = 2.04, μ = 1.03, and θ = 1.00.

Images

where μ and σ are the mean and standard deviation of the distribution, respectively; and θ simply defines the shift in the distribution, which in case of degree distribution represents a lower bound of the degree values. To determine the maximum likelihood estimators of the parameters μ and σ of the log-normal distribution, the same procedure can be used as the normal distribution. The estimated values of these parameters are given by:

Images

A basic form of the log-normal distribution is shown in Figure 1.6 with the values of the parameters σ = 2.04, μ = 1.03, and θ = 1.00. Gomez et al. [12] used log-normal distribution to model the degree distribution on one of the popular discussion forums, Slashdot.org. They concluded that log-normal is a better fit in terms of Kolmogrov-Smirnov statistical tests for Slashdot’s degree distribution as compared to the power law.

In Figure 1.7(a) and (b), we compared the log-log plots of power law and log-normal distributions. Figure 1.7(a) represents the log-log plot of the power law shown in Figure 1.3(b) with a = 3 and β = 2.1. Similarly, Figure 1.7(b) depicts the log-log plot of the log-normal distribution shown in Figure 1.6. The linear relation of log-log plot for power law in Figure 1.7(a) is quite different from the parabolic nature of the log-log plot of the log-normal distribution as shown in Figure 1.7(b).

Double pareto log-normal (DPLN) distribution is a sophisticated version of the log-normal distribution functions and can be considered as a mixture of log-normal distributions. The parametric form of DPLN distribution is much more complex than log-normals. As the models get more and more complex they might run into the risk of overfitting the data. The distinct features of DPLN distribution are two linear sub-plots in the log-log scale and a hyperbolic middle section. DPLN distribution has been used to model the properties of the mobile call graphs such as: number of calls, distinct call partners, and total talk time and have been proved to outperform power law and log-normal distributions. However, the applicability of DPLN to the blogosphere needs further exploration.

Images

Figure 1.7: Log-log plots of the (a) power law distribution and (b) log-normal distribution.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset