CHAPTER 3

Influence and Trust

In this chapter, we discuss two significant concepts that arise from network structure, the positions and roles of individuals, and the knowledge solicitations of other members: influence and trust. Influence is a characteristic of an individual that defines the capacity to exert some effect on other individuals. A blogger is influential if he/she has the capacity to affect the behavior of fellow bloggers. Such bloggers are also called leaders or bellwethers of the community. Trust, on the other hand, has more to do with knowledge solicitation. It is the belief in the reliability of one’s actions. People believe what a trusted individual says or does. Trust acts as a lubricant that improves information flow and promotes frankness and transparency, e.g., a book review by a person with a real username is more trusted. Trust can also be helpful in online discussion forums where users look for informative and trustworthy answers to their questions. Giving trustworthy recommendations could also improve customer retention. An influential individual is also trustworthy; however, a trustworthy individual may not be influential. These concepts will be defined formally in subsequent sections.

3.1 INFLUENCE

The advent of participatory Web applications (or Web 2.0 [1]) has turned the former mass information consumers into the present information producers [2]. Examples include blogs, wikis, social annotation and tagging, media sharing, and other such services. Blogging is becoming a popular means for mass Web users to express, communicate, share, collaborate, debate, and reflect. The blogosphere provides a conducive platform to build virtual communities of special interest. We studied how communities can be extracted from the blogosphere in Chapter 2. The blogosphere also inspires viral marketing [4], provides trend analysis and sales prediction [5, 6], aids counter-terrorism efforts [7], and acts as a grassroots information source [8].

Virtual communities bear many resemblances to physical communities. Physical communities are formed by people with similar interests, who convene and discuss issues ranging from daily matters, politics, and social events to business ideas and decisions. Bloggers meet in their virtual communities in Blogosphere and conduct activities similar to those of their counterparts in a physical community. In the physical world, according to [25], 83% of people prefer consulting family, friends or an expert over traditional advertising before trying a new restaurant, 71% of people do the same before buying a prescription drug or visiting a place, and 61% of people talk to family, friends or an expert before watching a movie. In short, before people buy or make decisions, they talk and listen to others’ experiences, opinions, and suggestions. The latter affect the former in their decision making, and are aptly termed the influentials [25]. Influence has always been a topic of unabated interest in business and society. Influential members of a physical community have been recognized as market-movers, those who can sway opinions and affect many on a wide spectrum of decisions. For about every 10 people, one leads the other 9 [25]. With the pervasive presence and ease of use of the Web, an increasing number of people with different backgrounds flock to the Web, a virtual world, to conduct many previously inconceivable activities, from shopping to making friends to publishing. As we draw parallels between physical and virtual communities, we are intrigued by questions such as whether influentials exist in a virtual community (a blog), who they are, and how to find them.

Blogs can be categorized into two major types: individual and community blogs. An individual blog has a single author, who records his/her thoughts, expresses opinions, and offers suggestions or ideas. Others can comment on a blog post, but cannot start a new line of blog posts. These are more like diary entries or personal experiences. Examples of individual blogs are Sifry’s Alerts: David Sifry’s musings (http://www.sifry.com/alerts/) (Founder & CEO, Technorati), Ratcliffe Blog – Mitch’s Open Notebook (http://www.ratcliffeblog.com/), The Webquarters (http://webquarters.blogspot.com/), etc. A community blog is one where each blogger can not only comment on blog posts, but also start new topic lines. It is a place where a group or community of bloggers with common interests voluntarily get together, share their thoughts, exchange ideas and opinions, and discuss various issues related to a special interest. Examples of community blog sites are Google’s Official Blog (http://googleblog.blogspot.com/), The Unofficial Apple Weblog (http://www.tuaw.com/), Boing Boing: A Directory of Wonderful Things (http://boingboing.net/), etc. For an individual blog, the host is the only one who initiates and leads the discussions and thus is naturally the influential blogger of his/her site. Perhaps the community discovery approaches studied in Chapter 2 can be used to extract communities among a set of individual blogs, and then influential blogs/bloggers can be identified for the synthesized communities. For a community blog, many have equal opportunities to participate, so it is natural to ask who the influential bloggers are.

The identification of the influential bloggers can help develop innovative business opportunities, forge political agendas, discuss social and societal issues, and lead to many interesting applications. For example, the influentials are often market-movers. Since they can influence the buying decisions of fellow bloggers, identifying them can help companies better understand the key concerns and new trends about products that interest them, and smartly engage them with additional information and consultation to turn them into unofficial spokesmen. As reported in [26], approximately 64% of advertising companies have acknowledged this phenomenon and are starting blog advertising.

The influentials could also sway opinions in political campaigns and elections, and affect reactions to government policies [27]. Tapping into the influentials can help understand changing interests, foresee potential pitfalls and likely gains, and adapt plans in a timely and proactive (not just reactive) manner. The influentials can also help in customer support and troubleshooting, since their solutions are regarded as trustworthy because of the sense of authority these influentials possess. An increasing number of companies these days host blog sites for their customers where people can discuss issues related to a product. For example, Macromedia (http://weblogs.macromedia.com/) aggregates, categorizes and searches the blog posts of 500 people who write about Macromedia’s technology. Instead of going through every blog post, an excellent entry point is to start with the influentials’ posts.

Some recent numbers from Technorati (http://www.technorati.com/) show a 100% increase in the size of Blogosphere every six months, “. . . , about 1.6 Million postings per day, or about 18.6 posts per second” (http://www.sifry.com/alerts/archives/000436.html). Blogosphere has grown over 60 times during the past three years. With such a phenomenal growth, novel ways have to be developed in order to keep track of the developments in the blogosphere.

The problem of ranking blog sites or bloggers differs from that of finding authoritative webpages using algorithms like PageRank [28] and HITS [29]. PageRank would assign a numerical weight to each blog post to “measure” its relative importance. The PageRank score of a blog post pi is a probability, PR(pi), that represents the likelihood that a random surfer clicking on links will arrive at this blog post, and is given by:

$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$

where d is the damping factor (the probability that the random surfer continues clicking at each step rather than stopping), M(pi) is the set of all the blog posts that link to pi, L(pj) is the total number of outbound links on blog post pj, and N is the total number of blog posts. The PageRank values R could be computed as the entries of the dominant eigenvector of the modified adjacency matrix,

$$\mathbf{R} = \begin{bmatrix} (1-d)/N \\ \vdots \\ (1-d)/N \end{bmatrix} + d \begin{bmatrix} \frac{l(p_1,p_1)}{L(p_1)} & \cdots & \frac{l(p_1,p_N)}{L(p_N)} \\ \vdots & \ddots & \vdots \\ \frac{l(p_N,p_1)}{L(p_1)} & \cdots & \frac{l(p_N,p_N)}{L(p_N)} \end{bmatrix} \mathbf{R}, \qquad \mathbf{R} = (PR(p_1), \ldots, PR(p_N))^T$$

where the function l(pi, pj) is 1 if blog post pj links to blog post pi, and 0 otherwise.
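As a rough illustration of the PageRank recurrence above, the following sketch iterates it on a tiny link structure. The function name, the example posts, and the uniform redistribution of score from posts with no outbound links are assumptions made for this example; they are not part of the original formulation.

```python
def pagerank(links, d=0.85, iters=100):
    """Iterate PR(p_i) = (1 - d)/N + d * sum(PR(p_j) / L(p_j)).

    links maps each post to the list of posts it links to.
    """
    posts = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(posts)
    pr = {p: 1.0 / n for p in posts}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in posts}
        for p in posts:
            outs = links.get(p, [])
            if outs:
                share = d * pr[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Assumption: a post without outlinks spreads its score uniformly.
                for q in posts:
                    new[q] += d * pr[p] / n
        pr = new
    return pr


# Hypothetical three-post example: p1 and p2 link to p3, p3 links back to p1.
example_scores = pagerank({"p1": ["p3"], "p2": ["p3"], "p3": ["p1"]})
```

In this toy example p2 receives no links, so its score stays at the baseline (1 − d)/N, while p3 collects the most.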

As pointed out in [30], blog sites in the blogosphere are very sparsely linked and it is not suitable to rank blog sites using Web ranking algorithms. The Random Surfer model of webpage ranking algorithms [28] does not work well for sparsely linked structures. The temporal aspect is most significant in blog domain. While a webpage may gain authority over time (as its adjacency matrix gets denser), the older a blog post gets the less attention it gets, and hence its influence diminishes over time. This is due to the fact that the adjacency matrix of blogs (considered as a graph) will get sparser as thousands of new sparsely-linked blog posts appear every day.

Influential Blog Sites vs. Influential Bloggers: Researchers have studied influence in the blogosphere from the perspective of both influential blog sites and influential bloggers. Finding influential blog sites in the blogosphere is an important research problem, which studies how some blog sites influence the external world and other sites within the blogosphere [31]. This line of research delves into identifying those blog sites that are most popular. Such blog sites could be maintained by several bloggers, and little or nothing is known about the influence of an individual blogger. This problem is, however, different from the problem of identifying influential bloggers in a community. Regardless of whether a blog is influential or not, it can have its own influential bloggers. Blogosphere follows a power law distribution [32], with very few influential blog sites forming the short head of the distribution and a large number of non-influential sites forming the long tail, where abundant new business, marketing, and development opportunities can be explored [33]. Influential bloggers can thus be identified at a blog site regardless of whether the site itself is influential.

3.1.1 GRAPH BASED APPROACH

Treating the blogosphere as a graph is a natural choice. As mentioned in Chapter 1 the blogosphere can be depicted as either a blog graph or a blog post graph. One can easily derive a blogger graph based on the blog graph. Identifying central or influential nodes in such a graph representation is a well studied problem from the social network analysis perspective. Based on the graph one can identify influential blogs, influential blog posts, or influential bloggers respectively for blog graph, blog post graph or blogger graph. We will first examine some of these measures commonly known as centrality measures before embarking on identifying central or influential nodes in the blogosphere.

Centrality measures determine the relative importance of a vertex within the graph based on its position in the graph. These measures help in studying the structural attributes of nodes in a network. The location of a node in the network determines the importance, influence or prominence of a node in the network. These measures also decide the extent to which the network revolves around a node. Four widely used centrality measures in network analysis are: degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality. Next, we explain each one of them in detail.

Degree Centrality: Degree centrality refers to the number of edges incident upon a node. In the case of a directed graph, degree centrality is further defined as indegree and outdegree centrality. Higher indegree centrality refers to “popularity” of a node, whereas higher outdegree centrality refers to “gregariousness” of a node. Mathematically, degree centrality of a node u is defined as,

$$C_D(u) = \frac{d(u)}{n-1}$$

where CD(u) refers to the degree centrality of the node u, d(u) refers to the degree of node u or the number of edges incident upon u or the total number of ties u has, and n denotes the total number of nodes in the graph. Note that d(u) is computed using Equation 1.1. In case of a directed graph, indegree and outdegree centrality can be computed as,

$$C_D^{in}(u) = \frac{d^{in}(u)}{n-1}, \qquad C_D^{out}(u) = \frac{d^{out}(u)}{n-1}$$

where CinD(u) and CoutD(u) denote the indegree and outdegree centrality, din(u) and dout(u) denote the indegree and outdegree of node u and are computed using Equation 1.2 and Equation 1.3, respectively.

Example 3.1. Consider the blog graph in Figure 1.1(b). CinD(B1) = 2/3, CinD(B2) = 0, CinD(B3) = 2/3, and CinD(B4) = 1/3; CoutD(B1) = 2/3, CoutD(B2) = 1/3, CoutD(B3) = 0, and CoutD(B4) = 2/3.

Higher degree means that the node has higher probability to catch whatever is flowing through the network, hence making it more prominent.
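The indegree and outdegree centralities can be computed directly from an edge list. The sketch below is illustrative only: the edge list is a hypothetical one chosen so that the degrees match the values quoted in Example 3.1, and it is not necessarily the graph of Figure 1.1(b).

```python
def degree_centralities(nodes, edges):
    """Return (indegree centrality, outdegree centrality) for a directed graph."""
    n = len(nodes)
    indeg = {u: 0 for u in nodes}
    outdeg = {u: 0 for u in nodes}
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    # Normalize by the maximum possible degree, n - 1.
    c_in = {u: indeg[u] / (n - 1) for u in nodes}
    c_out = {u: outdeg[u] / (n - 1) for u in nodes}
    return c_in, c_out


# Hypothetical edges (source, destination) consistent with Example 3.1.
nodes = ["B1", "B2", "B3", "B4"]
edges = [("B1", "B3"), ("B1", "B4"), ("B2", "B1"), ("B4", "B1"), ("B4", "B3")]
c_in, c_out = degree_centralities(nodes, edges)
```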

Betweenness Centrality: Betweenness centrality of a node u refers to the ratio of the number of geodesic paths between any two nodes s and t that pass through node u to the total number of geodesic paths that may exist between s and t. Mathematically, it is defined as,

$$C_B(u) = \sum_{s \neq u \neq t \in V} \frac{\sigma_{st}(u)}{\sigma_{st}}$$

where CB (u) denotes the betweenness centrality of u, σst the total number of geodesic paths between s and t, σst (u) the number of geodesic paths between s and t that pass through u, and V the set of vertices in the graph. This measure evaluates how well a node can act as a “bridge” or intermediary between different subgraphs. A node with high betweenness centrality can become a “broker” between different subgraphs.
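A minimal sketch of this computation follows. It counts geodesic paths with breadth-first search and uses the fact that a shortest s–t path passes through u exactly when d(s, u) + d(u, t) = d(s, t). The function names are illustrative, every node is assumed to appear as a key of the adjacency dictionary, and the sum runs over ordered pairs (s, t), so each unordered pair of an undirected graph is counted twice.

```python
from collections import deque


def bfs_counts(adj, s):
    """Return (distance, number of shortest paths) from s to each reachable node."""
    dist = {s: 0}
    sigma = {s: 1}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma


def betweenness(adj):
    """Sum sigma_st(u) / sigma_st over ordered pairs (s, t) with s != u != t."""
    nodes = list(adj)
    info = {s: bfs_counts(adj, s) for s in nodes}
    cb = {u: 0.0 for u in nodes}
    for s in nodes:
        dist_s, sigma_s = info[s]
        for t in nodes:
            if t == s or t not in dist_s:
                continue
            for u in nodes:
                if u in (s, t) or u not in dist_s:
                    continue
                dist_u, sigma_u = info[u]
                if t in dist_u and dist_s[u] + dist_u[t] == dist_s[t]:
                    cb[u] += sigma_s[u] * sigma_u[t] / sigma_s[t]
    return cb
```

This brute-force version is cubic in the number of nodes; it is meant to mirror the definition rather than to scale.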

Closeness Centrality: Closeness centrality is based on the mean geodesic distance of a node to all other nodes in the graph. The nodes that have shorter geodesic distances to other nodes in the graph have higher closeness. Mathematically, closeness centrality is defined as the reciprocal of the mean geodesic distance,

$$C_C(u) = \frac{n-1}{\sum_{t \in V \setminus \{u\}} d_G(u,t)}$$

where CC(u) refers to the closeness centrality of the node u, dG(u, t) denotes the geodesic distance (the length of a shortest path) between u and t, V is the set of vertices, and n = |V| is the total number of vertices in the graph. A node with the highest closeness centrality value could be imagined as the “nearest” node to the other nodes in the network.
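A small sketch of closeness centrality follows, under the convention that closeness is the reciprocal of the mean geodesic distance, so that nearer nodes score higher. The function name and the choice to return 0 when some node is unreachable are assumptions of this example.

```python
from collections import deque


def closeness(adj, u):
    """Closeness of u as (n - 1) divided by the sum of geodesic distances from u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    n = len(adj)
    total = sum(dist.values())
    if len(dist) < n or total == 0:
        # Assumption: treat nodes that cannot reach everyone as having closeness 0.
        return 0.0
    return (n - 1) / total
```

On the path a–b–c, the middle node b is at distance 1 from both ends and gets closeness 1, while each end node gets 2/3.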

Eigenvector Centrality: Eigenvector centrality defines a node to be central if it is connected to those who are central. This could be gauged as the “authoritativeness” of a node. Essentially, if a blog is connected to several “popular” or well-known blogs, then automatically its prominence is increased. This is the only centrality measure that computes the prominence of a node or the influence of a node based on the influence or prominence of the nodes it is connected to. Google’s PageRank algorithm is also motivated by the eigenvector centrality. For the i-th node the centrality score is the sum of the scores of the nodes it is connected to. Mathematically,

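$$x_i = \frac{1}{\kappa} \sum_{j \in M(i)} x_j = \frac{1}{\kappa} \sum_{j=1}^{n} A_{ij} x_j$$

where xi denotes the centrality score of the i-th node, M(i) denotes the set of nodes the i-th node is connected to, A the adjacency matrix, Aij is 1 if i-th node is adjacent to j-th node and 0 otherwise, κ a constant (an eigenvalue of A), and n is the total number of nodes. In vector notation,

$$\mathbf{x} = \frac{1}{\kappa} A \mathbf{x}$$

or as the eigenvector equation

$$A \mathbf{x} = \kappa \mathbf{x}$$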

Hence the principal eigenvector of the adjacency matrix of the network gives the eigenvector centrality scores of the nodes in the graph.
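The principal eigenvector can be approximated by power iteration. The sketch below is one possible way to do this, not a prescribed method: it iterates with the shifted matrix A + I, which has the same eigenvectors as A but avoids oscillation on bipartite graphs. The function name, the shift, and the stopping tolerance are choices made for this example, and a connected undirected graph is assumed.

```python
import math


def eigenvector_centrality(A, iters=500, tol=1e-12):
    """Approximate the principal eigenvector of adjacency matrix A (list of rows)."""
    n = len(A)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        # One multiplication by (A + I), followed by Euclidean normalization.
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        y = [v / norm for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Hypothetical graph: a triangle on nodes 0, 1, 2 with node 3 attached to node 0.
A_example = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
scores = eigenvector_centrality(A_example)
```

Node 0 is connected to every other node and comes out on top, while the pendant node 3, whose only neighbor is node 0, scores lowest.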

Other measures for analyzing social networks are clustering coefficient (the likelihood that associates of a node are associates among themselves, which ensures greater cliquishness), cohesion (extent to which the actors are connected directly to each other), density (proportion of ties of a node to the total number of ties this node’s friends have), radiality (extent to which an individual’s network reaches out into the network and provides novel information), and reach (extent to which any member of a network can reach other members of the network).

Influential nodes in the blog graph can also be identified using the information diffusion theory. Nodes that maximize the information spread [34] can be considered as the key players or influential nodes in the graph. Two fundamental models for the information diffusion process have been considered in the literature:

• Threshold Models: Each node u in the network has a threshold that determines its tolerance or susceptibility to the infection. This threshold tu is typically drawn from a probability distribution, tu ∈ [0, 1]. The set of neighbors of u is denoted by Γ(u). Each neighbor v of u, i.e., v ∈ Γ(u), has a nonnegative edge weight, wu,v, such that $\sum_{v \in \Gamma(u)} w_{u,v} \le 1$. The weight wu,v is determined by the strength of the tie between u and v, which could be computed in a number of ways, such as the number of interactions u and v have had. The tie strength has a direct impact on v’s ability to infect u: the larger the tie strength, the greater the chance that v can infect u. Those neighbors of u that are infected exert infection over u, and u gets infected if $\sum_{v \in \Gamma(u),\ v\ \text{infected}} w_{u,v} \ge t_u$.

• Cascade Models: As the name suggests, cascade models simulate a cascade effect of infection. If a social contact v ∈ Γ(u) gets infected, u also gets infected with probability pu,v, which is proportional to the edge weight, wu,v. Cascade models can be categorized into two types: independent cascade models and generalized cascade models. In an independent cascade model the influence is independent of the history of all other node activations, i.e., v gets a single chance to infect u. If v is unsuccessful in infecting u, then v never attempts to infect u again. A generalized cascade model [35] generalizes the independent cascade model by relaxing the independence assumption.
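The two diffusion models can be sketched as short simulations. Everything named here (function names, dictionary layouts, the fixed thresholds in the usage note) is an assumption for illustration; in particular, thresholds would normally be drawn from a probability distribution rather than fixed by hand.

```python
import random


def linear_threshold(weights, thresholds, seeds):
    """weights[u][v] is w_{u,v}, the influence of neighbor v on node u.

    A node u becomes infected once the summed weight of its infected
    neighbors reaches its threshold t_u.
    """
    infected = set(seeds)
    changed = True
    while changed:
        changed = False
        for u in weights:
            if u in infected:
                continue
            pressure = sum(w for v, w in weights[u].items() if v in infected)
            if pressure >= thresholds[u]:
                infected.add(u)
                changed = True
    return infected


def independent_cascade(prob, seeds, rng=None):
    """prob[v][u] is p_{u,v}, the chance that an infected v infects neighbor u.

    Each newly infected node gets exactly one attempt per neighbor.
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for v in frontier:
            for u, p in prob.get(v, {}).items():
                if u not in infected and rng.random() < p:
                    infected.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return infected
```

For instance, with weights `{"a": {}, "b": {"a": 0.6}, "c": {"a": 0.3, "b": 0.3}, "d": {"c": 0.2}}`, all thresholds set to 0.5, and seed `a`, the threshold model infects b and then c, but d stays uninfected because its only neighbor contributes just 0.2.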

Gruhl et al. [36] study information diffusion of various topics in the blogosphere from individual to individual, drawing on the theory of infectious diseases. They associate a ‘read’ probability and a ‘copy’ probability with each edge of the blogger graph, indicating the tendency to read one’s blog post and copy it, respectively. They also parameterize the stickiness of a topic, which is analogous to the virulence of a disease. An interesting problem related to viral marketing [4, 37] is how to maximize the total influence in the network (of blog sites) by selecting a fixed number of nodes in the network. A greedy approach can be adopted to select the most influential node in each iteration after removing the selected nodes. This greedy approach outperforms PageRank, HITS and ranking by number of citations, and is robust in filtering splogs (spam blogs) [38] (spam blogs are discussed further in Chapter 4). Leskovec et al. [39] proposed a submodularity based approach to identify the most important blogs, which outperforms the greedy approach. Nakajima et al. [40] attempt to find agitators, who stimulate discussions, and summarizers, who summarize discussions, by thread detection. Watts and Dodds [41][42] studied the “influentials hypothesis” using computer simulations of interpersonal influence processes and found that large cascades of influence are driven by a critical mass of easily influenced individuals.

However, as we mentioned before, the blogosphere suffers from link sparsity due to its casual nature. Bloggers often do not cite the sources they referred to when writing their blog posts. This creates an extremely sparse structure and makes it challenging to identify influential nodes through purely network based approaches. Researchers are exploring the possibility of constructing implicit links using content similarity and the temporal dimension. In the next section, we look at approaches that focus on various statistics derived from the content of the blogs to identify influential blogs/bloggers.

3.1.2 CONTENT BASED APPROACH

Blogs, as opposed to friendship networks, contain a huge amount of textual content, which presents opportunities to exploit that content in order to identify the influentials. We will discuss some techniques that leverage the content to derive statistics that help in identifying the influentials.

Representative blog posts are the entries that represent the theme of the blog site, and can be identified by looking at the content [43]. Such entries should not only be representative in content but also diverse so that they cover most of the topics discussed on the blog site. Specifically, the problem of identifying representative blog posts from a blog Bi can be stated as: select a subset Si of blog posts such that |Si| = min(k, Ni) and Si ⊆ Bi, where Ni is the total number of blog posts in the blog Bi, and k is a user-specified parameter that sets the number of representative entries required from a blog. For each blog post Bij ∈ Si, we make sure that they are representative and diverse.

If each blog post is represented in a vector space model, the blog posts can be grouped into clusters. Only sufficiently large clusters are retained, while the other clusters are discarded as noise. A centroid for each of the retained clusters is computed as,

$$c_p^* = \frac{1}{N_{c_p}} \sum_{B_{ij} \in c_p} B_{ij}$$

where c*p is the cluster centroid of cluster cp and 1 ≤ p ≤ k, Ncp the size of the cluster cp, and Bij the vector space representation of the blog post j of blog i. Given the cluster centroids, the representativeness of a blog post Bij can be computed as,

Images

where Bij is a blog post that belongs to the cluster cp, sim a similarity function that could be computed using cosine similarity between the vector space model of Bij and the cluster centroid c*p of the cluster cp, and l(Bij) the number of distinct words in the blog post Bij. The diversity of the selected subset of blog posts Si is computed as,

Images

where Bij and Bik are the selected blog posts and belong to the subset Si, and dist a distance function that computes the distance between any two blog posts.

Based on the above measures, the task of identifying representative and diverse blog posts is reduced to evaluating the following function,

Images

where r(Si, Bi) denotes the representativeness of the blog posts in the subset Si, and d(Si) the diversity of the blog posts in the subset Si. Finding a subset Si such that f(Si, Bi) is maximized can be written as,

$$S_i^* = \arg\max_{S_i \subseteq B_i,\ |S_i| = \min(k, N_i)} f(S_i, B_i)$$

where k is the number of top blog posts that the user is interested in. Maximizing this function results in a set of blog posts Si that are most representative and diverse. The above is a combinatorial optimization problem, for which finding the global optimum is often NP-hard. Sub-optimal solutions can be computed for such a problem using a greedy strategy as follows: start with an empty set Si; at each iteration add a blog post Bij such that the difference f(Si ∪ {Bij}) − f(Si) is maximized; keep on expanding the set Si until k blog posts are included.
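The greedy strategy can be sketched generically, with the objective f passed in as a function. The toy objective below is purely an assumption for illustration (summed cosine similarity to a single given centroid as representativeness, plus summed pairwise cosine distance as diversity); it is not the exact formulation of [43].

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def make_objective(centroid):
    """Hypothetical f(S): representativeness plus diversity of the subset S."""
    def f(subset):
        rep = sum(cosine(b, centroid) for b in subset)
        div = sum(1.0 - cosine(a, b)
                  for i, a in enumerate(subset) for b in subset[i + 1:])
        return rep + div
    return f


def greedy_select(posts, k, f):
    """Repeatedly add the post with the largest marginal gain f(S + [b]) - f(S)."""
    selected = []
    remaining = list(posts)
    while remaining and len(selected) < k:
        base = f(selected)
        best = max(remaining, key=lambda b: f(selected + [b]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With posts `[(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]`, centroid `(1.0, 0.5)` and k = 2, the first pick is the post closest to the centroid, and the second pick favors the post that differs most from it rather than its near-duplicate.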

3.1.3 HYBRID APPROACH

We now present a system iFinder [44] that leverages both content-driven statistics and graph information to identify influential bloggers. Some of the desirable properties of an influential blog post are summarized as follows:

An Initial Set of Intuitive Properties: According to [25], one is influential if he/she is recognized by fellow citizens, can generate follow-up activities, has novel perspectives or ideas, and is often eloquent. Below, we examine how these social gestures describing the characteristic properties of the influential can be approximated by some collectable statistics.

Recognition: An influential blog post is recognized by many. This can be equated to the case that an influential post p is referenced in many other posts. The influence of those posts that refer to p can have different impact: the more influential the referring posts are, the more influential the referred post becomes. Recognition of a blog post is measured through the inlinks (ι) to the blog post. Here ι denotes the set of blogs/blog posts that link to blog post p.

Table 3.1: Social Gestures for Identifying Influential Bloggers and their Corresponding Collectable Statistics.

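Social Gesture          Collectable Statistic
Recognition             Set of inlinks (ι)
Activity Generation     Number of comments (γ)
Novelty                 Set of outlinks (θ)
Eloquence               Length of the blog post (λ)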

Activity Generation: A blog post’s capability of generating activity can be indirectly measured by how many comments it receives, or the amount of discussion it initiates. In other words, few or no comments suggest little interest from fellow bloggers, and thus a non-influential post. Hence, a large number of comments (γ) indicates that the post affects many, to the extent that they care to write comments, and therefore the post can be influential. There are increasing concerns over spam comments that do not add any value to the blog posts or to a blogger’s influence. Fighting spam is a topic of Chapter 4.

Novelty: Novel ideas exert more influence as suggested in [25]. Hence, the outlinks (θ) is an indicator of a post’s novelty. If a post refers to many other blog posts or articles, it indicates that it is less likely to be novel. A blog post p is less novel if it refers to more influential blog posts than if it refers to less influential blog posts. Here θ refers to the set of blogs/blog posts that blog post p refers or links to.

Eloquence: An influential person is often eloquent [25]. This property is most difficult to approximate using statistics. Given the informal nature of the blogosphere, there is no incentive for a blogger to write a lengthy piece. Hence, a long blog post often suggests some necessity of doing so. Therefore, we use the length of a post (λ) as a heuristic measure for checking if a post is influential or not. Eloquence of a blog post could be gauged using more sophisticated linguistic based measures.

The above four form an initial set of properties possessed by an influential post. We summarize these social gestures and their corresponding collectable statistics in Table 3.1. There are certainly some other potential properties. It is also evident that each of the above four may not be sufficient on its own, and they should be used jointly in identifying influential bloggers. For example, a high θ and a poor λ could identify a “hub” blog post.

Blog-post influence can be visualized in terms of an influence graph or i-graph in which the influence of a blog post flows among the nodes. Each node of an i-graph represents a single blog post characterized by the four properties (or parameters): ι, θ, γ and λ. i-graph is a directed graph with ι and θ representing the incoming and outgoing influence flows of a node, respectively. Hence, if I denotes the influence of a node (or blog post p), then InfluenceFlow through node p is given by,

$$InfluenceFlow(p) = w_{in} \sum_{m=1}^{|\iota|} I(p_m) - w_{out} \sum_{n=1}^{|\theta|} I(p_n) \qquad (3.15)$$

Figure 3.1: i-graph showing the Influence Flow across blog post p.

where win and wout are the weights that can be used to adjust the contribution of incoming and outgoing influence, respectively; pm denotes all the blog posts that link to the blog post p, where 1 ≤ m ≤ |ι|; pn denotes all the blog posts that are referred by the blog post p, where 1 ≤ n ≤ |θ|; |ι| and |θ| are the total numbers of inlinks and outlinks of post p. InfluenceFlow measures the difference between the total incoming influence of all inlinks and the total outgoing influence by all outlinks of the blog post p. InfluenceFlow accounts for the part of influence of a blog post that depends upon inlinks and outlinks. From Eq. 3.15, it is clear that the more inlinks a blog post acquires the more recognized it is, hence the more influential it is; and an excessive number of outlinks jeopardizes the novelty of a blog post which affects its influence. We illustrate the concept of InfluenceFlow in the i-graph displayed in Figure 3.1. This shows an instance of the i-graph with a single blog post. Here we are measuring the InfluenceFlow across blog post p. Influence flows from the left (inlinks) through p to the right (outlinks). We add up the influence “coming into” p and the influence “going out” of p and take the difference of these two quantities to get the p’s influence.
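The InfluenceFlow of Eq. 3.15 amounts to a weighted difference of two sums. A minimal sketch follows, assuming the current influence scores of neighboring posts are already available in a dictionary; the function and argument names are illustrative.

```python
def influence_flow(p, inlinks, outlinks, influence, w_in=1.0, w_out=1.0):
    """Weighted incoming influence minus weighted outgoing influence of post p.

    inlinks[p]  lists the posts that link to p (the set iota).
    outlinks[p] lists the posts that p links to (the set theta).
    influence   maps each post to its current influence score I.
    """
    incoming = sum(influence[m] for m in inlinks.get(p, ()))
    outgoing = sum(influence[n] for n in outlinks.get(p, ()))
    return w_in * incoming - w_out * outgoing
```

For example, if posts a and b (with scores 0.5 and 0.25) link to p and p links to c (score 1.0), then with unit weights the flow is 0.75 − 1.0 = −0.25: p cites more influence than it attracts.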

As discussed earlier, the influence (I) of a blog post is also proportional to the number of comments (γp) posted on that blog post. We can define the influence of a blog post, p, as,

$$I(p) \propto w_{com} \gamma_p + InfluenceFlow(p) \qquad (3.16)$$

where wcom denotes the weight that can be used to regulate the contribution of the number of comments (γp) toward the influence of the blog post p. We consider an additive model because an additive function is good to determine the combined value of each alternative [45]. It also supports preferential independence of all the parameters involved in the final decision. Since most decision problems like the one at hand are multi-objective, a way to evaluate trade-offs between the objectives is needed. A weighted additive function can be used for this purpose [46].

From the discussion on intuitive properties (or social gestures), we consider blog post quality as one of the parameters that may affect influence of the blog post. Although there are many measures that quantify the goodness of a blog post such as fluency, rhetoric skills, vocabulary usage, and blog content analysis, we here use the length of the blog post as a heuristic measure of the goodness of a blog post for the sake of simplicity. We define a weight function, w, which rewards or penalizes the influence score of a blog post depending on the length (λ) of the post. The weight function could be replaced with appropriate content and literary analysis tools. Combining Eq. 3.15 and Eq. 3.16, the influence of a blog post, p, can thus be defined as,

$$I(p) = w(\lambda) \times (w_{com} \gamma_p + InfluenceFlow(p)) \qquad (3.17)$$

The above equation gives an influence score to each blog post. Note that the four weights (win, wout, wcom, and w) can take more complex forms and can be tuned.
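As a small worked sketch of this score, the function below combines a length weight, the comment count, and a precomputed InfluenceFlow value. The particular length weight used here (post length capped at 500 words and scaled to [0, 1]) is a hypothetical choice for illustration only; the text leaves the form of w open.

```python
def length_weight(words, cap=500):
    """Hypothetical w(lambda): grows linearly with post length up to a cap."""
    return min(words, cap) / cap


def influence_score(length, comments, flow, w_com=1.0, weight_fn=length_weight):
    """I(p) = w(lambda) * (w_com * gamma_p + InfluenceFlow(p))."""
    return weight_fn(length) * (w_com * comments + flow)
```

With a 250-word post, 4 comments, an InfluenceFlow of −0.5 and w_com = 0.5, this gives 0.5 × (2 − 0.5) = 0.75.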

Now we consider how to use I to determine whether a blogger is influential or not. A blogger can be considered influential if he/she has at least one influential blog post. We use the blog post with maximum influence score as the representative and assign its influence score as the blogger influence index or iIndex. There could be other ways of determining the influentials. For example, if one wants to differentiate a productive influential blogger from non-prolific one, one might use another measure. For a blogger B, we can calculate the influence score for each of B’s N posts and use the maximum influence score as the blogger’s iIndex, or

$$iIndex(B) = \max(I(p_i))$$

where 1 ≤ i ≤ N. With iIndex, we can rank bloggers at a blog site. The top k among all the bloggers are the most influential ones. Thresholding is another way to find influential bloggers. However, determining a proper threshold is crucial to the success of such a strategy and requires more research and understanding of the domain. Blog posts whose influence score is higher than that of the k-th most influential blogger could be termed influential blog posts.
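Ranking bloggers by iIndex is then a matter of taking the maximum post score per blogger and sorting. A brief sketch, with illustrative names and made-up scores in the usage note:

```python
def i_index(post_scores_by_blogger):
    """Map each blogger to the maximum influence score among his/her posts."""
    return {b: max(scores)
            for b, scores in post_scores_by_blogger.items() if scores}


def top_k_bloggers(post_scores_by_blogger, k):
    """Return the k bloggers with the highest iIndex (ties broken by name)."""
    index = i_index(post_scores_by_blogger)
    return sorted(index, key=lambda b: (-index[b], b))[:k]
```

For hypothetical scores `{"alice": [0.2, 0.9], "bob": [0.5], "carol": [0.7, 0.1]}`, the iIndex values are 0.9, 0.5 and 0.7, so the top two bloggers are alice and carol.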

Computing Blogger Influence with Matrix Operations: We have described a hybrid model to compute the influence of a blog post using both content as well as network statistics. Here we convert the computational procedure into basic matrix operations for convenient and efficient implementation.

We define the inlinks and outlinks to the blog posts using a link adjacency matrix A, whose entry Aij is defined as

$$A_{ij} = \begin{cases} 1 & \text{if } p_i \text{ links to } p_j \\ 0 & \text{otherwise} \end{cases}$$

Matrix A denotes the outlinks between the blog posts. Consequently, AT denotes the inlinks between the blog posts. We define the vectors for blog post length, comments, influence, and influence flow as

$$\vec{\lambda} = (\lambda_{p_1}, \ldots, \lambda_{p_N})^T, \quad \vec{\gamma} = (\gamma_{p_1}, \ldots, \gamma_{p_N})^T, \quad \vec{i} = (I(p_1), \ldots, I(p_N))^T, \quad \vec{f} = (InfluenceFlow(p_1), \ldots, InfluenceFlow(p_N))^T$$

respectively.

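Now, Eq. 3.15 can be rewritten in terms of the above vectors as,

$$\vec{f} = w_{in} A^T \vec{i} - w_{out} A \vec{i} \qquad (3.19)$$

and Eq. 3.17 can be rewritten, with w applied to each component of the length vector and the product taken component-wise, as,

$$\vec{i} = w(\vec{\lambda}) (w_{com} \vec{\gamma} + \vec{f}) \qquad (3.20)$$

Eq. 3.20 can be rewritten using Eq. 3.19, which can then be solved iteratively,

$$\vec{i} = w(\vec{\lambda}) (w_{com} \vec{\gamma} + w_{in} A^T \vec{i} - w_{out} A \vec{i}) \qquad (3.21)$$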

The above equation requires A to be a stochastic matrix [47], which means all the blog posts must have at least one outlink. In other words, none of the rows in A has all the entries as 0. Otherwise, the influence score for such a blog post would be directly proportional to the number of comments. However, in the blogosphere, this assumption does not hold well. Blog posts are sparsely connected. This problem can be fixed by making A stochastic. This can be achieved by one of the following:

1. Removing those blog posts with no outlinks and the edges that point to these blog posts while computing influence scores. This does not affect the influence scores of other blog posts since the blog posts with no outlink do not contribute to the influence score of other blog posts.

2. Assigning 1/N in all the entries of the rows of such blog posts in A. This implies a dummy edge with uniform probability to all the blog posts from those blog posts, which do not have a single outlink.

For a stable solution of Eq. 3.21, A must be aperiodic and irreducible [47]. A graph is aperiodic if the lengths of all the paths leading from node i back to i have a greatest common divisor of 1. One can only link to a blog post that has already been published, and even if the blog post is modified later, the original posting date remains the same. We use this observation to remove cycles among the blog posts by deleting those links that are part of a cycle and point to blog posts that were posted later than the referring post. This guarantees that there are no cycles in A, which makes A aperiodic. A graph is irreducible if there exists a path from any node to any other node. Using the second strategy mentioned above, i.e., adding dummy edges to make A stochastic, ensures that A is also irreducible.
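The cycle-removal observation can be sketched as follows. As a simplification, this sketch drops every link that points to a later post rather than only those that lie on a cycle; the function names and the integer timestamps are hypothetical. Posts sharing the same timestamp are kept linked, so distinct posting dates are assumed.

```python
def drop_forward_links(links, posted_at):
    """Keep only links whose target was not posted later than its source."""
    return [(src, dst) for src, dst in links if posted_at[dst] <= posted_at[src]]


def is_acyclic(nodes, links):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    for _, dst in links:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    seen = 0
    while ready:
        node = ready.pop()
        seen += 1
        for src, dst in links:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return seen == len(nodes)


# p1 was edited after publication and now links to the later post p3,
# which closes the cycle p1 -> p3 -> p2 -> p1.
posted_at = {"p1": 1, "p2": 2, "p3": 3}
links = [("p2", "p1"), ("p3", "p2"), ("p1", "p3")]

cleaned = drop_forward_links(links, posted_at)
print(is_acyclic(posted_at, links), is_acyclic(posted_at, cleaned))
```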

Input: A set of blog posts P, the number of iterations iter, and a similarity threshold τ.

Output: The influence vector, which holds the influence scores of all the blog posts in P.

Compute the adjacency matrix A;

Compute the blog post length vector and the comments vector;

Initialize the influence vector to its starting value;

repeat

     Compute the new influence vector from the old one using Eq. 3.21;

     iter ← iter − 1;

until (cosine_similarity(old influence vector, new influence vector) > τ) ∨ (iter ≤ 0);

Algorithm 2: Compute the influence scores of a set of blog posts using the power iteration method.

 

The influence scores of blog posts can be computed by solving Eq. 3.21 using an iterative method. iFinder starts with little knowledge, and at each iteration it tries to improve its knowledge about the influence of the blog posts until it reaches a stable state or a fixed number of iterations specified a priori. The knowledge that iFinder starts with is the initialization of the influence vector. There are several heuristics that could be used for this initialization. One way is to assign each blog post the same number, such as 0.5. Another way could be to use inlink and outlink counts in some linear combination as the initial values. In iFinder, authority scores from Technorati, which are available through their API (http://technorati.com/developers/api/cosmos.html), were used. PageRank values could also be used to initialize the influence vector, but since we compare our results with the PageRank algorithm, we do not use them as the initial scores, to maintain a fair comparison.

The computation of the influence scores of blog posts can be done using the well-known power iteration method [48]. The underlying algorithm of iFinder can be described as follows: Given the set of blog posts P, {p1, p2, . . . , pN}, we compute the adjacency matrix A and the blog post length and comments vectors. The influence vector is initialized using Technorati’s authority values. Using Eq. 3.21 and this initial vector, an updated influence vector is computed. At every iteration, we use the old value of the influence vector to compute the new value. iFinder stops iterating when a stable state is reached or the user-specified number of iterations is reached, whichever is earlier. The stable state is judged by the difference between the old and new influence vectors, measured by cosine similarity. The algorithm is presented in Algorithm 2.
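The iteration loop itself can be sketched independently of the update rule. In the sketch below, the `update` argument stands in for Eq. 3.21 and is supplied by the caller; the sample `update` shown here is a hypothetical stand-in (a plain step along the inlinks of a row-stochastic matrix, rescaled to sum to 1) and is not the iFinder update. The function names, the threshold value, and the convention that iteration stops once the cosine similarity reaches the threshold are assumptions of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def iterate_scores(update, initial, max_iter, tau):
    """Apply `update` until successive vectors are similar enough
    (cosine similarity >= tau) or max_iter iterations have run."""
    current = list(initial)
    for _ in range(max_iter):
        new = update(current)
        converged = cosine_similarity(current, new) >= tau
        current = new
        if converged:
            break
    return current


# Hypothetical row-stochastic matrix (dangling row already filled with 1/N).
A = [[0.0, 0.5, 0.5],
     [1 / 3, 1 / 3, 1 / 3],
     [1.0, 0.0, 0.0]]


def update(scores):
    """Illustrative stand-in for Eq. 3.21: one step along the inlinks."""
    n = len(scores)
    new = [sum(A[i][j] * scores[i] for i in range(n)) for j in range(n)]
    total = sum(new)
    return [x / total for x in new]


scores = iterate_scores(update, [0.5, 0.5, 0.5], max_iter=100, tau=0.999999)
print(scores)
```

Passing the update rule as a function keeps the stopping logic separate from the model, so the same loop can be reused with a different initialization or update.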

3.1.4 BLOG LEADERS

Researchers have also studied various forms of blog leaders in the blogosphere. We now analyze these different types of blog leaders and compare with influential bloggers.

Based on the content type, blogs can be categorized into two classes, “Affective Blogs” and “Informative Blogs” [49]. Affective blogs are those that read like personal accounts and diaries. Informative blogs are more technology-oriented, news-related, objective, and rich in high-quality information. By training a binary classifier on a hand-labeled set of blogs using Naïve Bayes, SVM, or Rocchio classifiers, affective blogs can be separated from informative blogs. However, there could be influential bloggers who write affective blogs, and they would be missed by such a classification.

Another type of blog leader is one who brings in new information, ideas, and opinions, and then disseminates them to the masses. This type of blog leader is known as an “Opinion Leader” [50]. Their blogs are ranked using a novelty score, measured by the difference between the content of the given blog post and that of the posts it refers to. First, the blog posts are reduced to topic space using Latent Dirichlet Allocation (LDA), and then a novelty score is computed between these transformed blog posts using the cosine similarity measure. The novelty score of a blog post is defined as the dissimilarity in content of the given blog post with respect to other blog posts. It is computed by averaging the cosine similarity scores of the given blog post with respect to the other blog posts. The lower the average cosine similarity score between the given blog post and the other blog posts, the higher the novelty score of the given blog post. However, opinion leaders can be different from influential bloggers. There could be a blogger who is not very novel in his/her content but attracts a lot of attention to his/her posts through comments and feedback. Such bloggers will not be captured by a novelty-based approach. Moreover, not many blogs refer to the blogs they borrowed their content from, due to the casual nature of the blogosphere.
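The novelty computation can be sketched as follows, assuming the posts have already been mapped to topic vectors (the LDA step is not shown). Taking the novelty score to be one minus the average cosine similarity is an assumption of this sketch; the text only states that a lower average similarity means a higher novelty. The function names and sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def novelty_score(post_topics, other_topics):
    """One minus the average cosine similarity to the other posts."""
    if not other_topics:
        # Assumption: with nothing to compare against, treat the post as novel.
        return 1.0
    sims = [cosine_similarity(post_topics, other) for other in other_topics]
    return 1.0 - sum(sims) / len(sims)


# Two-topic toy example: one referred post is identical, one is unrelated.
post = [1.0, 0.0]
referred = [[1.0, 0.0], [0.0, 1.0]]
print(novelty_score(post, referred))
```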

Many blog sites list “Active Bloggers” or top blog posts in some time frame (e.g., monthly). Those top lists are usually based on some traffic information (e.g., how many posts a blogger posted, or how many comments a blog post received) [31]. Certainly, these statistics would leave out those blog sites or bloggers who were not active. Moreover, influential bloggers are not necessarily active bloggers at a blog site.

3.2 TRUST

The past couple of years have witnessed significant changes in the interactions between individuals and groups. Individuals flock to the Internet and engage in complex social relationships, forming social networks. This has changed the paradigm of interactions and content generation. Social networking has given a tremendous thrust to online communities such as the Blogosphere. Trust is extremely important in social media because of its low barriers to credibility. Profiles and identities can be easily faked, and trust can be compromised, leading to severe physical and/or psychological losses.

Trust can be defined as the relationship of reliance between two parties or individuals. “Alice trusts Bob” implies Alice’s reliance on the actions of Bob, based on what they know about each other. Trust is essentially a prediction of an otherwise unknown decision made by a party or an individual, based on the actions of another party or individual. Trust is always directional and asymmetric: “Alice trusts Bob” does not imply that Bob also trusts Alice.

Trust can be broadly categorized into three classes [51]. When we act towards others based on the belief that their actions will suit our needs and expectations, we are engaging in anticipatory trust. The act of entrusting a valued object to a third party and expecting responsible care involves responsive trust. When we act on the belief that our trust will be reciprocated by the other person, it is a case of evocative trust. Independently of these classes, there are two major types of trust. The most commonly studied type is interpersonal trust, which involves face-to-face commitments between individuals. The second type is social trust, which involves faceless commitments towards social objects that may involve individuals in the background who are most likely unknown to us.

From a sociological perspective, trust is the measure of the belief of one party in another’s honesty, benevolence, and competence. The absence of any of these properties causes a failure of trust. From a psychological perspective, trust can be defined as the ability of a party or an individual to influence the other. The more trusting someone is, the more easily he/she can be influenced. As mentioned earlier, there is a subtle difference between influence and trust. In this section, we discuss what is meant by trust and related issues, e.g., how it is computed in the blogosphere and how it propagates in the blogosphere.

3.2.1 TRUST COMPUTATION

Quantifying and computing trust in social networks is hard because concepts like trust are fuzzy, and trust is expressed in a social way. In other words, the definitions and properties are not mathematical formulations but social ones. Due to the low barrier to publication and the casual environment, there is a pressing need for handling trust in social networks. However, there is limited study on computing trust in the blogosphere. Existing works like [52], [53], [54] rely on one form or another of network centrality measures (like degree centrality, closeness centrality, betweenness centrality, eigenvector centrality) to compute nodes’ trust values. Nonetheless, blog networks have very sparse trust information between different pairs of nodes because there is no explicit notion of specifying trust values for different blogs or bloggers. Although little research has been published that exploits text mining to evaluate trust in Blogosphere, the authors of [55] have proposed using sentiment analysis of the text around the links to other blogs in the network to compute trust scores. They study the link polarity and label the sentiment as “positive”, “negative”, or “neutral”, as illustrated in Figure 3.2. The highlighted text in the blog snippets marks the links, and the underlined words/phrases denote the sentiment towards the link. Based on the words/phrases, sentiments can be identified as positive, negative, or neutral. A positive sentiment towards a link increases the trust value of the linked source, and a negative sentiment decreases it. In other words, a negative sentiment increases the distrust value of the linked source. This information mined from the blogs is coupled with Guha et al.’s [52] trust and distrust propagation approach to derive trust values between node pairs in the blog network. They further use this model to identify the distrusted nodes in the blog network and filter them out as spam blogs. More on spam blogs will be discussed in Chapter 4.

Images

Figure 3.2: Link polarity in terms of positive, negative, and neutral sentiments.

Another approach, which computes trust purely using content analysis, is presented in [56]. A primary assumption of this work is that communications between individuals build trust. The more you engage your audience in interaction, the more trust you build. Various constructs are proposed to quantify trust, including quality of effort, benevolent intent, liking, involvement, and cultural tribalism. The quality of effort refers to the earnest and conscientious activity intended to accomplish something. It is analyzed by the number of words used by the blogger, follow-up comments by the blogger, and the number of trackbacks to the blog post. The benevolent intent measures the expression of goodwill in the blogger’s post. The Harvard-Lasswell IV (H-L4) tag dictionary is used to quantify the benevolent intent of the blogger’s post. The tag dictionary is a collection of words that have been pre-assigned to one or more thematic categories. One of these themes refers to the benevolent intent of the text. Depending on the number of words used by the blogger that correspond to the “benevolent” theme, we can quantify the benevolent intent of the blogger’s post. Liking is an antecedent of trust. Using the H-L4 tag dictionary, we can quantify the “liking” theme from user comments on the blogger’s post. Involvement is a direct measure of the communications or interactions initiated by a blogger’s post. Involvement can be quantified by looking at the number of comments a blog post received, the number of words in the comments, and the number of unique individuals who added comments. Cultural tribalism directly facilitates organizational learning, which is the process of improving actions through better knowledge and understanding. Through discussion and interaction in the blogs, the followers or readers of the blog learn and improve. It not only builds trust for the blogger but also serves as a motivation for blogging. Cultural tribalism can be measured as the percentage of the individuals who took part in the discussion (or comments) this week, say, and who also took part in the discussion last week. It can also be measured as the percentage of individuals who took part in the discussion this week and who had taken part in the discussion at any time before. All these constructs are quantified, and a trust score is computed for the blogger.
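The two cultural tribalism measures reduce to a set intersection. The sketch below is illustrative only; the function name and the commenter names are hypothetical, and returning 0 when nobody commented this week is an assumption.

```python
def returning_share(current, earlier):
    """Percentage of current participants who also appear in `earlier`."""
    current, earlier = set(current), set(earlier)
    if not current:
        return 0.0  # assumption: no participants this week means no returning share
    return 100.0 * len(current & earlier) / len(current)


# Hypothetical commenters on one blog.
this_week = {"ann", "bo", "cy", "di"}
last_week = {"ann", "cy", "ed"}
two_weeks_ago = {"bo", "flo"}

# Week-over-week measure.
print(returning_share(this_week, last_week))

# "Ever before" measure: pool all earlier weeks.
print(returning_share(this_week, last_week | two_weeks_ago))
```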

3.2.2 TRUST PROPAGATION

Even if we assign trust scores to individual blogs or bloggers, it is still challenging to propagate these scores in the blog network. The reason is the same as before: the blog network is extremely sparse, since many bloggers do not cite their sources. Using trust propagation approaches on such a sparse network is extremely challenging. Note that trust is highly subjective; nevertheless, some characteristic properties can be pointed out:

Transitivity: Trust can propagate through different nodes following the transitive property. However, the degree of trust does not remain the same. It may decrease as the length of the path through which trust propagates increases.

Asymmetry: Trust is asymmetric, in the sense that if A trusts B then it is not necessary that B also trusts A. Some existing approaches relax this assumption and consider trust as symmetric.

Personalization: Trust is a personalized concept. Everyone has a different conception of trust with respect to some other individual. Assigning a global trust value to an individual is highly unrealistic. Trust of an individual is always evaluated with respect to some other individual.
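These properties can be illustrated with a small sketch. Trust is stored per ordered pair, so it is directional and personalized, and trust along a path is taken to be the product of the values on its links. The multiplicative decay is an assumption of this sketch, chosen only because it makes trust shrink as the path grows; the text does not prescribe a particular decay. The function name and the trust values are hypothetical.

```python
def path_trust(trust, path):
    """Multiply the trust values along a path; None if any link is missing."""
    value = 1.0
    for truster, trustee in zip(path, path[1:]):
        if (truster, trustee) not in trust:
            return None
        value *= trust[(truster, trustee)]
    return value


# Continuous trust values in [0, 1], keyed by (truster, trustee).
trust = {("A", "B"): 0.9, ("B", "C"): 0.8}

print(path_trust(trust, ["A", "B"]))       # direct trust
print(path_trust(trust, ["A", "B", "C"]))  # weaker than either link
print(path_trust(trust, ["C", "B", "A"]))  # no reverse links are recorded
```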

Trust can be considered as binary-valued, with 1 indicating trust and 0 indicating distrust. Trust can also be evaluated as continuous-valued. However, binary-valued trust is a little more complicated than meets the eye. A value of 0 is vague, as it could represent either no opinion or distrust. To resolve this ambiguity, researchers often use -1 to represent distrust and 0 to represent a missing value or no opinion. Researchers model the propagation of distrust the same way as the propagation of trust. Propagation of trust (T) and distrust (D) could be governed by the set of rules illustrated in Table 3.2. Here A, B, and C are different individuals, and the trust or distrust relationships between A-B and B-C are known. These rules help in inferring trust or distrust between A-C. Propagation of distrust is a little intricate. As shown in Table 3.2, if A distrusts B and B distrusts C, then A has reasons for either trusting C (the enemy of an enemy is a friend) or distrusting C (don’t trust someone who is not trusted by someone you don’t trust).

Table 3.2: Rules for Trust and Distrust propagation (based on [57]).

Images

If an intermediary like B, which could be used to infer the trust between A and C, is missing, a different strategy could be used. Trust someone only if he/she is trusted by k people, i.e., if C is trusted by k people then A could trust C. Do not trust anyone who is distrusted by k′ people, i.e., if C is distrusted by k′ people then A could distrust C. Note that the thresholds k and k′ could be learned from the data.
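The threshold strategy can be sketched with the -1/0/1 encoding described earlier. Two choices here are assumptions of the sketch, not part of the text: the thresholds are read as "at least k" and "at least k′", and the distrust rule is checked first when both thresholds are met. The function name and the sample ratings are hypothetical.

```python
def infer_from_crowd(ratings_of_c, k, k_prime):
    """Infer A's stance towards C from other people's ratings of C.

    Ratings use 1 for trust, -1 for distrust, and 0 for no opinion.
    Returns 1 (trust), -1 (distrust), or 0 (no inference possible).
    """
    trusted_by = sum(1 for r in ratings_of_c.values() if r == 1)
    distrusted_by = sum(1 for r in ratings_of_c.values() if r == -1)
    if distrusted_by >= k_prime:  # assumed tie-break: distrust wins
        return -1
    if trusted_by >= k:
        return 1
    return 0


# Hypothetical ratings of C by four other members.
ratings = {"D": 1, "E": 1, "F": -1, "G": 0}

print(infer_from_crowd(ratings, k=2, k_prime=2))
print(infer_from_crowd(ratings, k=2, k_prime=1))
print(infer_from_crowd(ratings, k=3, k_prime=2))
```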

Trust is a promising area of research in social networks, especially in the blogosphere, where most of the assumptions that hold for friendship networks are absent:

1. Social friendship networks assume initial trust values are assigned to the nodes of the network. Unless some social networking websites allow their members to explicitly provide trust ratings for other members, it is a topic of research and exploration to compute initial trust scores for the members. Moreover, in Blogosphere it is even harder to implicitly compute initial trust scores.

2. Social friendship networks assume an explicit relationship between members of the network. However, in Blogosphere there is no concept of explicit relationship between bloggers. Many times these relationships have to be anticipated using link structure in the blogs or blogging characteristics of the bloggers such as content similarity.

3. Existing approaches for trust propagation algorithms assume an initial starting point. In Blogosphere, where both network structure and initial ratings are not explicitly defined, it is challenging to tackle the trust aspect. A potential approach could be to use influential members [44] of a blog community as the seeds for trusted nodes.
