Chapter 9

Opinion Spam Detection in Social Networks

G. Fei; H. Li; B. Liu    University of Illinois at Chicago, Chicago, IL, United States

Abstract

As social media websites have emerged as popular platforms for sharing and spreading real-time information on the Internet, impostors see huge opportunities in abusing such systems to spread distorted information. Online social networks and review websites have become targets of opinion spamming. More and more traditional review websites allow users to “friend” or “follow” each other so as to enhance the overall user experience. This has enabled significant advances in opinion spam detection using users’ social networks or, more broadly, heterogeneous networks of various entities. In this chapter we introduce techniques, such as belief propagation and collective positive-unlabeled learning, that leverage the intricate relations between different entities in the network. We discuss these methods via the application of spam detection on review-hosting websites and popular social media platforms.

Keywords

Spam detection; Belief propagation; Positive-unlabeled learning; Markov random fields; Fake reviews; Social networks

Acknowledgments

This work was supported in part by a grant from the National Science Foundation (grant no. IIS-1407927), an NCI grant (no. R01CA192240), and a gift from Bosch. The content of this chapter is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation, the NCI, or Bosch.

1 Introduction

The increasing popularity of social media such as Facebook and Twitter has provided a great platform for campaign promoters to conduct targeted campaigns and for spammers to spread misleading information. Opinions in reviews or social feeds are increasingly used by individuals and organizations to make purchase decisions, for marketing, and for product design. Positive opinions often mean profit and fame for businesses and individuals, which, unfortunately, provides strong incentives for impostors to post fake reviews to promote or to discredit target products or services. Such individuals are called opinion spammers and their activities are called opinion spamming [1]. Early research analyzed one review or one reviewer at a time and neglected the potential relationships among multiple reviews or reviewers [1–7]. In the past few years, more researchers have taken a different approach, modeling the relationships among different entities in the reviewing system, such as reviewers and products, so as to gain deeper insight into the problem and achieve better results [8–11]. In this chapter we introduce recent approaches for detecting opinion spam that exploit user-user relations and the framework of collective positive-unlabeled (CPU) learning.

2 Related Work

This chapter focuses on review spam, but social relations between other forms of spammers and nonspammers have also been studied by many researchers [12, 13]. Their work showed that acquiring followers not only increases the size of a user’s audience but also boosts the ranking of the user’s tweets. Although some promoters behave in a way similar to spammers, a large number of promoters participate in campaigns legitimately, especially in nonprofit campaigns. SPOT [14] can catch suspicious Twitter profiles by learning content-based features from text and malicious URLs in tweets and scoring their suspiciousness. Aggarwal and Kumaraguru [15] observed that purchased Twitter follower accounts differ from random Twitter users in their interaction and content-sharing patterns. Other recent studies on detecting social spammers include [16–18]. Stringhini et al. [16] deployed social honeypots to collect deceptive spam profiles from social networking communities; these spam profiles can be used for statistical analysis of their properties to create better spam classifiers. SYBILRANK [19] relies on social graph properties to rank users according to their perceived likelihood of being sybils. SSDM [20] is a sociological framework that models both social network information and content information for spammer detection. These methods paved the way for the models we introduce in the following sections.

3 Review Spammer Detection Leveraging Reviewing Burstiness

In this section we discuss one approach to fighting spam reviews on review websites by exploiting the bursty nature of online reviews [8]. In normal situations, reviews for a product should arrive randomly over time. However, there are time periods in which the reviews for a product are bursty, meaning that reviews are suddenly concentrated in those periods; such periods are called review bursts. The method makes some important assumptions about how review bursts form and uses them to build connections among seemingly unrelated reviewers so as to detect spamming reviewers/reviews. It differs from other methods in that it directly models the reviewer-reviewer relationship without using any social network information such as “friend” or “follow” links.

This approach assumes that bursts in reviews can be due to either the sudden popularity of products/services or spam attacks. For example, a TV may suddenly become popular because of a successful TV commercial, followed by a large number of customers purchasing the TV and writing reviews for it in a short period. In this case, most reviewers appearing within this burst of reviews tend to be honest reviewers. On the other hand, if a product is undergoing a spam attack, a number of spamming or fake reviews may be posted in a short period. Spam attacks usually involve multiple reviews: compared with multiple fake reviews, a single one can hardly make enough impact to change the overall sentiment toward the product. Multiple fake reviews may also be posted in a short period because of group spam [6], in which a group of spammers works together to promote or demote a product. These two possibilities lead to an important hypothesis about bursts in reviews: reviewers co-occurring in the same bursts tend to have a similar nature, meaning that they are either mostly spammers or mostly genuine reviewers.

This approach involves two independent steps. The first step is review burst detection, which looks for dense areas in the arrival pattern of reviews. The second step is review spammer detection. Given the review bursts detected in the first step and the reviewers appearing in the bursts, it models the reviewer-reviewer relationship using Markov random fields (MRFs) [21] and employs loopy belief propagation (LBP) [22] to infer the real identities of reviewers.

3.1 Burst Detection

We first introduce review burst detection. As mentioned already, this step detects bursts in reviews, allowing the second step to model the reviewers appearing within these bursts. As the modeling is based on several important assumptions regarding the properties of bursts, detecting real bursts is crucial to the performance of the spammer detection algorithm. However, since burst detection is independent of the actual spammer detection and thus beyond the scope of this chapter, we only briefly introduce one method, which is based on kernel density estimation [23]. Kernel density estimation is a nonparametric way to estimate the probability density of a random variable on the basis of a finite data sample. Given a sample $S = \{x_i\}_{i=1}^{N}$ from an unknown distribution, its density function $f(x)$ can be estimated with Eq. (9.1):

$$\hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x_i), \qquad (9.1)$$

where $K_h$ is called the scaled kernel with a bandwidth (scale) $h$ such that $K_h(t) = \frac{1}{h} K(t/h)$. $K$ is called the kernel, which should satisfy $K(u) \ge 0$ and $\int K(u)\,du = 1$. One can think of Eq. (9.1) as estimating the probability density function by averaging the effect of a set of kernel functions centered at each data point. A range of kernel functions are commonly used in kernel density estimation; this method uses a Gaussian kernel.

Assume we have a product p with a set of m reviews {p_1, …, p_m}, each associated with a review date, giving dates {t_1, …, t_m}. The method first chooses a bin size equal to a fixed time period (eg, 2 weeks). The reviews are then grouped into subintervals on the basis of their time stamps to generate a histogram of review counts. The last step is to smooth the histogram via kernel density estimation and to select bursts by setting a proper burst threshold.
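The chapter does not give an implementation, so the following is only a minimal sketch of this pipeline under stated assumptions: review times are day offsets, the kernel is Gaussian with the bin size as bandwidth (Eq. 9.1), and a burst is any run of grid points whose estimated density exceeds a hand-picked multiple of the mean density. Parameter values and function names are illustrative, not from the chapter.

import numpy as np

def gaussian_kde(x, samples, h):
    # Eq. (9.1) with a Gaussian kernel: average of kernels centered at each sample.
    t = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * t ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def detect_bursts(review_days, bin_days=14, factor=2.0):
    # Return (start, end) intervals, in days, where the smoothed review
    # density exceeds `factor` times the mean density.
    review_days = np.asarray(review_days, dtype=float)
    grid = np.arange(review_days.min(), review_days.max() + bin_days, bin_days)
    density = gaussian_kde(grid, review_days, h=bin_days)
    hot = density > factor * density.mean()
    bursts, start = [], None
    for g, flag in zip(grid, hot):
        if flag and start is None:
            start = g
        if not flag and start is not None:
            bursts.append((start, g))
            start = None
    if start is not None:
        bursts.append((start, grid[-1]))
    return bursts

# 40 background reviews spread over a year plus a 25-review burst near day 200.
rng = np.random.default_rng(0)
days = np.concatenate([rng.uniform(0, 365, 40), rng.normal(200, 3, 25)])
print(detect_bursts(days))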

3.2 Spammer Detection Under Review Bursts

Now we introduce the algorithm for detecting spammers within bursts. The algorithm uses several important assumptions about review bursts and models the co-occurrence of reviewers using a probabilistic graphical model called MRFs. We start by introducing MRFs.

3.2.1 Reviewer modeling using Markov random fields

MRFs (also called Markov networks) are undirected graphical models suited to inference problems with uncertainty in the observed data. An MRF works on an undirected graph G = (V, E), where each vertex or node $v_i \in V$ represents a random variable and each edge $(v_i, v_j)$ represents a statistical dependency between the pair of variables indexed by $i$ and $j$. A set of potential functions is defined on the cliques of the graph to measure compatibility among the nodes involved. MRFs thus define a joint distribution over all the nodes in the graph/network. Each random variable can be in any of a finite number of states $S$ and is independent of the other random variables given its immediate neighbors. The inference task is to compute the maximum likelihood assignment of states; that is, classes for classification. In this approach, the network of reviewers is represented by pairwise MRFs [24]. Instead of imposing potential functions on large cliques, the potential functions in pairwise MRFs are over single variables and pairs of variables (edges). The state of a node is assumed to depend only on its neighbors and to be independent of all other nodes in the graph. We use $\phi_i(\sigma_i)$ to denote the potential function on a single variable (indexed by node $i$), indicating the prior belief that the random variable $v_i$ is in state $\sigma_i$; we also call it the prior of the node. We use $\psi_{i,j}(\sigma_i, \sigma_j)$ to denote the potential that node $i$ is in state $\sigma_i$ and node $j$ is in state $\sigma_j$ for the edge $(v_i, v_j)$; this is also called the compatibility potential. To reduce the number of model parameters, the compatibility potentials are often shared across edges.

Now we show how this approach models the reviewers in bursts and their co-occurrences as pairwise MRFs. Reviewers co-occur if they wrote reviews in the same bursts. Each reviewer is represented by a node in the graph and has a state indicating his/her real but as yet unknown identity, which can take any of three values: spammer, mixed identity, and nonspammer. Mixed identity is included as a state because some reviewers write fake reviews for profit at times and at other times are legitimate buyers writing genuine reviews. The co-occurrence of two reviewers within the same burst is represented by an edge between their corresponding nodes, so all reviewers that appear in the same burst form a clique. To simplify the problem, the approach does not distinguish how many times two reviewers appear in the same bursts. To completely define the MRF, the prior potentials and compatibility potentials must be specified.

Node prior potentials

Node prior potentials, which represent initial class probabilities for each node, are often initialized on the basis of prior knowledge. If no external information or domain knowledge is available, one can set all the priors as unbiased/uninformative (ie, equal probability for each class). In this approach, several spammer behavior features serve as the prior knowledge:

 Rating deviation: A reasonable reviewer is expected to give ratings similar to those of other reviewers of the same product. Ratings given by spammers can be quite different from those given by other reviewers. We define the rating deviation of a reviewer as

$$\mathrm{RD}(a) = \operatorname{avg}_{p \in P_a} \frac{\lvert r_{ap} - \bar{r}_{ap} \rvert}{4}, \qquad (9.2)$$

where $r_{ap}$ is the rating given by reviewer $a$ to product $p \in P_a$ and $\bar{r}_{ap}$ is the average rating of the product given by the other reviewers. The value is normalized by 4, the maximal possible rating deviation on a 5-star rating scale.

 Burst review ratio: This feature computes the ratio of a reviewer’s reviews in bursts to the total number of reviews that he/she wrote. Since we expect the arrival of normal reviews to be random, if a reviewer has a high proportion of reviews in bursts, he/she is likelier to be a spammer. The burst review ratio of reviewer a is computed as

$$\mathrm{BRR}(a) = \frac{\lvert B_a^* \rvert}{\lvert V_a \rvert}, \qquad (9.3)$$

where $B_a^*$ is the set of reviews by reviewer $a$ that appear in review bursts and $V_a$ is the set of all reviews that he/she wrote.

 Review content similarity: We define $V_a$ as the set of reviews that reviewer $a$ wrote and $v_{a,i}$ as his/her $i$th review. Review content similarity measures the average pairwise similarity of all reviews that a reviewer wrote. Since spammers normally do not spend as much time as genuine reviewers in writing a completely new review, the words they choose each time are expected to be similar. A bag-of-words model and cosine similarity can be used to represent each review text and to measure the similarity, as shown in Eq. (9.4) for reviewer $a$:

$$\mathrm{RCS}(a) = \operatorname{avg}_{v_{a,i}, v_{a,j} \in V_a : i < j} \cos(v_{a,i}, v_{a,j}). \qquad (9.4)$$

 Reviewer burstiness: If the reviews a reviewer contributes to product review bursts also fall within a burst in his/her own reviewing history, he/she is likelier to be a spammer. Reviewer burstiness is computed as

$$\mathrm{RB}(a) = \begin{cases} 0, & L(B_a^*) - F(B_a^*) > \lambda, \\ 1 - \dfrac{L(B_a^*) - F(B_a^*)}{\lambda}, & \text{otherwise}, \end{cases} \qquad (9.5)$$

where $L(B_a^*)$ and $F(B_a^*)$ are the latest and the earliest times, respectively, of the reviews by reviewer $a$ that appear in bursts, and $\lambda$ is the time window representing a burst in a customer’s own review pattern; $\lambda$ is set to 2 months on the basis of the observation in [25]. A sketch computing these four features follows.
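To make the features concrete, here is a small sketch (with hypothetical helper names and inputs) of how Eqs. (9.2)-(9.5) could be computed for one reviewer; the inputs — ratings, the other reviewers’ averages, bag-of-words vectors, and burst time stamps — are assumed to have been extracted beforehand.

import itertools
import numpy as np

def rating_deviation(ratings, avg_others):
    # Eq. (9.2): mean absolute gap to the other reviewers' average rating,
    # normalized by 4 (the maximal deviation on a 5-star scale).
    return np.mean(np.abs(np.asarray(ratings) - np.asarray(avg_others))) / 4

def burst_review_ratio(n_in_bursts, n_total):
    # Eq. (9.3): fraction of the reviewer's reviews that fall inside bursts.
    return n_in_bursts / n_total

def review_content_similarity(review_vectors):
    # Eq. (9.4): average pairwise cosine similarity of bag-of-words vectors.
    sims = [np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            for u, v in itertools.combinations(review_vectors, 2)]
    return float(np.mean(sims)) if sims else 0.0

def reviewer_burstiness(first, latest, lam=60.0):
    # Eq. (9.5): lam is ~2 months (here in days), per the observation in [25].
    span = latest - first
    return 0.0 if span > lam else 1.0 - span / lam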

Given a set of feature values for node $i$ as prior knowledge, the next step is to compute the prior potential $\phi_i$, a real-valued vector of size 3 representing the prior probability of node $i$ being a spammer, a reviewer with mixed identity, and a nonspammer. The feature values are first averaged to produce an overall spamming indicator. One way to convert this indicator into a prior for a node with three states is as follows.

Given a normalized value $\omega$, we use a Gaussian distribution $N(\omega, \sigma^2)$ to compute a reviewer’s probability of being a spammer, a reviewer with mixed identity, and a nonspammer, as in Eq. (9.6):

$$p(x_i \mid \omega) = k \int_{i/3}^{(i+1)/3} f(t)\,dt, \quad i = 0, 1, 2, \qquad (9.6)$$

where $f(t)$ is the density function of $N(\omega, \sigma^2)$ and $x_i$ is the random variable representing the possible state of each reviewer, with $x_0$ representing nonspammer, $x_1$ mixed identity, and $x_2$ spammer. $k$ is the normalization factor that makes the three probabilities sum to 1. In this method, $\sigma = 0.25$.
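As an illustration, a direct transcription of Eq. (9.6) using the Gaussian cumulative distribution function might look as follows (a sketch only; node_prior is a hypothetical name, and $\sigma = 0.25$ as in the text):

from scipy.stats import norm

def node_prior(omega, sigma=0.25):
    # Eq. (9.6): split the mass of N(omega, sigma^2) over [0, 1/3), [1/3, 2/3),
    # and [2/3, 1] for x0 (nonspammer), x1 (mixed identity), x2 (spammer),
    # then renormalize so the three probabilities sum to 1 (the factor k).
    dist = norm(loc=omega, scale=sigma)
    masses = [dist.cdf((i + 1) / 3) - dist.cdf(i / 3) for i in range(3)]
    k = 1.0 / sum(masses)
    return [k * m for m in masses]

# A high overall spamming indicator puts most of the mass on "spammer" (x2).
print(node_prior(0.9))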

Edge compatibility potentials

The compatibility potentials, in contrast, are ideally estimated from labeled data. In a problem setting such as opinion spam detection, however, obtaining a reasonable amount of labeled data is extremely difficult because of the challenges that human annotators face. These parameters are therefore set on the basis of intuition reflecting the hypothesized properties of different types of reviewers within the same bursts. A sample instantiation of the compatibility potentials under these assumptions is shown in Table 9.1.

Table 9.1

An Example Propagation Matrix

                 Spammer    Nonspammer    Mixed Identity
Spammer          0.40       0.25          0.35
Nonspammer       0.25       0.40          0.35
Mixed identity   1/3        1/3           1/3


This instantiation is based on the following intuition. In a review burst a spammer is most likely to co-occur with other spammers so as to have a major impact on the sentiment toward the product being reviewed. Because reviewers with mixed identity could also act as spammers, a spammer is likelier to appear with them than with genuine reviewers. Similarly, genuine reviewers are likelier to appear with other genuine reviewers because of the possibility that the product becomes popular suddenly; and they are likelier to appear with reviewers with mixed identity than with heavy spammers. However, a reviewer with mixed behavior is equally likely to appear with spammers, reviewers with mixed identity, or nonspammers.

3.2.2 Reviewer identity inference using loopy belief propagation

Given the reviewer network represented as a pairwise MRF with the potential functions defined on it, the next step is to compute the posterior probability over the states/labels of each node. For specific graph topologies such as chains, trees, and other low-treewidth graphs, there are efficient algorithms for exact inference. For a general graph, however, exact inference is computationally intractable, so approximate inference is typically used. One of the most popular algorithms is LBP, an extension of the belief propagation algorithm.

The belief propagation algorithm was first proposed by Pearl [26] to find exact marginals on trees. The same algorithm can be applied to general graphs that contain loops [27]; it is then called LBP. LBP is not guaranteed to converge to the correct marginal probabilities, but Murphy et al. [22] showed that it often converges and that the resulting marginals are a good approximation to the correct posteriors.

The key idea of LBP is iterative message passing between connected nodes. A message $m_{i,j}$ represents the belief of $i$ about $j$ (ie, what $i$ “thinks” $j$’s label distribution is). More formally, $m_{i,j}$ captures the probability distribution over the class labels of $j$ and is computed from the class priors of $i$, the compatibility potential of the edge connecting $i$ and $j$, and the messages that $i$ receives from its neighbors other than $j$. Eq. (9.7) gives the message update:

$$m_{i,j}(\sigma_j) = Z_1 \sum_{\sigma_i \in S} \psi_{i,j}(\sigma_i, \sigma_j)\, \phi_i(\sigma_i) \prod_{k \in N(i) \setminus j} m_{k,i}(\sigma_i), \qquad (9.7)$$

where $Z_1$ is the normalization constant and $\sigma_j$ indexes one component of the message $m_{i,j}(\sigma_j)$, which is proportional to the likelihood of node $j$ being in state $\sigma_j$ given the evidence from $i$ in all its possible states $\sigma_i$. $N(i)$ denotes the set of neighbors of node $i$. Eq. (9.7) is an instance of the sum-product algorithm: the inner product runs over the messages from the other neighbors to node $i$, and the outer summation runs over all states that node $i$ can take. At the beginning of LBP, all messages are initialized to 1. The messages of each node from its neighbors are then updated in alternating rounds until they stabilize or a maximum number of iterations is reached. The final belief $b_i(\sigma_i)$ of node $i$ is a vector of the same dimension as the message, measuring the probability of node $i$ being in state $\sigma_i$; it is the normalized product of the node’s prior and the messages from all its neighbors, as shown in Eq. (9.8):

$$b_i(\sigma_i) = Z_2\, \phi_i(\sigma_i) \prod_{k \in N(i)} m_{k,i}(\sigma_i), \qquad (9.8)$$

where $Z_2$ is the normalization factor that ensures $\sum_{\sigma_i} b_i(\sigma_i) = 1$.

In the end, for each node the state with the highest belief score is chosen as the final state.
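The chapter does not include code, but the two update equations translate almost line for line into a small program. The following is a minimal numpy sketch of sum-product LBP on a pairwise MRF (Eqs. 9.7 and 9.8), assuming a single shared compatibility matrix such as Table 9.1; all names and the toy priors are illustrative.

import numpy as np

def loopy_bp(priors, edges, psi, max_iter=50, tol=1e-6):
    # priors: (n, S) array of node priors phi_i; psi: (S, S) shared
    # compatibility matrix; edges: undirected (i, j) pairs.
    n, S = priors.shape
    directed = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    msgs = {e: np.full(S, 1.0 / S) for e in directed}  # uniform start messages
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(max_iter):
        delta = 0.0
        for i, j in directed:
            # Eq. (9.7): prior times incoming messages except the one from j,
            # then sum over the states of i via psi.
            prod = priors[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod *= msgs[(k, i)]
            m = psi.T @ prod
            m /= m.sum()                       # normalization Z1
            delta = max(delta, np.abs(m - msgs[(i, j)]).max())
            msgs[(i, j)] = m
        if delta < tol:                        # messages stabilized
            break
    # Eq. (9.8): belief = normalized prior times all incoming messages.
    beliefs = priors.copy()
    for i in range(n):
        for k in nbrs[i]:
            beliefs[i] *= msgs[(k, i)]
    beliefs /= beliefs.sum(axis=1, keepdims=True)  # normalization Z2
    return beliefs.argmax(axis=1), beliefs

# States ordered (spammer, nonspammer, mixed identity) as in Table 9.1.
psi = np.array([[0.40, 0.25, 0.35],
                [0.25, 0.40, 0.35],
                [1 / 3, 1 / 3, 1 / 3]])
priors = np.array([[0.7, 0.1, 0.2],        # strong prior: spammer
                   [1 / 3, 1 / 3, 1 / 3],  # uninformative
                   [0.1, 0.7, 0.2]])       # strong prior: nonspammer
states, beliefs = loopy_bp(priors, [(0, 1), (1, 2)], psi)
print(states, beliefs)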

4 Detecting Campaign Promoters on Twitter

As social media platforms have become more and more important for marketing and advertising, detecting malicious promoters in social media campaigns is also an important task in detecting social media spam. Since tweets can be posted and accessed from a wide range of web-enabled services, real-time propagation of information to a large audience has become the focus of merchants, governments, and even malicious spammers. Unlike campaigns on traditional mass media platforms, social media campaigns often influence people in a hidden or implicit manner without disclosing their true intention. The readers are thus often unaware that the messages they see are strategic campaign posts aimed at persuading them to buy some target products/services or to accept some target ideas or ideologies. It is thus important to discover such campaigns, their promoter accounts, and how the campaigns are organized and executed. The goal is to classify two types of user accounts, those involved in promotion and those not involved.

Similarly to the approach introduced in the previous section, this algorithm [28] formulates the promoter detection problem with MRFs. Because Twitter allows only 140-character messages (tweets), they are often too short for the effective promotion of target products/services; promotional tweets therefore typically include URLs pointing to the full messages, which may contain pictures, audio, and video. Three types of entities are considered jointly to detect campaign promoters: users, URLs, and bursts. Each type of entity has its own set of states: a user can be either a promoter or a nonpromoter; a URL can be either a promoted or an organic URL; and a burst can be either a planned or a normal burst.

To jointly model different types of entities (different types of nodes) and their relationship, this algorithm generalizes the classic MRF to the typed MRF (T-MRF), which allows nodes of different types connecting to each other in the network. The relationship between different entities is based on the following intuition. If one user has a higher probability of being a promoter, then the URLs in his/her tweets are likely to be promoted URLs. Likewise, the burst that he/she is in is likely to be a planned burst. Such relationships can be modeled in the T-MRF, and the above intuition is the key for defining the potential functions in a T-MRF.

4.1 Campaign Promoter Modeling Using Typed Markov Random Fields

In contrast to a classic MRF, a T-MRF defines a graph with typed nodes $T = \{t_1, t_2, \ldots, t_n\}$. Each $t_i$ is an element of a finite set of types $H$; that is, $t_i \in H$. For example, three node types are used in this work: $H = \{\text{user}, \text{URL}, \text{burst}\}$. Each type of node represents a type of entity of interest (eg, user, URL, or burst in our case), and the edges between nodes represent their dependency relationships. For example, there will be a user-URL edge if a user’s tweet contains a certain URL. Again, two kinds of potential functions are defined on the T-MRF: the prior potential $\phi_i(\sigma_i \mid t_i)$ and the compatibility potential $\psi_{i,j}(\sigma_i, \sigma_j \mid t_i, t_j)$. The compatibility potential gives the probability of a node $v_j$ of type $t_j$ being in state $\sigma_j$ given that its neighboring node $v_i$ of type $t_i$ is in state $\sigma_i$. We now describe how the prior and compatibility potentials are set.

4.1.1 Node prior potentials

In this method, there are three types of nodes and each type of node needs a different way of estimating priors given the available prior knowledge:

 URL and burst priors: In principle, URLs could be classified into the promoted or organic class in the same way as users. However, labeling URLs directly is difficult because a URL usually appears in a large number of tweets, which raises the cost of labeling tremendously. Instead, one can derive estimates of the class/state probabilities of URL nodes from the labels of users: if a URL is tweeted more by labeled promoters than by labeled nonpromoters, it is believed to be promoted. URLs tweeted by neither labeled promoters nor labeled nonpromoters receive equal probabilities for the two states. Even though there are many more unique URLs than labeled users, the priors of the popular URLs in a campaign can be approximated this way. The prior belief of a burst can be estimated similarly: planned bursts are dominated by promoters, while natural bursts are dominated by normal users.
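A minimal sketch of this prior estimation, with hypothetical names and smoothing added for numerical stability (the chapter does not specify smoothing):

def url_prior(tweeters, user_labels, eps=0.1):
    # Prior (promoted, organic) for a URL node, derived from labeled users:
    # the more it is tweeted by labeled promoters relative to labeled
    # nonpromoters, the more "promoted" mass it receives.
    pos = sum(1 for u in tweeters if user_labels.get(u) == "promoter")
    neg = sum(1 for u in tweeters if user_labels.get(u) == "nonpromoter")
    if pos + neg == 0:
        return (0.5, 0.5)   # touched by no labeled user: uninformative prior
    p = (pos + eps) / (pos + neg + 2 * eps)
    return (p, 1 - p)

labels = {"a": "promoter", "b": "promoter", "c": "nonpromoter"}
print(url_prior(["a", "b", "d"], labels))   # a mostly promoter-tweeted URL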

4.1.2 Compatibility potentials

Since there are three types of nodes, the interactions or dependencies among different nodes (random variables) also differ. There are thus six kinds of edges in this model: user-URL, user-burst, user-user, burst-burst, URL-burst, and URL-URL. However, only four are found to be useful: user-URL, user-burst, URL-burst, and user-user (see Table 9.2). The compatibility potentials for these four types of edges are defined on the basis of the following intuition:

Table 9.2

Propagation Matrix $\psi_{i,j}(\sigma_i, \sigma_j \mid t_i, t_j)$ for Each Type of Edge Potential

(a) $t_i$ = user, $t_j$ = URL

               Promoted    Organic
Promoter       1 − 2ϵ      2ϵ
Nonpromoter    2ϵ          1 − 2ϵ

(b) $t_i$ = user, $t_j$ = burst

               Planned     Normal
Promoter       0.5 + ϵ     0.5 − ϵ
Nonpromoter    0.5 − ϵ     0.5 + ϵ

(c) $t_i$ = URL, $t_j$ = burst

               Planned     Normal
Promoted       0.5 + ϵ     0.5 − ϵ
Organic        0.5 − ϵ     0.5 + ϵ

(d) $t_i$ = user, $t_j$ = user

               Promoter    Nonpromoter
Promoter       0.5 + ϵ     0.5 − ϵ
Nonpromoter    0.5         0.5


 User-URL potentials: A user and a URL form an edge if the user has tweeted the URL at least once. If a URL is heavily promoted, the users who tweet it are likely to be promoters. In contrast, URLs that are relatively less promoted are usually mentioned by nonpromoters. URLs in the tweets of promoters are called promoted URLs.

 User-burst potentials: A user and a burst form an edge if the user posted at least one tweet in the burst. The arrival of a large number of tweets forming a burst is either a natural reaction to a successful campaign or a deliberate promoting activity from real promoters and/or their Twitter bots. We assume planned bursts contain primarily promoters and normal bursts are mostly formed by normal users who are attracted by the campaign.

 URL-burst potentials: A URL and a burst form an edge if the URL has been tweeted at least once in the burst. To maximize the influence of a campaign, campaign promoters have to continuously post tweets to maintain the advertising balance for URLs of interest. Similarly to user-burst potentials, URLs mentioned within a planned burst are likely to be promoted and URLs in a normal burst are likely to be organic.

 User-user potentials: Rather than working alone, campaign promoters can be well organized. A group of campaign accounts working collaboratively can attract a greater audience and increase their credibility. If the group of accounts is not considered collectively, it is often difficult to detect individual promoters because of insufficient features. Thus links between users are created if two users are considered similar in their behaviors. Similarity can be measured on the basis of the content similarity of their tweets, URL similarity, and “following” similarity; the last two can be computed with the Jaccard coefficient, as in the sketch after this list.
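A small sketch of the Jaccard-based part of this similarity (hypothetical names; the chapter gives no concrete linking threshold, so tau is illustrative, and tweet-content similarity is omitted):

def jaccard(a, b):
    # Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def link_users(urls_u, urls_v, followees_u, followees_v, tau=0.4):
    # Create a user-user edge when the average of URL similarity and
    # "following" similarity exceeds tau (content similarity omitted here).
    score = (jaccard(urls_u, urls_v) + jaccard(followees_u, followees_v)) / 2
    return score >= tau

print(link_users({"x.co/1", "x.co/2"}, {"x.co/1", "x.co/2", "x.co/3"},
                 {"acct1", "acct2"}, {"acct1", "acct2"}))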

4.2 Inference

As in the approach of Section 3, LBP is used to approximate the posterior probability over the states/labels of each node. In the T-MRF the message-passing update of LBP is as shown in Eq. (9.9):

$$m_{i,j}(\sigma_j \mid t_j) = Z_1 \sum_{\sigma_i \in S} \psi_{i,j}(\sigma_i, \sigma_j \mid t_i, t_j)\, \phi_i(\sigma_i \mid t_i) \prod_{k \in N(i) \setminus j} m_{k,i}(\sigma_i \mid t_i). \qquad (9.9)$$

Eq. (9.10) gives the final belief $b_i(\sigma_i \mid t_i)$ of a node $i$ of type $t_i$, a vector of the same dimension as the message that measures the probability of node $i$ of type $t_i$ being in state $\sigma_i$:

$$b_i(\sigma_i \mid t_i) = Z_2\, \phi_i(\sigma_i \mid t_i) \prod_{k \in N(i)} m_{k,i}(\sigma_i \mid t_i). \qquad (9.10)$$
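Relative to the loopy_bp sketch in Section 3, the only change the T-MRF requires is looking up the compatibility matrix per edge type instead of sharing one matrix; a fragment with hypothetical names, assuming numpy arrays as before:

import numpy as np

def typed_message(prod_i, type_i, type_j, psis):
    # Eq. (9.9): psis maps an edge's (type_i, type_j) pair to its Table 9.2
    # matrix; prod_i is node i's prior times its incoming messages,
    # excluding the message from the target node j.
    psi = psis[(type_i, type_j)]
    m = psi.T @ prod_i            # sum over the states of node i
    return m / m.sum()            # normalization Z1

eps = 0.1
psis = {("user", "url"): np.array([[1 - 2 * eps, 2 * eps],
                                   [2 * eps, 1 - 2 * eps]])}
print(typed_message(np.array([0.8, 0.2]), "user", "url", psis))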

4.3 Opinion Spam Evaluation

One of the major obstacles to review spammer detection is evaluation, because ground truth data for comparison with the model output are hard to obtain. Here we introduce an evaluation approach based on supervised text classification, which complements human evaluation:

 Supervised text classification: Supervised text classification can be considered complementary to human evaluation. In spam review detection, we can treat the spam reviews as the positive class and the nonspam reviews as the negative class, and build a classifier to separate the two. This evaluation applies only under one assumption: the spam detection step must use features different from the text features, while the review classification uses purely linguistic features. If the classification then shows good accuracy, we know that the reviews written by reviewers labeled as spammers and nonspammers on the basis of their behaviors are also separable on the basis of their review text. The mixed class is not used in this evaluation because it contains a mixture of spam and nonspam reviews, which are harder to separate. A sketch of such an evaluation classifier follows.
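A minimal sketch of such an evaluation classifier using scikit-learn; the texts and labels below are made-up toy data, whereas in practice the labels would come from the behavior-based detector:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "amazing!!! best purchase ever, five stars",        # authors labeled spammers
    "unbelievable quality, everyone must buy this now",
    "best best best, highly highly recommend to all",
    "battery lasts about two days with moderate use",   # authors labeled nonspammers
    "the strap feels cheap but the screen is sharp",
    "returned it because the size ran small for me",
]
labels = [1, 1, 1, 0, 0, 0]   # 1 = review by a labeled spammer, 0 = nonspammer

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, texts, labels, cv=3, scoring="accuracy"))
# High cross-validated accuracy would suggest the behavior-based labels
# are also separable using the review text alone.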

5 Spotting Spammers Using Collective Positive-Unlabeled Learning

Almost all social media platforms and review-hosting websites have built their own spam detection systems based on either hand-crafted rules or machine learning algorithms. But all of them face the critical problem that they tend to spot only spammers or spam that is extremely abnormal. The resulting filtered reviews have high precision but unknown recall, which gives rise to a positive-unlabeled (PU) learning problem. Li et al. [29] applied PU learning algorithms to a real-life fake review dataset shared by Dianping.com, the Chinese equivalent of Yelp. Taking the application of PU learning to Dianping as an example, we discuss why PU learning is more suitable than traditional supervised learning for datasets whose label precision is high and whose recall is unknown. Li et al. [30] proposed the CPU learning method for this task; it outperforms existing PU learning models because it leverages the intricate dependencies among different entities. In this section our discussion focuses on CPU learning.

5.1 Problem Definition

Dianping’s filtering system has very high precision but unknown recall. This means that almost all detected fake reviews are indeed fake, but the reviews that pass the filter are not necessarily all genuine. Since the passed reviews may still contain many fake ones, it is more appropriate to treat them as an unlabeled set than as negative examples. In this section we first introduce some notation and concepts and then define the CPU learning problem in a heterogeneous network.

A heterogeneous network is defined as $G = (V, T, E)$, where $V = \{v_1, v_2, \ldots, v_{|V|}\}$ is a set of nodes representing entities of different types and $E$ is the set of edges incident on $V$; $e_{i,j} \in E$ if $v_i$ has a link to $v_j$. In Dianping’s problem the heterogeneous network is an undirected graph; that is, if $e_{i,j} \in E$, then $e_{j,i} \in E$. $T = \{t_1, t_2, \ldots, t_{|V|}\}$ is the set of corresponding entity types of the nodes, and each $t_i$ is an element of a finite set of types $\Gamma$ (ie, $t_i \in \Gamma$). In the heterogeneous network defined here, there are three types of nodes: users, reviews, and IP addresses; that is, $\Gamma = \{\text{user}, \text{review}, \text{IP address}\}$. The edges between nodes represent their dependency relationships. Fig. 9.1 schematically shows the three types of nodes and some edges between them.

Fig. 9.1 Sample network of the users, reviews, and IP addresses.

Each node $v_i$ is associated with a class label $y_i \in S_{t_i}$, where $S_{t_i}$ is the set of states for its entity type $t_i$. In the PU learning setting, a review has three states: fake (positive class), truthful (negative class), and a special state called unlabeled. A user has two states: spammer and nonspammer. An IP address can be either suspicious or organic.

5.2 Collective Classification

CPU learning uses the iterative classification algorithm (ICA) [31] as its underlying relational classifier. While a conventional classifier is a function $f$ that maps the observed feature matrix $X$ onto the class label space $Y$, the ICA breaks the independence assumption. Consider a node $v_i$ whose class label $y_i \in Y$ needs to be determined, and suppose a neighborhood function $N_i = \{j \mid e_{i,j} \in E\}$ returns the indices of its neighbors. $M_i = \{\hat{y}_j \mid j \in N_i, \hat{y}_j \in \hat{Y}\}$ is the class label vector of the neighbors of node $i$, derived from $N_i$ and the matrix of estimated class labels $\hat{Y}$. The ICA trains a local classifier $f$ whose input is the observed features $x_i$ of $v_i$ together with the estimated labels $M_i$ of its neighbors. Because many node labels are unknown, this process is executed iteratively, and in each iteration the label of each node is assigned the current best estimate from the local classifier, $y_i = f(x_i, M_i)$. The base learner of the ICA can be any conventional classifier with probabilistic outputs, such as logistic regression or a support vector machine.

As the network is heterogeneous, Li et al. [30] divide it into multiple homogeneous networks, in which nodes are connected if they share common neighbors in the original network. Unlike the ICA, CPU learning allows initial labels to be overridden if the current probability estimate strongly indicates the opposite prediction. This is especially useful for mining fake reviews that are identified by the collective classifier but missed by Dianping’s filter [32]. During each iteration, the model produces new estimates for the nodes of all three types. For each type of node, the predicted probabilities are assumed to follow a normal distribution, from which sets of reliable positive and reliable negative instances can be extracted. Once a reliable positive instance is identified, its label is changed from unlabeled to positive. However, because Dianping’s filter has very high precision, the label of a reliable negative is updated only if it was previously unlabeled. Algorithm 9.1 outlines the CPU learning model.

Algorithm 9.1

Collective Positive-Unlabeled Learning Algorithm

Input: heterogeneous network G = (V, T, E); training set indices A; test set indices D; feature matrices X^R, X^U, X^I; labels Y_A^R = {y_i | i ∈ A, y_i ∈ {+, u}}, Y^U, Y^I
Output: Y_D^R = {y_i | i ∈ D, y_i ∈ {+, −}}
Note: the superscripts R, U, and I stand for reviews, users, and IP addresses respectively
________________________________________________________________________

1: N, Ŷ = Initialize(G, A, D, X^R, X^U, X^I, Y_A^R, Y^U, Y^I)
2: while Ŷ has not stabilized and the maximal number of iterations has not elapsed do
3:   M = ∅ // matrix of relational features from adjacent nodes
4:   Ŷ = Predict(N, M, A, D, X^R, X^U, X^I, Y_A^R, Y^U, Y^I)
5: end while
6: Output Y_D = {ŷ_i | i ∈ D, ŷ_i ∈ Ŷ}
7: ________________________________________________________________________
8: procedure Predict(N, M, A, D, X^R, X^U, X^I, Y_A^R, Y^U, Y^I)
9:   for i ∈ {1, 2, …, |V|} do
10:    M_i = {ŷ_j | j ∈ N_i, ŷ_j ∈ Ŷ^R ∪ Ŷ^U ∪ Ŷ^I}
11:    Append M_i to M
12:  end for
13:  Train a logistic regression classifier f^R from X_A^R, Y_A^R, and M
14:  Train a logistic regression classifier f^U from X^U, Y^U, and M
15:  Train a logistic regression classifier f^I from X^I, Y^I, and M
16:  // update Ŷ^R, Ŷ^U, Ŷ^I
17:  Ŷ^R = ∅, Ŷ^U = ∅, Ŷ^I = ∅
18:  for i ∈ {1, 2, …, |V|} do
19:    k = t_i // t_i ∈ {R, U, I} is the entity type
20:    ŷ_i = f^k(x_i, M_i)
21:    Append ŷ_i to Ŷ^k
22:  end for
23:  Compute the mean μ_k and standard deviation σ_k of Ŷ^k for all three types
24:  for i ∈ {1, 2, …, |V|} do
25:    k = t_i // t_i ∈ {R, U, I} is the entity type
26:    if ŷ_i ≥ μ_k + σ_k then // reliable positive data
27:      update Y^k with label y_i set to +
28:      if i ∈ D then
29:        // reliable positive in testing reviews for training
30:        move i from D to A
31:      end if
32:    else if ŷ_i < μ_k − σ_k then // reliable negative data
33:      if i ∈ D or y_i is not + then
34:        update Y^k with label y_i set to −
35:      end if
36:      if i ∈ D then
37:        // reliable negative in testing reviews for training
38:        move i from D to A
39:      end if
40:    end if
41:  end for
42:  return Ŷ
43: end procedure
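The relabeling step (lines 23-41) is what distinguishes CPU learning from the plain ICA. Below is a small Python sketch of that step for one node type, with hypothetical names; the algorithm itself leaves the data structures unspecified.

import numpy as np

def update_reliable_labels(scores, labels, is_test):
    # scores: classifier probabilities for one node type; labels: '+', '-',
    # or 'u' (unlabeled); is_test: True for indices in D. Returns the updated
    # labels and the indices to move from the test set D to the training set A.
    mu, sd = scores.mean(), scores.std()
    promote = []
    for i, s in enumerate(scores):
        if s >= mu + sd:                          # reliable positive
            labels[i] = '+'
            if is_test[i]:
                promote.append(i)
        elif s < mu - sd:                         # reliable negative
            if is_test[i] or labels[i] != '+':    # never flip a known '+'
                labels[i] = '-'
            if is_test[i]:
                promote.append(i)
    return labels, promote

scores = np.array([0.95, 0.50, 0.05])
print(update_reliable_labels(scores, ['u', 'u', 'u'], [True, False, True]))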

5.3 Model Evaluation

The effectiveness of the CPU learning model was evaluated on gold standard data from Dianping’s system. Tenfold cross-validation was performed for each model; F1 score, precision, recall, and accuracy are reported in Table 9.3. The collective classification models (the ICA, MHCC [33], and CPU learning) achieve much better classification results (F1 score, precision, and accuracy) than the nonrelational classifiers, which classify each instance independently, and CPU learning outperforms the ICA and MHCC. The nonrelational PU learning models (PU-LEA [34], Spy-EM [35], and Spy-SVM [35]) have higher recall than the traditional supervised classifier, logistic regression (LR), despite some loss in precision. This is because Dianping’s unlabeled reviews contain hidden positives; by accounting for them, PU learning trains on purer positive and negative instances.

Table 9.3

Results of 10-Fold Cross-Validation

          Accuracy   Precision   Recall   F1
LR        0.679      0.557       0.536    0.546
PU-LEA    0.664      0.530       0.611    0.567
Spy-EM    0.595      0.468       0.891    0.613
Spy-SVM   0.626      0.488       0.733    0.585
ICA       0.788      0.761       0.601    0.671
MHCC      0.804      0.777       0.643    0.702
CPU       0.840      0.829       0.700    0.759


CPU, collective positive-unlabeled; ICA, iterative classification algorithm; LR, logistic regression.

5.4 Trends and Directions

Although many algorithms have been proposed to detect fake reviews, there is still a long way to go before we can weed out opinion spamming activities. There are also many interesting research directions that have not been or have barely been explored. Here we describe several such directions based on [36]:

 Multiple site comparisons: Spammers who promote or demote some target products may write fake reviews on multiple sites to achieve the maximum effect, since many websites may sell the products or host reviews about them. It is interesting to compare reviews of the same product across these sites to discover abnormalities; for example, similar reviews (contents or ratings) written at about the same time, similar user IDs, and the same or a similar IP address. A genuine reviewer is quite unlikely to bother posting a positive review for a product on multiple sites.

 Language inconsistency: To suit different products and to stress personal experiences, fake reviewers may write something that is inconsistent or against social norms in different reviews. For example, to promote a woman’s watch, a spam reviewer may write “I bought this watch for my wife and she simply loved it.” Later, to promote a man’s shirt, the same reviewer may write “I bought this shirt for my husband yesterday.” The two sentences indicate gender inconsistency. There are many other possible inconsistencies that can be detected (eg, age, with/without children, ethnicity).

 Nature of business: In many cases the nature of the product or business can help to detect fake reviewing activities. For example, if all positive reviews of a hotel come from people living in the area near the hotel (on the basis of their IP addresses), these positive reviews are clearly suspicious because few people would stay in a hotel if they have homes nearby. However, there are so many types of businesses, and manually compiling such normal and abnormal behavior patterns is very difficult. The challenge is how we can design an algorithm to automatically discover such abnormal behavioral patterns for each domain.

 Web use abnormality: Web servers record almost everything that a person does on a website, which is valuable for detecting fake reviews. For example, using the IP address information and click behaviors, we may find that all positive reviews for a product are from the same or a similar IP address, or many reviewers who have never visited the website suddenly came to a product page directly and wrote positive reviews. In the first case, all the reviews might have come from the same person, and in the second case, the seller of the product might have paid people to write reviews for the product and also provided a link to the product page. The fake reviewers simply clicked on the link and posted fake reviews. Again, there are a large number of abnormal web use behavior patterns, and the challenge is how to automatically mine such patterns.

 Early detection: Most existing algorithms rely on patterns detected from review contents and reviewers’ behaviors. It takes some time for most types of patterns to form. However, when the patterns finally show themselves, some major damage might already have been done. Thus it is important to design algorithms that can detect fake reviews as soon as possible, ideally immediately after they have been posted.

6 Conclusion

In this chapter we have discussed the use of network information from various social media entities in opinion spam detection. In particular, two techniques were introduced: MRFs combined with belief propagation, applied to model and infer the user-user, user-URL, and user-burst relationships; and CPU learning, employed to make use of existing commercial spam detection systems whose filtered reviews have high precision but unknown recall. The growing number of ways users can interact on social media brings both challenges and opportunities for those who fight opinion spam: opinion spammers gain more ways to subtly manipulate the systems and influence genuine users, but their footprints across these different interactions also give us more room to design more accurate filtering systems.

References

[1] Jindal N., Liu B. Opinion spam and analysis. In: WSDM. 2008:219–230.

[2] Jindal N., Liu B., Lim E.-P. Finding unusual review patterns using unexpected rules. In: CIKM. 2010:1549–1552.

[3] Lim E.-P., Nguyen V.-A., Jindal N., Liu B., Lauw H.W. Detecting product review spammers using rating behaviors. In: CIKM. 2010:939–948.

[4] Ott M., Choi Y., Cardie C., Hancock J.T. Finding deceptive opinion spam by any stretch of the imagination. In: ACL. 2011:309–319.

[5] Li F., Huang M., Yang Y., Zhu X. Learning to identify review spam. In: IJCAI. 2011:2488–2493.

[6] Mukherjee A., Liu B., Glance N.S. Spotting fake reviewer groups in consumer reviews. In: WWW. 2012:191–200.

[7] Li J., Ott M., Cardie C., Hovy E. Towards a general rule for identifying deceptive opinion spam. In: ACL. 2014:1566–1576.

[8] Fei G., Mukherjee A., Liu B., Hsu M., Castellanos M., Ghosh R. Exploiting burstiness in reviews for review spammer detection. In: ICWSM. 2013:175–184.

[9] Wang G., Xie S., Liu B., Yu P.S. Review graph based online store review spammer detection. In: ICDM. 2011:1242–1247.

[10] Akoglu L., Chandy R., Faloutsos C. Opinion fraud detection in online reviews by network effects. In: ICWSM. 2013:2–11.

[11] Rayana S., Akoglu L. Collective opinion spam detection: bridging review networks and metadata. In: KDD. 2015:985–994.

[12] Ghosh S., Viswanath B., Kooti F., Sharma N.K., Korlam G., Benevenuto F., Ganguly N., Gummadi P.K. Understanding and combating link farming in the twitter social network. In: WWW. 2012:61–70.

[13] Yang C., Harkreader R.C., Zhang J., Shin S., Gu G. Analyzing spammers’ social networks for fun and profit: a case study of cyber criminal ecosystem on Twitter. In: WWW. 2012:71–80.

[14] Perez C., Lemercier M., Birregah B., Corpel A. SPOT 1.0: scoring suspicious profiles on Twitter. In: ASONAM. 2011:377–381.

[15] Aggarwal A., Kumaraguru P. Followers or phantoms? An anatomy of purchased Twitter followers. CoRR. 2014 abs/1408.1534.

[16] Stringhini G., Kruegel C., Vigna G. Detecting spammers on social networks. In: ACSAC. 2010:1–9.

[17] Aggarwal C.C. An introduction to social network data analytics. In: Social Network Data Analytics. New York: Springer; 2011:1–15.

[18] De Cristofaro E., Friedman A., Jourjon G., Kaafar M.A., Shafiq M.Z. Paying for Likes?: Understanding Facebook like fraud using Honeypots. In: IMC. 2014:129–136.

[19] Cao Q., Sirivianos M., Yang X., Pregueiro T. Aiding the detection of fake accounts in large scale social online services. In: NSDI. 2012:197–210.

[20] Hu X., Tang J., Zhang Y., Liu H. Social spammer detection in microblogging. In: IJCAI. 2013:2633–2639.

[21] Boykov Y., Veksler O., Zabih R. Markov random fields with efficient approximations. Ithaca, NY; 1997.

[22] Murphy K.P., Weiss Y., Jordan M.I. Loopy belief propagation for approximate inference: an empirical study. In: UAI. 1999:467–475.

[23] Terrell G.R., Scott D.W. Variable kernel density estimation. Ann. Statist. 1992;20(3):1236–1265.

[24] Pieczynski W., Tebbache A.-N. Pairwise Markov random fields and its application in textured images segmentation. In: SSIAI. 2000:106–110.

[25] Xie S., Wang G., Lin S., Yu P.S. Review spam detection via temporal pattern discovery. In: KDD. 2012:823–831.

[26] Pearl J. Reverend Bayes on inference engines: a distributed hierarchical approach. In: AAAI. 1982:133–136.

[27] Yedidia J.S., Freeman W.T., Weiss Y. Understanding belief propagation and its generalizations. In: Exploring Artificial Intelligence in the New Millennium. vol. 8. Burlington, MA: Morgan Kaufmann; 2003:236–239.

[28] Li H., Mukherjee A., Liu B., Kornfield R., Emery S. Detecting campaign promoters on Twitter using Markov random fields. In: ICDM. 2014:290–299.

[29] Li H., Liu B., Mukherjee A., Shao J. Spotting fake reviews using positive-unlabeled learning. Comput. Sist. 2014;18(3):467–475.

[30] Li H., Chen Z., Liu B., Wei X., Shao J. Spotting fake reviews via collective positive-unlabeled learning. In: ICDM. 2014:899–904.

[31] Sen P., Namata G., Bilgic M., Getoor L., Gallagher B., Eliassi-Rad T. Collective classification in network data. AI Mag. 2008;29(3):93–106.

[32] Li H., Chen Z., Mukherjee A., Liu B., Shao J. Analyzing and detecting opinion spam on a large-scale dataset via temporal and spatial patterns. In: ICWSM. 2015:634–637.

[33] Kong X., Yu P.S., Ding Y., Wild D.J. Meta path-based collective classification in heterogeneous information networks. In: CIKM. 2012:1567–1571.

[34] Hernández D., Guzmán R., Montes-y-Gómez M., Rosso P. Using PU-learning to detect deceptive opinion spam. In: Proceedings of the Fourth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 2013:38–45.

[35] Liu B., Dai Y., Li X., Lee W.S., Yu P.S. Building text classifiers using positive and unlabeled examples. In: ICDM. 2003:179–188.

[36] Liu B. Sentiment Analysis—Mining Opinions, Sentiments, and Emotions. Cambridge, UK: Cambridge University Press; 2015.

