This chapter provides readers with further readings in areas related to this book. These areas are closely related to the problems, theories, and techniques reviewed in the book. Readers may use the material presented in this chapter as a reference if they want to explore related problems from a broader perspective. The areas reviewed in this chapter include estimation theory, data quality and trust analysis, outlier analysis and attack detection, recommender systems, surveys, and opinion polling.
In estimation theory, expectation maximization (EM) is a general optimization technique for finding the maximum likelihood estimation (MLE) of parameters in a statistical model where the data are “incomplete” or involve latent variables in addition to the estimation parameters and observed data [93]. That is, either some values are missing from the data, or the model can be formulated more simply by assuming the existence of some unobserved data. In many cases, EM is used for parameter estimation of mixture distributions, where the latent variables indicate which mixture component is active [136]. The general EM algorithm iterates between two main steps, the Expectation step (E-step) and the Maximization step (M-step), until the estimation converges (i.e., the likelihood function reaches its maximum). In the E-step, the algorithm computes the expectation of the log-likelihood function of the complete data (the so-called Q-function) with respect to the conditional distribution of the latent variables given the current settings of the parameters and the observed data. In the M-step, it re-estimates the parameters for the next iteration by maximizing the expectation of the log-likelihood function defined in the E-step. EM is frequently used for data clustering in data mining and machine learning because a collection of clusters can be modeled as a mixture distribution. For language modeling, EM is often used to estimate parameters of a mixed model where the exact model from which the data are generated is unobservable [137]. EM has also been used in PLSA [138, 139] and community detection [140, 141]. There are also many good tutorials on EM algorithms [142–144]. In this book, we showed that social sensing applications lend themselves nicely to an EM formulation because it is natural to think that each individual source speaks out differently about true claims than about untrue ones.
In other words, a source’s willingness to espouse a claim is drawn from a mixture distribution, where the possible ground truths form the mixture components. The optimal solution, in the MLE sense, directly leads to an accurate quantification of measurement correctness as well as participant reliability.
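As a concrete, if generic, illustration of the E-step/M-step loop described above, the following sketch fits a two-component one-dimensional Gaussian mixture, where the latent variable indicates which component generated each point. The function and variable names here are illustrative, not taken from the book.

```python
import math
import random

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    # crude initialization from the data range
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability (responsibility) that each point
        # was generated by each component, given current parameters
        resp = []
        for x in data:
            p = [w[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate parameters to maximize the expected
        # complete-data log-likelihood computed in the E-step
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
    return w, mu, var

random.seed(0)
data = ([random.gauss(0, 1) for _ in range(300)]
        + [random.gauss(5, 1) for _ in range(300)])
w, mu, var = em_gmm_1d(data)
```

After a few dozen iterations the estimated means approach the true component means (0 and 5) and the mixture weights approach 0.5 each; the responsibilities simultaneously yield a soft clustering of the data.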
The Cramér-Rao lower bound (CRLB) is a fundamental bound used in estimation theory to characterize the lower bound on the estimation variance of a deterministic parameter [109]. The Fisher information is defined as the second moment of the score function of the observed random variable with respect to the estimation parameter [111]. Intuitively, if the Fisher information is large, the distribution at the true parameter value θ0 is well distinguished from the distributions at parameter values that are not close to θ0. This means we are able to estimate θ0 well (hence with a small variance) based on the data. If the Fisher information is small, our estimate will be worse for a similar reason. We reviewed the basics of the CRLB and Fisher information in Chapter 3. The CRLB has been used to study the performance of estimators in different applications such as range estimation [145], sinusoidal parameter estimation [146], and bearing estimation [147]. For example, Wang et al. leveraged the CRLB to estimate the accuracy of time-based range estimation (TBRE) using orthogonal frequency-division multiplexing (OFDM) [145]. Qian et al. used the CRLB to show that a frequency domain nonlinear least squares estimation algorithm achieves near-optimal performance for noisy damped sinusoidal signals [146]. Wang et al. analyzed the performance bounds (i.e., the CRLB) of a location-penalized MLE for bearing-only target localization [147]. One of the key properties of the MLE is its asymptotic normality. This property states that the MLE is asymptotically distributed according to a normal distribution as the data sample size increases [112]. The mean of the limiting normal distribution is the true value of the estimation parameter, and its variance is given by the CRLB of the estimation. Asymptotic normality has recently been studied in various contexts such as stochastic blockmodels [148], maximum entropy models [149], Markov jump processes [150], and binary neural networks [151].
The EM scheme we reviewed in this book provides the MLE of source reliability for social sensing applications. We also reviewed a quantification approach to compute the confidence interval for source reliability estimation based on both the actual and asymptotic CRLB by leveraging the asymptotic normality of our MLE estimator.
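In standard notation (which may differ slightly from that used in Chapter 3), for a scalar parameter θ with likelihood f(x; θ) and true value θ0, the quantities discussed above are:

```latex
% Fisher information: second moment of the score function
I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^{2}\right]

% CRLB: any unbiased estimator \hat{\theta} satisfies
\operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)}

% Asymptotic normality of the MLE over n i.i.d. samples
\sqrt{n}\,\bigl(\hat{\theta}_{\mathrm{MLE}} - \theta_{0}\bigr)
  \xrightarrow{\;d\;} \mathcal{N}\!\bigl(0,\; I(\theta_{0})^{-1}\bigr)
```

The last line is what licenses the confidence intervals mentioned above: for large samples, the MLE behaves like a normal random variable centered at the true value with variance given by the CRLB.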
Data quality and integration is a critical problem in the database community, and a number of techniques have been developed to address it. These techniques include methods for detecting erroneous values [152, 153], entity resolution [154, 155], information extraction [156, 157], type inference [158, 159], and schema matching [160, 161]. In addition, an end-to-end data curation system (Data Tamer) has been developed to perform data cleaning and reusable transformation [162]. Direct manipulation and programming-by-demonstration (PBD) methods have been applied to specific cleaning tasks in many data cleaning applications [163–165]. Supervised learning presumes the existence of labeled data for training. Because it is difficult to collect ground-truth labels, many researchers have turned to crowdsourcing to label the data. A significant literature exists in the machine learning community on improving data quality and identifying low-quality labelers in a multi-labeler environment. In such a context, multiple non-expert sources can offer cheap but noisy labels at scale for supervised modeling. Robust techniques have been developed to improve data quality when using noisy labels. Sheng et al. proposed a repeated labeling scheme that improves label quality by selectively acquiring multiple labels, and empirically compared several models that aggregate responses from multiple labelers [166]. Dekel et al. applied a classification technique to simulate aggregate labels and prune low-quality labelers in a crowd, improving the label quality of the training dataset [167]. However, all of the above approaches make explicit or implicit assumptions that are not appropriate in the social sensing context. For example, the work in [166] assumed labelers were known a priori and could be explicitly asked to label certain data points. The work in [167] assumed that most labelers were reliable and that a simple aggregation of their labels would be enough to approximate the ground truth.
In contrast, participants in social sensing usually upload measurements based on their own observations, and simple aggregation techniques (e.g., majority voting) were shown to be inaccurate when participant reliability is insufficient [37]. The MLE approach reviewed in this book addresses these challenges by casting the reliable social sensing problem as an optimization problem that can be efficiently solved by the EM scheme.
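A tiny simulation illustrates why unweighted majority voting fails when unreliable sources dominate. Note one deliberate simplification: the weighted variant below assumes source reliabilities are known, whereas the EM scheme reviewed in this book estimates them jointly with claim correctness. All numbers and names here are hypothetical.

```python
import math
import random

random.seed(1)
truth = [random.random() < 0.5 for _ in range(200)]   # ground truth of 200 binary claims
reliability = [0.9, 0.9, 0.35, 0.35, 0.35]            # two good sources, three unreliable ones

# each source reports the true value with probability equal to its reliability
reports = [[c if random.random() < r else (not c) for c in truth]
           for r in reliability]

def majority_vote(reports):
    # declare a claim true if more than half of the sources say so
    return [sum(votes) > len(votes) / 2 for votes in zip(*reports)]

def weighted_vote(reports, rel):
    # weight each source's vote by the log-odds of its reliability;
    # sources worse than chance get negative weight (their votes invert)
    w = [math.log(r / (1 - r)) for r in rel]
    return [sum(wi if v else -wi for wi, v in zip(w, votes)) > 0
            for votes in zip(*reports)]

def accuracy(est):
    return sum(e == t for e, t in zip(est, truth)) / len(truth)

maj_acc = accuracy(majority_vote(reports))
wgt_acc = accuracy(weighted_vote(reports, reliability))
```

Because three of the five sources are wrong more often than right, plain majority voting is dragged toward their errors, while the reliability-weighted vote recovers most claims correctly. Estimating those weights without ground truth is precisely what the EM formulation is for.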
We reviewed several important trust analysis schemes developed to solve fact-finding problems in information networks (i.e., fact-finders) in Chapter 4. They normally depend on the source and claim networks that describe “who said what” to make trust decisions. In addition to those schemes, there exists a large body of literature on trust analysis that looks into attributes of the sources as well as the lexicon, syntax, and semantics of the claims to improve analysis performance. For example, Pasternack et al. proposed a generalized fact-finding framework that incorporates a wealth of background knowledge and contextual information, such as source attributes (e.g., age, educational attainment, groups), claim similarity, and the uncertainty in the information extraction of claims [44]. Amin et al. designed an extended version of the MLE-based fact-finding framework that explicitly considers sources’ biases in the model, and showed performance improvements over state-of-the-art schemes in scenarios where source opinions are polarized [168]. Gupta et al. developed a credibility analysis scheme based on Twitter to identify credible events [169]. Their scheme explored attributes/features of both sources (e.g., number of friends, followers, status updates, profile) and claims (e.g., existence of slang words, supportive URLs, words in first/second/third person pronouns, number of named entities related to the event, sentiment analysis). Vydiswaran et al. developed a content-driven trust propagation framework that helps ascertain the veracity of free-text claims and estimate the trustworthiness of their sources. Their approach explored the evidence related to a claim, the uncertainty in the quality of those evidence artifacts, and the information network structure [170]. Finally, Castillo et al. and O’Donovan et al. investigated a combination of different content, social, and behavioral features to assess the credibility of Twitter messages [171, 172].
Several previous efforts on data cleaning and outlier analysis from data mining, and on noise removal from statistics, addressed some notion of noisy data [115, 116, 173–176]. They differ in the assumptions made, the modeling approaches applied, and the targeted objectives. For example, Bayesian inference and decision tree induction techniques are applied to fill in missing data values with predictions from a constructed model [173]. Binning and linear regression techniques are used to smooth noisy data, either by using bin means or by fitting the data to some linear function [174, 175]. Clustering techniques are widely used to detect outliers by organizing similar data values into clusters and identifying the values that fall outside the clusters as outliers [176]. Other approaches are used in statistics to estimate model parameters or filter noise from continuous data [115, 116, 177]. The Random Sample Consensus (RANSAC) algorithm is a widely used robust parameter estimation algorithm that can potentially tolerate a large outlier contamination rate [177]. The Kalman filter is an efficient recursive filter that estimates the latent variables of a linear dynamic system from a series of noisy measurements [115]. It produces estimates by computing a weighted average of predicted and measured values based on their uncertainty. Particle filters are more sophisticated filters based on sequential Monte Carlo methods. They are often used to determine the distribution of a latent variable whose state space is not restricted to Gaussian distributions [116]. Our work is complementary to the above efforts. On one hand, an appropriately cleaned and outlier-free dataset will likely result in better estimates from our scheme.
On the other hand, outliers or noise may not be completely removed (or removable at all) by the data cleaning and outlier analysis techniques mentioned above, due to their own limitations (e.g., linear model assumptions, continuous data assumptions, known data distribution assumptions). The quantified, confidence-bounded estimates provided by our approach, on both information sources and observed data, could in turn help data cleaning and outlier analysis tools do a better job.
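To make the Kalman filter intuition above concrete, here is a minimal one-dimensional sketch that tracks a nearly constant latent value from noisy measurements. The process and measurement noise variances (q and r) are assumed known, and all names are illustrative.

```python
import random

def kalman_1d(measurements, q=1e-4, r=0.25):
    """Minimal 1-D Kalman filter for a nearly constant latent value.
    q: process noise variance; r: measurement noise variance."""
    x, p = measurements[0], 1.0   # initial state estimate and its variance
    estimates = []
    for z in measurements:
        p = p + q                 # predict: uncertainty grows by process noise
        k = p / (p + r)           # Kalman gain: weight given to the new measurement
        x = x + k * (z - x)       # update: weighted average of prediction and measurement
        p = (1 - k) * p           # posterior variance shrinks after each update
        estimates.append(x)
    return estimates

random.seed(0)
noisy = [1.0 + random.gauss(0, 0.5) for _ in range(200)]  # noisy readings of the constant 1.0
smoothed = kalman_1d(noisy)
```

The gain k is exactly the "weighted average based on uncertainty" described above: when the prediction is uncertain (p large relative to r), the new measurement dominates; when the prediction is confident, the measurement is largely discounted.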
In intrusion detection, one critical task is to detect (or identify) malicious nodes (or sources) accurately and confidently. Two main kinds of detection techniques exist: signature-based detection and anomaly-based detection [176, 178]. Signature-based detection takes predefined attack patterns (specified by domain experts) as signatures and monitors a node’s behavior (or network traffic) for matches to report an anomaly [176]. Anomaly-based detection builds profiles of normal node (or network) behavior and uses these profiles to detect new patterns that deviate markedly from them [178]. For the reliable social sensing problem in our work, it is not obvious what behavior patterns malicious (unreliable) sources will exhibit without knowing the correctness of their measurements. Hence, there may be no easy way to apply the intrusion detection techniques mentioned above to discover malicious sources in social sensing applications. Instead, given the MLE of participant reliability and the corresponding confidence interval provided by our scheme, we are able both to identify unreliable sources and to quantify their reliability with a certain confidence, without prior knowledge of their behavior patterns.
Since people are an indispensable element in social sensing, some popular attacks originating from human (or source) interactions are interesting to investigate. A collusion attack is carried out by a group of colluding attackers who collectively perform some malicious (sometimes illegal) actions based on their agreement, in order to defraud honest sources or obtain objectives forbidden by the system. This attack can be mitigated by monitoring the interactions or relationships among colluding attackers or by identifying abnormal behavior within the group [179]. A Sybil attack is another related attack, carried out by a single attacker who intentionally creates a large number of pseudonymous identities and uses them to gain a disproportionately large influence on the system. This attack can be mitigated by certifying trusted identity assignment, increasing the cost of creating identities, limiting the resources an attacker can use to create new identities, and so on [180]. By handling reports from colluding or duplicate sources in a way that accounts for source dependencies, we will be able to address the above attacks to some extent. For example, by identifying duplicate sources, we can remove them, along with their reports, from the observed dataset, which is expected to improve estimation performance. Problems become more interesting when sources are not just duplicates but are actually linked through some orthogonal information network (e.g., a social network). Recent work has investigated theory to characterize one’s ability to identify and compensate for attacked nodes within a large-scale sensor network performing target detection [181]. These principles may provide a starting point for analyzing and characterizing sensing over social networks.
Our work is related to a type of information filtering system called recommender systems, whose goal is usually to predict a user’s rating of, or preference for, an item using a model built from the characteristics of the item and the behavioral pattern of the user [182]. EM has been used either in collaborative recommender systems as a clustering module [183] to mine users’ usage patterns, or in content-based recommender systems as a weighting factor estimator [184] to infer the user context. However, the reliable social sensing problem targets a different goal: we try to quantify how reliable a source is and identify whether a measured variable is true, rather than predict how likely a user would be to choose one item over another. Moreover, users in recommender systems are commonly assumed to provide reasonably good data, while sources in social sensing are in general unreliable, and the likelihood that their measurements are correct is unknown a priori. There appears to be no straightforward way to apply methods from the recommender systems literature to the target problem with unpredictably unreliable data. Additionally, the ratings or preferences obtained from users in recommender systems are sometimes subjective [185]. For example, some people may prefer Ford cars to Toyota cars, while others prefer exactly the opposite. It is hard to say who is right and who is wrong, because there is no universal ground truth for the items being evaluated. We note that the work in this book may not be directly applicable to the above case due to the different assumptions made in models for truth finding. In social sensing applications, we aim to leverage the data contributed by common individuals to reconstruct the state of the physical world, where we usually do have a universal ground truth associated with the assertions that describe those physical states (e.g., a building is either on fire or not).
The techniques reviewed in this book make much more sense under this assumption of social sensing applications. They enable an application not only to obtain the optimal estimates (in the MLE sense) of source and information reliability, but also to assess the quality of those estimates relative to the ground truth.
Surveys and influence analysis are often subjective [186]. They tend to survey personal facts, or individual emotions and sentiments [187], as opposed to assessing a physical state that is external to the human (sensor). For example, a survey question may ask “Was the customer service representative knowledgeable?” or “Do you support the government’s decision to increase taxes?” Survey participants answer the questions with their own ideas independently, and the responses are often private [188]. Source dependency is not the main issue in these studies [189]. In contrast, in this book, our goal is not to determine what individuals feel, think, or support, or to extract who is influential, popular, or trending. Instead of assessing humans’ own beliefs, opinions, popularity, or influence, we focus on applications concerned with the observation and state estimation of an external environment. That external state has a unique ground truth that is independent of human beliefs. Humans act merely as sensors of that state. There is therefore an objective and unambiguous notion of sensing error, leading to a clear optimization problem whose goal is to reconstruct the ground truth with minimum error from reported human observations.
The work reviewed in this book should not be confused with work from sociology and statistics on opinion polling, opinion sampling, influence analysis, and surveys. Opinion polling and sampling are usually carefully designed and engineered by experts, who create appropriate questionnaires and select representative participants [190, 191]. These are often controlled experiments, and the provenance of the information is also controllable [192]. Moreover, data cleaning in these settings is domain specific, and semantic knowledge is required [193]. In contrast, in the reliable sensing problem studied in this book, data collection is open to all. We assume no control over either the participants (data sources) or the measurements in their reports. The reliability of sources and the provenance of their data are usually unknown to the applications. The approaches reviewed in this book are designed to be general and to require no domain-specific knowledge to clean the data.