Chapter 6

Confidence bounds in social sensing

Abstract

In the previous chapter, we reviewed a maximum likelihood estimator (MLE) to jointly estimate the reliability of sources and determine the correctness of facts concluded from the data. However, an important problem that remains unanswered by the MLE approach is: what is the confidence of the resulting source reliability estimation? Only by answering this question can we completely characterize estimation performance, and hence source reliability, in social sensing applications. This chapter reviews analytically-founded bounds that quantify the accuracy of such an MLE in social sensing. It is shown that the estimation confidence can be quantified accurately based on both the actual and the asymptotic Cramer-Rao lower bound (CRLB). Additionally, this chapter reviews an estimator of the accuracy of claim classification that does not require the ground truth values of the claims.

Keywords

Confidence intervals

Cramer-Rao lower bound (CRLB)

Reliability assurance

Scalability

Robustness

Assured social sensing

6.1 The Reliability Assurance Problem

In this chapter, we introduce a reliability assurance problem in social sensing, where the estimation-theoretic approaches are used to quantify the accuracy of the maximum likelihood estimation (MLE) framework presented in the previous chapter [103, 104]. In particular, the goal is to demonstrate, in an analytically-founded manner, how to compute the confidence interval of each source’s reliability. Formally, this is given by:

$\left(\hat{t}_{i}^{MLE} - c_{p}^{lower},\ \hat{t}_{i}^{MLE} + c_{p}^{upper}\right)\ c\%, \quad i = 1, 2, \dots, M$

(6.1)

where ${\hat{t}}_{i}^{M L E}$ is the MLE of the reliability of source S_i, c% is the confidence level of the estimation interval, and $c_{p}^{l o w e r}$ and $c_{p}^{u p p e r}$ represent the lower and upper bounds on the estimation deviation from the MLE ${\hat{t}}_{i}^{M L E}$, respectively. The goal is to find $c_{p}^{l o w e r}$ and $c_{p}^{u p p e r}$ for a given c% and an observation matrix SC. It turns out that the Cramer-Rao lower bounds (CRLBs) of the MLE of the source reliability need to be computed in order to obtain $c_{p}^{l o w e r}$ and $c_{p}^{u p p e r}$. Therefore, the goal of the work reviewed in this chapter is to: (i) derive the actual and asymptotic error bounds that characterize the accuracy of the MLE and compute its confidence interval; (ii) estimate the accuracy of claim classification without knowing the ground truth values of the variables; and (iii) derive the dependency of the accuracy of the MLE on the parameters of the problem space.

In this chapter, we show how to derive the confidence interval for source reliability through the computation of the CRLB for the estimation parameters (i.e., θ) and by leveraging the asymptotic normality of the MLE. We start with the review of the actual CRLB derivation and identify its scalability limitation. We then review the derivation of the asymptotic CRLB that works for the sensing topology with a large number of sources. We review the confidence interval on source reliability based on the derived CRLBs. Additionally, we also review the derivation of the expected number of mis-classified claims (i.e., false claim classified as true and true claim classified as false).

6.2 Actual Cramer-Rao Lower Bound

In this section, we first review the derivation of the actual CRLB that characterizes the estimation performance of the MLE of source reliability in social sensing. As in the previous section, the reliability of sources is assumed to be imperfect and the majority of claims are assumed to be true. In estimation theory, the CRLB expresses a lower bound on the estimation variance of a minimum-variance unbiased estimator. In its simplest form, the bound states that the variance of any unbiased estimator is at least as high as the inverse of the Fisher information [111]. An estimator that reaches this lower bound is said to be efficient. For notational convenience, the observation matrix SC is denoted as the observed data X, and we use X_ij = S_iC_j in the following derivation.

The likelihood function (containing hidden variable Z) of the MLE we get from EM is expressed in Equation (5.10), where Z = (z_j| j = 1,2,…,N) represent the hidden variables.

The EM scheme is used to handle the hidden variable and aims to find:

$\hat{\theta} = \underset{\theta}{\arg\max}\ p(X \mid \theta)$

(6.2)

where

$\begin{aligned} p(X \mid \theta) = \prod_{j=1}^{N} \Big\{ & \prod_{i=1}^{M} a_{i}^{X_{ij}} (1 - a_{i})^{(1 - X_{ij})} \times d \\ & + \prod_{i=1}^{M} b_{i}^{X_{ij}} (1 - b_{i})^{(1 - X_{ij})} \times (1 - d) \Big\} \end{aligned}$

(6.3)

By definition of CRLB, it is given by

$\begin{array}{l} C R L B = J^{- 1} \end{array}$

(6.4)

where

$J = E\left[\nabla_{\theta} \ln p(X \mid \theta)\, \nabla_{\theta}^{H} \ln p(X \mid \theta)\right]$

(6.5)

where J is the Fisher information of the estimation parameter, $\nabla_{\theta} = \left(\frac{\partial}{\partial a_{1}}, \dots, \frac{\partial}{\partial a_{M}}, \frac{\partial}{\partial b_{1}}, \dots, \frac{\partial}{\partial b_{M}}\right)^{H}$, and H denotes the conjugate transpose operation. In information theory, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an estimated parameter θ upon which the probability of X depends. The expectation in Equation (6.5) is taken over all values of X with respect to the probability function p(X|θ) for any given value of θ. Let $\mathcal{X}$ represent the set of all possible values of X_ij ∈ {0,1} for i = 1,2,…,M; j = 1,2,…,N. Note that $|\mathcal{X}| = 2^{MN}$. Likewise, let $\mathcal{X}_{j}$ represent the set of all possible values of X_ij ∈ {0,1} for i = 1,2,…,M and a given value of j. Note that $|\mathcal{X}_{j}| = 2^{M}$. Taking the expectation, Equation (6.5) can be rewritten as follows:

$J = \sum_{X \in \mathcal{X}} \nabla_{\theta} \ln p(X \mid \theta)\, \nabla_{\theta}^{H} \ln p(X \mid \theta)\, p(X \mid \theta)$

(6.6)

Then, the Fisher information matrix can be represented as:

$J = \begin{bmatrix} A & C \\ C^{T} & B \end{bmatrix}$

where submatrices A, B, and C contain the elements related with the estimation parameter a_i, b_i, and their cross terms, respectively. The representative elements A_kl, B_kl, and C_kl of A, B, and C can be derived as follows:

$\begin{aligned} A_{kl} &= E\left[\frac{\partial}{\partial a_{k}} \ln p(X \mid \theta)\, \frac{\partial}{\partial a_{l}} \ln p(X \mid \theta)\right] \\ &= E\left[\left(\sum_{j} \frac{(2X_{kj} - 1) Z_{j}}{a_{k}^{X_{kj}} (1 - a_{k})^{(1 - X_{kj})}} \sum_{q} \frac{(2X_{lq} - 1) Z_{q}}{a_{l}^{X_{lq}} (1 - a_{l})^{(1 - X_{lq})}}\right)\right] \\ &= \sum_{j} \sum_{q} E\left[\frac{(2X_{kj} - 1) Z_{j} (2X_{lq} - 1) Z_{q}}{a_{k}^{X_{kj}} (1 - a_{k})^{(1 - X_{kj})} a_{l}^{X_{lq}} (1 - a_{l})^{(1 - X_{lq})}}\right] \end{aligned}$

(6.7)

where

$Z_{j} = p(z_{j} = 1 \mid X) = \frac{A_{j} \times d}{A_{j} \times d + B_{j} \times (1 - d)}$

where

$A_{j} = \prod_{i=1}^{M} a_{i}^{X_{ij}} (1 - a_{i})^{(1 - X_{ij})}, \qquad B_{j} = \prod_{i=1}^{M} b_{i}^{X_{ij}} (1 - b_{i})^{(1 - X_{ij})}$

(6.8)

Z_j is the conditional probability of the claim C_j to be true given the observation matrix. After further simplification as shown in Appendix of Chapter 6, A_kl can be expressed as the summation of only the expectation terms where j = q:

$\begin{aligned} A_{kl} &= \sum_{j} E\left[\frac{(2X_{kj} - 1)(2X_{lj} - 1) Z_{j}^{2}}{a_{k}^{X_{kj}} (1 - a_{k})^{(1 - X_{kj})} a_{l}^{X_{lj}} (1 - a_{l})^{(1 - X_{lj})}}\right] \\ &= \sum_{j=1}^{N} \sum_{X \in \mathcal{X}_{j}} \frac{(2X_{kj} - 1)(2X_{lj} - 1) \prod_{\substack{i=1 \\ i \neq k}}^{M} A_{ij} \prod_{\substack{i=1 \\ i \neq l}}^{M} A_{ij}\, d^{2}}{\prod_{i=1}^{M} A_{ij}\, d + \prod_{i=1}^{M} B_{ij}\, (1 - d)} \end{aligned}$

(6.9)

where

$A_{ij} = a_{i}^{X_{ij}} (1 - a_{i})^{(1 - X_{ij})}, \qquad B_{ij} = b_{i}^{X_{ij}} (1 - b_{i})^{(1 - X_{ij})}$

(6.10)

Since the inner sum in (6.9) is invariant to the claim index j, $A_{kl} = N \bar{A}_{kl}$, where $\bar{A}_{kl}$ is:

$\bar{A}_{kl} = \sum_{X \in \mathcal{X}_{j}} \frac{(2X_{kj} - 1)(2X_{lj} - 1) \prod_{\substack{i=1 \\ i \neq k}}^{M} A_{ij} \prod_{\substack{i=1 \\ i \neq l}}^{M} A_{ij}\, d^{2}}{\prod_{i=1}^{M} A_{ij}\, d + \prod_{i=1}^{M} B_{ij}\, (1 - d)}$

(6.11)

It should also be noted that the summation in Equation (6.11) is the same for all j.

By similar calculations, the inverse of the Fisher information matrix is obtained as follows:

$J^{-1} = \frac{1}{N} \begin{bmatrix} \bar{A} & \bar{C} \\ \bar{C}^{T} & \bar{B} \end{bmatrix}^{-1}$

where the klth elements of $\bar{B}$ and $\bar{C}$ are defined as:

$\bar{B}_{kl} = \sum_{X \in \mathcal{X}_{j}} \frac{(2X_{kj} - 1)(2X_{lj} - 1) \prod_{\substack{i=1 \\ i \neq k}}^{M} B_{ij} \prod_{\substack{i=1 \\ i \neq l}}^{M} B_{ij}\, (1 - d)^{2}}{\prod_{i=1}^{M} A_{ij}\, d + \prod_{i=1}^{M} B_{ij}\, (1 - d)}$

(6.12)

$\bar{C}_{kl} = \sum_{X \in \mathcal{X}_{j}} \frac{(2X_{kj} - 1)(2X_{lj} - 1) \prod_{\substack{i=1 \\ i \neq k}}^{M} A_{ij} \prod_{\substack{i=1 \\ i \neq l}}^{M} B_{ij}\, d (1 - d)}{\prod_{i=1}^{M} A_{ij}\, d + \prod_{i=1}^{M} B_{ij}\, (1 - d)}$

(6.13)

Note that the sums in $\bar{A}_{kl}$, $\bar{B}_{kl}$, and $\bar{C}_{kl}$ are over the $2^{M}$ different configurations of X_ij for i = 1,2,…,M at a given j. This is much smaller than the $2^{MN}$ configurations in $\mathcal{X}$.

This gives the actual CRLB. Note that more claims simply lead to better estimates of θ, as the variance decreases as $\frac{1}{N}$. The decrease in the variance of the estimates as a function of M is more complicated and can only be computed numerically. Note also that the actual CRLB computation needs the true values of the estimation parameters. In real applications, however, the true values are not known in advance, so we substitute the MLEs for the unknown true values as an approximation when estimating the variances used to determine the confidence bounds.
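To make the enumeration in Equations (6.11)-(6.13) concrete, the following is a minimal Python sketch, not the authors' implementation. It loops over the 2^M report patterns of a single claim, accumulates the per-claim matrix with blocks Ā, C̄, and B̄, inverts it, and divides by N. The function names `per_claim_fisher`, `invert`, and `actual_crlb`, and the parameter values in the usage lines, are assumptions made for illustration; d is treated as known and all a_i, b_i are assumed to lie strictly between 0 and 1.

```python
import itertools
import math


def per_claim_fisher(a, b, d):
    """Per-claim matrix [[A, C], [C^T, B]] built from Eqs. (6.11)-(6.13)."""
    M = len(a)
    J = [[0.0] * (2 * M) for _ in range(2 * M)]
    for x in itertools.product((0, 1), repeat=M):  # all 2^M report patterns
        A = [a[i] if x[i] else 1.0 - a[i] for i in range(M)]  # A_ij, Eq. (6.10)
        B = [b[i] if x[i] else 1.0 - b[i] for i in range(M)]  # B_ij, Eq. (6.10)
        p = math.prod(A) * d + math.prod(B) * (1.0 - d)  # common denominator
        # Numerator factors: (2x_k - 1) * prod_{i != k} A_ij * d for a_k,
        # and (2x_k - 1) * prod_{i != k} B_ij * (1 - d) for b_k.
        g = [(2 * x[k] - 1) * math.prod(A[:k] + A[k + 1:]) * d for k in range(M)]
        g += [(2 * x[k] - 1) * math.prod(B[:k] + B[k + 1:]) * (1.0 - d)
              for k in range(M)]
        for r in range(2 * M):
            for c in range(2 * M):
                J[r][c] += g[r] * g[c] / p
    return J


def invert(mat):
    """Gauss-Jordan inversion with partial pivoting (pure Python)."""
    n = len(mat)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def actual_crlb(a, b, d, N):
    """Inverse Fisher information for N claims: (1/N) * inverse of the per-claim matrix."""
    inv = invert(per_claim_fisher(a, b, d))
    return [[v / N for v in row] for row in inv]


# Hypothetical parameter values, chosen only to exercise the code.
a = [0.8, 0.7, 0.6, 0.9]
b = [0.2, 0.3, 0.1, 0.25]
bound = actual_crlb(a, b, d=0.5, N=1000)
print([bound[i][i] for i in range(len(bound))])  # variance bounds for a_1..a_M, b_1..b_M
```

The loop visits 2^M patterns, which is why this route stops being practical as M grows. For very small M the matrix can be singular (for example M = 2, where four parameters must be identified from only three free pattern probabilities), in which case `invert` raises an error.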

6.3 Asymptotic Cramer-Rao Lower Bound

Observe that the complexity of the actual CRLB computation in the previous section is exponential with respect to the number of sources (i.e., M) in the system. Therefore, it is inefficient (or infeasible) to compute the actual CRLB when the number of sources becomes large. In this section, we review the derivation of the asymptotic CRLB for efficient computation in a sensing topology with a large number of sources. The asymptotic CRLB is derived based on the assumption that the value of the hidden variable (i.e., z_j) can be correctly estimated from EM. This is a reasonable assumption when the number of sources is sufficiently large [100]. Under this assumption, the log-likelihood function of the MLE obtained from EM can be expressed as follows:

$\begin{aligned} l_{em}(x; \theta) = \sum_{j=1}^{N} \Big\{ & z_{j} \times \Big[\sum_{i=1}^{M} \big(X_{ij} \log a_{i} + (1 - X_{ij}) \log (1 - a_{i}) + \log d\big)\Big] \\ & + (1 - z_{j}) \times \Big[\sum_{i=1}^{M} \big(X_{ij} \log b_{i} + (1 - X_{ij}) \log (1 - b_{i}) + \log (1 - d)\big)\Big] \Big\} \end{aligned}$

(6.14)

The Fisher information matrix at the MLE was computed from the log-likelihood function given by Equation (6.14). The converged estimates of a_i and b_i from the EM of the previous chapter were used as the MLE.

Plugging l_em(x;θ) given by Equation (6.14) into the Fisher information defined in Equation (6.5), the representative element of Fisher information matrix from N claims was shown as:

$\left(J(\hat{\theta}_{MLE})\right)_{i,j} = \begin{cases} 0 & i \neq j \\ -E_{X}\left[\left.\frac{\partial^{2} l_{em}(x; a_{i})}{\partial a_{i}^{2}}\right|_{a_{i} = a_{i}^{0}}\right] & i = j \in [1, M] \\ -E_{X}\left[\left.\frac{\partial^{2} l_{em}(x; b_{i})}{\partial b_{i}^{2}}\right|_{b_{i} = b_{i}^{0}}\right] & i = j \in (M, 2M] \end{cases}$

(6.15)

where $a_{i}^{0}$ and $b_{i}^{0}$ are the true values of a_i and b_i. In the following computation, we estimate them by substituting the known MLEs for the unknown parameter values.

Substituting the log-likelihood function in Equation (6.14) and MLE of θ into Equation (6.15), the asymptotic CRLB (i.e., the inverse of the Fisher information matrix) can be written as:

$\left(J^{-1}(\hat{\theta}_{MLE})\right)_{i,j} = \begin{cases} 0 & i \neq j \\ \frac{\hat{a}_{i}^{MLE} \times (1 - \hat{a}_{i}^{MLE})}{N \times d} & i = j \in [1, M] \\ \frac{\hat{b}_{i}^{MLE} \times (1 - \hat{b}_{i}^{MLE})}{N \times (1 - d)} & i = j \in (M, 2M] \end{cases}$

(6.16)

Note that the asymptotic CRLB is independent of M under the assumption that M is sufficiently large, and it can be quickly computed from the MLE of the EM scheme.
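Equation (6.16) is a closed form, so it can be evaluated directly from the converged EM estimates. The short Python sketch below illustrates this; the function name `asymptotic_crlb` and the numeric inputs are hypothetical and not taken from the chapter.

```python
def asymptotic_crlb(a_hat, b_hat, d, N):
    """Diagonal of Eq. (6.16): variance bounds for each a_i and each b_i."""
    var_a = [ai * (1.0 - ai) / (N * d) for ai in a_hat]
    var_b = [bi * (1.0 - bi) / (N * (1.0 - d)) for bi in b_hat]
    return var_a, var_b


# Hypothetical MLEs for two sources, N = 100 claims, d = 0.5.
var_a, var_b = asymptotic_crlb([0.5, 0.8], [0.2, 0.0], d=0.5, N=100)
print(var_a, var_b)
```

The cost is linear in M, in contrast to the 2^M enumeration needed for the actual CRLB. The second source in the usage lines has b_i = 0, for which the bound on b_i evaluates to zero.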

6.4 Confidence Interval Derivation

This section shows that the confidence interval of source reliability can be obtained by using the CRLB derived in the previous sections and leveraging the asymptotic normality of the MLE.

The MLE possesses a number of attractive asymptotic properties. One of them is asymptotic normality, which states that the MLE is asymptotically Gaussian distributed as the data sample size grows. In particular [112]:

$\begin{array}{l} ({\hat{θ}}_{M L E} - θ_{0}) \overset{d}{\to} N (0, J^{- 1} ({\hat{θ}}_{M L E})) \end{array}$

(6.17)

where J is the Fisher information matrix computed from all samples, θ₀ and ${\hat{θ}}_{M L E}$ are the true value and the MLE of the parameter θ, respectively. The Fisher information at the MLE is used to estimate its true (but unknown) value [111]. Hence, the asymptotic normality property means that in a regular case of estimation and in the distribution limiting sense, the MLE ${\hat{θ}}_{M L E}$ is unbiased and its covariance reaches the CRLB (i.e., an efficient estimator).

From the asymptotic normality of the MLE [100], the error of the corresponding estimation on θ follows a normal distribution with zero mean and the covariance matrix given by the CRLB derived in the previous sections. Let us denote the variance of the estimation error on parameter a_i as $\mathrm{var}(\hat{a}_{i}^{MLE})$. Recall that the relation between source reliability (i.e., t_i) and the estimation parameters a_i and b_i is given by Equation (5.6). For a sensing topology with small values of M and N, the estimation of t_i has a complex distribution and its estimation variance can be approximated [109]. The denominator of t_i is equivalent to s_i based on Equation (5.6).* Therefore, $(\hat{t}_{i}^{MLE} - t_{i}^{0})$ also follows a normal distribution with zero mean and variance given by:

$\mathrm{var}(\hat{t}_{i}^{MLE}) = \left(\frac{d}{s_{i}}\right)^{2} \mathrm{var}(\hat{a}_{i}^{MLE})$

(6.18)

Hence, a confidence interval can be obtained to quantify the estimation accuracy of the MLE on source reliability. The confidence interval of the reliability estimation of source S_i (i.e., ${\hat{t}}_{i}^{M L E}$ ) at confidence level p is given by the following:

$\begin{array}{l} ({\hat{t}}_{i}^{M L E} - c_{p} \sqrt{v a r ({\hat{t}}_{i}^{M L E})}, {\hat{t}}_{i}^{M L E} + c_{p} \sqrt{v a r ({\hat{t}}_{i}^{M L E})}) \end{array}$

(6.19)

where c_p is the standard score (z-score) of the confidence level p. For example, for the 95% confidence level, c_p = 1.96. Therefore, the derived confidence interval of the source reliability MLE can be computed by using the CRLB derived earlier.
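As a worked illustration of Equations (6.18) and (6.19), the Python sketch below turns a variance bound on a_i into a confidence interval on t_i. The function name `reliability_interval` and the numeric inputs are hypothetical; s_i is passed in as a known quantity, and the standard score is taken from the standard library's normal distribution rather than from a table.

```python
import math
from statistics import NormalDist


def reliability_interval(t_hat, var_a, d, s_i, p=0.95):
    """Two-sided interval of Eq. (6.19) using the variance of Eq. (6.18)."""
    c_p = NormalDist().inv_cdf(0.5 + p / 2.0)  # standard score, about 1.96 for p = 0.95
    var_t = (d / s_i) ** 2 * var_a             # Eq. (6.18)
    half_width = c_p * math.sqrt(var_t)
    return t_hat - half_width, t_hat + half_width


# Hypothetical values: t_hat = 0.8, var(a_hat) = 4e-4, d = 0.5, s_i = 0.25.
low, high = reliability_interval(0.8, 4e-4, d=0.5, s_i=0.25, p=0.95)
print(low, high)
```

The interval is not clipped to [0, 1] in this sketch, so for estimates close to 0 or 1 the endpoints may fall outside the valid range of a reliability value.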

We reviewed how to compute the CRLB and the confidence interval of source reliability from the MLE of the EM algorithm. However, one problem that remains to be answered is how to estimate the accuracy of the claim classification (i.e., false positives and false negatives) without having the ground truth values of the claims at hand. In this subsection, we review a quick and effective method to answer this question under the maximum likelihood hypothesis.

The results of the EM algorithm not only offered the MLE on the estimation parameters (i.e., θ) but also the probability of each claim to be true, which is given by [100]:

$\begin{array}{l} Z_{j}^{*} = p (z_{j} = 1 | X_{j}, θ^{*}) \end{array}$

(6.20)

where X_j is the observed data of the claim C_j and θ* is the MLE of the parameter. Since the claim is binary, it is judged as true if Z_j*≥ 0.5 and false otherwise. Based on the above definition, the false positives and false negatives of the claim classification can be estimated as follows:

$\begin{aligned} FP &= \sum_{j : Z_{j}^{*} \geq 0.5}^{N} \left\{Z_{j}^{*} \times 0 + (1 - Z_{j}^{*}) \times 1\right\} \\ &= \sum_{j : Z_{j}^{*} \geq 0.5}^{N} (1 - Z_{j}^{*}) \end{aligned}$

(6.21)

$\begin{aligned} FN &= \sum_{j : Z_{j}^{*} < 0.5}^{N} \left\{Z_{j}^{*} \times 1 + (1 - Z_{j}^{*}) \times 0\right\} \\ &= \sum_{j : Z_{j}^{*} < 0.5}^{N} Z_{j}^{*} \end{aligned}$

(6.22)

where FP and FN stand for false positives and false negatives, respectively. From above equations, the estimated false positives and false negatives of the claim classification can be computed under the maximum likelihood hypothesis. This enables us to estimate the accuracy of the claim classification without knowing the ground truth values a priori.
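The estimates in Equations (6.21) and (6.22) need only the per-claim posteriors. The Python sketch below shows one way to compute them. The helper `claim_posterior` follows the expression for Z_j given alongside Equation (6.8), applied to a single claim's column of reports; both function names and the sample values are illustrative assumptions.

```python
import math


def claim_posterior(x_j, a, b, d):
    """Probability that claim j is true given its report column x_j (cf. Eq. (6.20))."""
    A_j = math.prod(ai if x else 1.0 - ai for x, ai in zip(x_j, a))
    B_j = math.prod(bi if x else 1.0 - bi for x, bi in zip(x_j, b))
    return A_j * d / (A_j * d + B_j * (1.0 - d))


def expected_errors(z_star):
    """Expected false positives and false negatives, Eqs. (6.21) and (6.22)."""
    fp = sum(1.0 - z for z in z_star if z >= 0.5)  # judged true, may be false
    fn = sum(z for z in z_star if z < 0.5)         # judged false, may be true
    return fp, fn


# Hypothetical posteriors for four claims.
fp, fn = expected_errors([0.9, 0.6, 0.2, 0.4])
print(fp, fn)
```

Both quantities are expected counts rather than integers, so they can be divided by N to obtain estimated error rates.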

In this section, we reviewed a confidence interval derivation in source reliability and proposed an accuracy estimator on the claim classification. This allows social sensing applications to assess the quality of their estimation on source reliability as well as the accuracy of claim classification. In the following section, we will review the evaluation of the performance of the computed confidence bounds on source reliability and the estimated false positives and false negatives on the claim classification.

6.5 Examples and Results

In this section, we review the evaluation of the performance of the credibility estimation and confidence quantification approach through both extensive simulation studies and a real world social sensing application. The reported CRLB results are computed upon the estimated a’s and b’s instead of the ground truth. In practice, it provides a sense of the sensitivity (or significance) of the estimated values. The same simulator as described in the previous two chapters was used. The prior d discussed in Section 6.1 is set to be 0.5 unless otherwise specified and the initial value of d assumed by the EM algorithm is set to be uniformly distributed between 0.4 and 0.6.

6.5.1 Evaluation of Confidence Interval

In this subsection, we review the evaluation of the accuracy of the confidence interval in source reliability derived in the previous section. Experiments were carried out over three different observation matrix scales: small, medium, and large. The simulation parameters are listed in Table 6.1. The total number of claims is the sum of the true and false ones. The average number of observations reported by each source is set to 100. For each observation matrix scale, the confidence interval in source reliability was computed based on Equation (6.19). The experiments were repeated 100 times for each observation matrix scale. Three representative confidence levels (i.e., 68%, 90%, and 95%) are used in the evaluation.

Table 6.1

Parameters of Three Typical Observation Matrix Scales

Observation Matrix Scale	Number of Sources	Number of True Claims	Number of False Claims
Small	100	500	500
Medium	1000	1000	1000
Large	10,000	5000	5000

Figure 6.1 shows the normalized probability density function (PDF) of the source reliability estimation error over three observation matrix scales. The experimental PDF was computed from the actual estimation error (i.e., compared to the ground truth) and the confidence interval derived in Section 6.4. The experimental PDF was compared with the standard Gaussian distribution to verify the asymptotic normality property of the estimation results. We observe that the experimental PDF matches the theoretical Gaussian distribution well over all three observation matrix scales.

Figure 6.1 Normalized source reliability estimation error PDF. (a) Small observation matrix. (b) Medium observation matrix. (c) Large observation matrix.

Figure 6.2 shows the comparison between the actual estimation confidence and the three different confidence levels that were set for the small observation matrix scenario. The actual estimation confidence is computed as the percentage of sources whose estimation error stays within the corresponding confidence bound in each experiment. This percentage represents the probability that a randomly chosen source keeps its reliability estimation error within the confidence bound. We observe that the actual estimation confidence obtained with the three different confidence bounds stays close to the corresponding confidence levels that were used for the experiment. Moreover, at higher confidence levels, a lower fluctuation of the actual estimation confidence is observed. Similar results are observed for the medium and large observation matrices, which are shown in Figures 6.3 and 6.4, respectively. Additionally, we note that the fluctuation of the actual estimation confidence decreases as the observation matrix scale increases. This is because the estimation variance characterized by the CRLB is inversely proportional to the number of claims in the system, which will be further evaluated in the next subsection.
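The "actual estimation confidence" described above is an empirical coverage rate, which can be sketched in a few lines of Python. The function name `empirical_coverage` and the sample numbers are hypothetical. The sketch assumes a symmetric bound of c_p standard deviations per source, as in Equation (6.19).

```python
def empirical_coverage(t_hat, t_true, std_t, c_p):
    """Fraction of sources whose estimation error stays within c_p * std."""
    inside = sum(
        1
        for est, truth, std in zip(t_hat, t_true, std_t)
        if abs(est - truth) <= c_p * std
    )
    return inside / len(t_hat)


# Hypothetical estimates, ground truth, and standard deviations for four sources.
coverage = empirical_coverage(
    t_hat=[0.80, 0.50, 0.30, 0.90],
    t_true=[0.82, 0.60, 0.31, 0.90],
    std_t=[0.02, 0.02, 0.02, 0.02],
    c_p=1.96,
)
print(coverage)
```

If the normal approximation holds, this fraction should stay close to the nominal confidence level when computed over many sources and repeated experiments.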

Figure 6.2 Source reliability estimation confidence for small observation matrix. (a) 68% confidence level. (b) 90% confidence level. (c) 95% confidence level.

Figure 6.3 Source reliability estimation confidence for medium observation matrix. (a) 68% confidence level. (b) 90% confidence level. (c) 95% confidence level.

Figure 6.4 Source reliability estimation confidence for large observation matrix. (a) 68% confidence level. (b) 90% confidence level. (c) 95% confidence level.

6.5.2 Evaluation of CRLB

In this subsection, we review the evaluation of the accuracy of derived CRLBs (both actual and asymptotic) in Sections 6.2 and 6.3 by comparing them to the actual estimation variance of the estimation parameter (i.e., a_i, b_i). The actual estimation variance is characterized by the average RMSE (square root of the mean squared error) of all sources.
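One possible reading of "average RMSE of all sources" is sketched below in Python: a per-source RMSE is computed across repeated experiments and then averaged over the sources. This reading, the function name `average_rmse`, and the sample numbers are assumptions made for illustration; the simulator itself is not shown in this chapter.

```python
import math


def average_rmse(estimates, truth):
    """estimates[r][i]: estimate for source i in experiment r; truth[i]: true value."""
    runs, M = len(estimates), len(truth)
    per_source = [
        math.sqrt(sum((estimates[r][i] - truth[i]) ** 2 for r in range(runs)) / runs)
        for i in range(M)
    ]
    return sum(per_source) / M


# Two hypothetical experiments over two sources.
print(average_rmse([[0.5, 0.2], [0.7, 0.2]], [0.6, 0.2]))
```

An RMSE is on the scale of a standard deviation, whereas the CRLB bounds a variance, so a comparison on equal footing would use the square root of the bound. How the reported figures are scaled is not stated here.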

Scalability study

The scalability of CRLB performance with respect to the sensing topology (i.e., M and N) was evaluated first. The first experiment evaluates the effect of the number of sources (i.e., M) in the system on the CRLB performance. It started with the actual CRLB evaluation. The numbers of true and false claims were each fixed at 1000, and the average number of observations per source was set to 100. The number of sources was varied from 5 to 31. Reported results are averaged over 100 experiments and are shown in Figure 6.5. Observe that the actual CRLB tracks the variance of the estimation parameters accurately even when the number of sources in the system is small (e.g., M ≤ 20). We also observe that the RMSE is smaller than the actual CRLB when there are too few sources. This is because the MLE is biased at those points due to the small dataset. As illustrated in Section 6.2, the computation of the actual CRLB does not scale with the number of sources in the system. Hence, the accuracy of the asymptotic CRLB was also evaluated when the number of sources becomes large. The experimental configuration was kept the same as above, but the number of sources was varied from 10 to 150. Results are shown in Figure 6.6. We observe that the asymptotic CRLB deviates from the actual estimation variance when the number of sources is small (e.g., M ≤ 20). However, as the number of sources in the network becomes sufficiently large, the RMSE converges to the asymptotic CRLB quickly and the difference between the two becomes insignificant.

Figure 6.5 Actual CRLB of a_i and b_i versus varying M. (a) Actual CRLB of a_i. (b) Actual CRLB of b_i.

Figure 6.6 Asymptotic CRLB of a_i and b_i versus varying M. (a) Asymptotic CRLB of a_i. (b) Asymptotic CRLB of b_i.

The second experiment compares the derived CRLBs (both actual and asymptotic) to the RMSE of the estimation parameters when the number of claims (i.e., N) changes. As shown in Sections 6.2 and 6.3, both the actual and asymptotic CRLB decrease as $\frac{1}{N}$. As before, the accuracy of the actual CRLB was evaluated first. The number of sources was fixed at 20, and the average number of observations per source was set to 100. The numbers of true and false claims were kept the same. The number of claims was varied from 1000 to 2000. Reported results are averaged over 100 experiments and are shown in Figure 6.7. We observe that the actual CRLB is able to track the RMSE of the estimation parameters correctly and that they both decrease approximately as $\frac{1}{N}$ when the number of claims increases. Similarly, the experiment was carried out to evaluate the accuracy of the asymptotic CRLB. The experimental configuration was kept the same as above, but the number of sources was set to 100. Results are shown in Figure 6.8. We observe that the asymptotic CRLB also closely follows the RMSE of the estimation parameters and that they decrease approximately as $\frac{1}{N}$ when the number of claims increases.

Figure 6.7 Actual CRLB of a_i and b_i versus varying N. (a) Actual CRLB of a_i. (b) Actual CRLB of b_i.

Figure 6.8 Asymptotic CRLB of a_i and b_i versus varying N. (a) Asymptotic CRLB of a_i. (b) Asymptotic CRLB of b_i.

Trustworthiness and assertiveness study

In the trustworthiness study, the estimation performance of the CRLB was evaluated when the ratio of trusted sources in the system changes. The trusted sources are the sources that always make correct observations (i.e., their reliability is 1), and the ratio of trusted sources is the number of trusted sources divided by the total number of sources in the system. The reliability of non-trusted sources is uniformly distributed in the range (0,1). The numbers of true and false claims were each fixed at 1000, the number of sources was set to 20, and each source reports 100 observations on average. The trusted source ratio was varied from 0 to 0.9. Reported results are averaged over 100 experiments and shown in Figure 6.9. Observe that the actual CRLB tracks the estimation variance tightly when the trusted source ratio changes. We also note that both the actual CRLB and the estimation variance of the estimation parameters improve as the trusted source ratio increases. The reason is that the estimation error decreases as the ratio of sources with t_i = 1 increases. This is also reflected by the fact that b_i = 0 for trusted sources, so the asymptotic variance goes to zero, as one can see in (6.16). Similarly, experiments were carried out to evaluate the accuracy of the asymptotic CRLB. The experiment configuration was kept the same as above, but the number of sources was set to 100. Results are shown in Figure 6.10. We observe that the asymptotic CRLB also follows the estimation variance of the estimation parameters correctly and that they improve as the trusted source ratio increases.

Figure 6.9 Actual CRLB of a_i and b_i versus trusted sources ratio. (a) Actual CRLB of a_i. (b) Actual CRLB of b_i.

Figure 6.10 Asymptotic CRLB of a_i and b_i versus trusted sources ratio. (a) Asymptotic CRLB of a_i. (b) Asymptotic CRLB of b_i.

In the assertiveness study, the estimation performance of the CRLB was evaluated when the assertiveness ratio of sources changes. The assertiveness ratio is defined as the input data size (in terms of the number of observations) normalized by a pre-defined data size. This ratio reflects the sparsity of the sensing topology when the algorithm starts to run. The numbers of true and false claims were each fixed at 1000. The number of sources is set to 20. The input data size that is used for the assertiveness ratio normalization (i.e., having an assertiveness ratio of 1) is set to 1000 observations per source. The assertiveness ratio was varied from 0.1 to 1. Reported results are averaged over 100 experiments and shown in Figure 6.11. We observe that the actual CRLB tracks the RMSE of the estimation parameters correctly as the assertiveness ratio changes. We also note that, as the assertiveness ratio increases, the estimation variance of parameter a_i first increases and then decreases, while the estimation variance of parameter b_i increases. The reason is that two factors affect the variance of the estimation parameters in different directions when the assertiveness ratio changes. One factor is the probability that a source reports a claim (i.e., s_i), defined in Section 6.1. This factor increases as the assertiveness ratio increases, which enlarges the estimation variance of a_i and b_i based on (5.5). The other factor is the estimation variance of the source reliability (i.e., t_i), which decreases as the assertiveness ratio increases. Hence, the estimation variance of a_i is first dominated by the first factor and then by the second one, while the estimation variance of b_i is dominated by the first factor over the evaluation range. Similar experiments were then carried out to evaluate the accuracy of the asymptotic CRLB. The experiment configuration was kept the same as above, but the number of sources was set to 100. Results are shown in Figure 6.12. We observe that the asymptotic CRLB also follows the variance of the estimation parameters tightly and that its trends are similar to those of the actual CRLB.

f06-11a-9780128008676 — Figure 6.11 Actual CRLB of a_i and b_i versus assertiveness ratio. (a) Actual CRLB of a_i. (b) Actual CRLB of b_i.

f06-11b-9780128008676 — Figure 6.11 Actual CRLB of a_i and b_i versus assertiveness ratio. (a) Actual CRLB of a_i. (b) Actual CRLB of b_i.

Figure 6.12 Asymptotic CRLB of a_i and b_i versus assertiveness ratio. (a) Asymptotic CRLB of a_i. (b) Asymptotic CRLB of b_i.

Robustness study

In the robustness study, the robustness (or sensitivity) of the estimation performance and the derived CRLBs was evaluated as the number of sources changes under different source reliability distributions. The key characteristic that determines the resilience of a network is the network topology. The social sensing topology is characterized by the link connections between sources and two sets of claims (i.e., true and false). The link connection skew is mainly determined by the source reliability distribution. Two representative network topologies were considered in the evaluation: scale-free and exponential. In the scale-free topology, sources have diverse reliability, and different reliability values are about equally probable. In the exponential topology, sources have similar reliability, and nodes with higher reliability are exponentially less probable. The experiments were done by source removal (i.e., sources are randomly selected and removed from the system). This represents the scenario where random sources decide to quit the sensing application or their sensing devices fail. Equivalently, the steps can be reversed to investigate the addition of sources.

In the first experiment, the estimation performance and the derived CRLBs were evaluated under the scale-free network topology. To generate the scale-free network topology, source reliability was drawn from a uniform distribution over its definition range. The actual CRLB was compared to the RMSE of the estimation parameters. The numbers of true and false claims were both fixed to 1000. The average number of observations per source was set to 100. The experiments started with 25 sources, and sources were gradually removed from the system. Figure 6.13 shows the actual CRLB and the RMSE of the estimation parameters. Observe that the estimation accuracy (i.e., RMSE) degrades gracefully and the actual CRLB tracks the RMSE reasonably well as the number of removed sources increases. Also note that the actual CRLB deviates slightly from the RMSE when the majority of sources are removed from the system. Similar experiments were then repeated for the asymptotic CRLB. These experiments started with 150 sources, and sources were again gradually removed from the system. Results are shown in Figure 6.14. The results for the asymptotic CRLB are similar to those for the actual CRLB.
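A minimal sketch of the source-removal procedure for the scale-free case is given below. The function names, the definition range of (0, 1) for source reliability, and the removal step of 5 sources are assumptions made for illustration; the text does not specify them.

```python
import random


def sample_uniform_reliability(num_sources, rng, low=0.0, high=1.0):
    """Draw one reliability per source, uniformly over an assumed range [low, high]."""
    return [rng.uniform(low, high) for _ in range(num_sources)]


def remove_random_sources(reliabilities, num_removed, rng):
    """Return the reliabilities left after removing randomly selected sources."""
    num_kept = len(reliabilities) - num_removed
    kept_indices = sorted(rng.sample(range(len(reliabilities)), num_kept))
    return [reliabilities[i] for i in kept_indices]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sources = sample_uniform_reliability(25, rng)  # the experiment starts with 25 sources

remaining_counts = []
current = sources
while len(current) > 5:
    current = remove_random_sources(current, 5, rng)  # step size is an assumption
    remaining_counts.append(len(current))
    # In a full experiment, the estimator and the CRLB would be re-evaluated here.

print(remaining_counts)
```

Reversing the loop, that is, starting from a few sources and adding randomly chosen ones, yields the equivalent source-addition experiment.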

Figure 6.13 Actual CRLB of a_i and b_i versus source removal of scale-free topology. (a) Actual CRLB of a_i. (b) Actual CRLB of b_i.

Figure 6.14 Asymptotic CRLB of a_i and b_i versus source removal of scale-free topology. (a) Asymptotic CRLB of a_i. (b) Asymptotic CRLB of b_i.

In the second experiment, the estimation performance and the derived CRLBs were evaluated under the exponential network topology. To generate the exponential network topology, source reliability was drawn from a normal distribution (with its mean at the midpoint of the definition range and a reasonably small variance). As before, the actual CRLB was compared to the RMSE of the estimation parameters. The standard deviation of the normal distribution of source reliability was set to 0.02; other settings were kept the same as in the first experiment. Figure 6.15 shows the actual CRLB and the RMSE of the estimation parameters. Observe that the RMSE increases gradually as the number of removed sources grows and that the actual CRLB tracks the RMSE well. Similar experiments were repeated for the asymptotic CRLB, with the experimental settings kept the same as in the first experiment. Results are shown in Figure 6.16. The results for the asymptotic CRLB are similar to those for the actual CRLB.
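The difference between the two reliability distributions can be illustrated with a short sketch. The function name, the definition range of (0, 1), and the clipping of samples to that range are assumptions; only the standard deviation of 0.02 and the choice of the midpoint as the mean come from the text.

```python
import random
import statistics


def sample_normal_reliability(num_sources, rng, low=0.0, high=1.0, std=0.02):
    """Draw reliabilities from a normal distribution centered at the range midpoint."""
    mean = (low + high) / 2
    values = []
    for _ in range(num_sources):
        t = rng.gauss(mean, std)
        values.append(min(max(t, low), high))  # clip to the assumed definition range
    return values


rng = random.Random(0)  # fixed seed so the sketch is repeatable
exp_sources = sample_normal_reliability(25, rng)        # exponential topology
uni_sources = [rng.uniform(0.0, 1.0) for _ in range(25)]  # scale-free topology

spread_exp = statistics.pstdev(exp_sources)
spread_uni = statistics.pstdev(uni_sources)
print(f"spread (normal, std=0.02): {spread_exp:.3f}")
print(f"spread (uniform):          {spread_uni:.3f}")
```

The much smaller spread in the normal case reflects the statement that sources in the exponential topology have similar reliability.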

Figure 6.15 Actual CRLB of a_i and b_i versus source removal of exponential topology. (a) Actual CRLB of a_i. (b) Actual CRLB of b_i.

Figure 6.16 Asymptotic CRLB of a_i and b_i versus source removal of exponential topology. (a) Asymptotic CRLB of a_i. (b) Asymptotic CRLB of b_i.

For both the scale-free and exponential topologies of social sensing, the above results show that the estimation performance is relatively robust (or insensitive) to changes in the number of sources in the network. Both the actual and asymptotic CRLBs are able to track the estimation performance as long as a limited number of sources stay in the system.
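What it means for a CRLB to "track" the RMSE can be seen in a toy setting. The sketch below does not use the chapter's source-claim model; it swaps in a single Bernoulli parameter, for which the CRLB on the variance of an unbiased estimator is the textbook value p(1 - p)/n, and compares the square root of that bound with the empirical RMSE of the MLE averaged over 100 experiments. All names are hypothetical.

```python
import math
import random


def bernoulli_crlb(p, n):
    """CRLB on the variance of an unbiased estimator of p from n Bernoulli draws."""
    return p * (1 - p) / n


def empirical_rmse(p, n, num_experiments, rng):
    """RMSE of the MLE (the sample mean) over repeated simulated experiments."""
    squared_errors = []
    for _ in range(num_experiments):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hat = successes / n  # MLE of p
        squared_errors.append((p_hat - p) ** 2)
    return math.sqrt(sum(squared_errors) / num_experiments)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
p_true = 0.7
results = []
for n in (100, 400, 1600):
    bound = math.sqrt(bernoulli_crlb(p_true, n))
    rmse = empirical_rmse(p_true, n, num_experiments=100, rng=rng)
    results.append((n, bound, rmse))
    print(f"n={n:5d}  sqrt(CRLB)={bound:.4f}  RMSE={rmse:.4f}")
```

In this toy case the RMSE stays close to the square root of the bound as the amount of data changes, which is the same kind of agreement the figures above report for a_i and b_i.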

6.5.3 Evaluation of Estimated False Positives/Negatives on Claim Classification

In this subsection, the estimated false positives/negatives on claim classification were evaluated by comparing them to the actual false positives/negatives (i.e., the ones computed from the ground truth). Experiments similar to those in the previous subsection were carried out to evaluate the performance of these estimates through scalability, trustworthiness, assertiveness, and robustness studies.