Chapter 4
Modeling and Analysis of Fault Detection and Fault Tolerance in Embedded Wireless Sensor Networks*

Wireless sensor networks (WSNs) consist of spatially distributed autonomous sensor nodes that collaborate with each other to perform an application task. A sensor node comprises a sensing unit (a sensing unit contains sensors such as temperature and humidity sensors), a processing unit, a storage unit, a communication unit, a power unit, an actuator unit, and an optional location finding unit [44].

Figure 4.1 shows the WSN architecture that we consider in this chapter [81–83]. The sensor nodes distributed in the sensor field gather information (sensed data or statistics) about an observed phenomenon (e.g., environment, target) using attached sensors. A group of sensor nodes, located geographically close to each other, is called a WSN cluster. Each WSN cluster has one cluster head (WSN cluster formation, and cluster head determination/maintenance is beyond the scope of this chapter). Sensed data within a WSN cluster is collected by the cluster head and is relayed to a sink node (or base station) via the sensor nodes' ad hoc network. The sink node transmits the received information back to the WSN designer and/or WSN manager via a gateway node connected to a computer network. The WSN designer is responsible for designing the WSN for a particular application to meet application requirements such as lifetime, reliability, and throughput. After WSN design and deployment, the WSN manager manages WSN operations, such as data analysis, monitoring alive and dead sensor nodes, and alarm conditions (e.g., forest fire, volcano eruption).

c04f001

Figure 4.1 Wireless sensor network architecture

WSN research and design have gained significance in recent years because of the increasing proliferation of WSNs in a wide variety of application domains. These application domains include, but are not limited to, mission-critical (e.g., security, defense, space, satellite) or safety-related (e.g., health care, active volcano monitoring) systems. The utility of WSNs for safety-related systems can be illustrated by a volcano monitoring example. Studying active volcanoes typically requires sensor arrays to collect seismic and infrasonic (low-frequency acoustic) signals. Distributed sensor nodes in a WSN provide a correlated spatial distribution of seismic and infrasonic events that greatly facilitate scientific studies of wave propagation phenomena and volcanic source mechanisms [84]. The analytics of seismic and infrasonic data obtained by a WSN can help the scientists in better prediction of volcanic events, such as volcanic eruption and earthquakes. Although WSNs can benefit numerous applications, the realization of WSNs for an application requires addressing various design challenges.

A crucial challenge in WSN design is to meet varying application requirements (e.g., lifetime, throughput, reliability), given the limited energy, computation, and storage available on sensor nodes. Sensor nodes are often deployed in hostile environments, such as forests, the ocean floor [85], animal habitats [86], and active volcanoes [84], that further exacerbates the design challenge of meeting application requirements. Unattended and/or hostile deployment environments make sensor nodes more susceptible to failures than other systems [84], and manual inspection of sensor nodes after deployment can be impractical. Although faults can occur in any of the sensor node components, sensors and actuators have significantly higher fault rates than other semiconductor-based systems. For example, NASA aborted the launch of space shuttle Discovery [87] because of a sensor failure in the shuttle's external tank [88]. Sensor failure can occur due to deployment impact, accidents (animal, vehicular), fire, extreme weather, or aging [89]. Failed sensor nodes may result in sensor network partitioning (i.e., sensor nodes become isolated and are disconnected from the sensor network), reduced WSN availability, and WSN failure. In order to meet application requirements in the presence of sensor failures, incorporation of fault detection and fault tolerance (FT) mechanisms in WSNs is imperative.

While traditional reliability models can be readily applied to FT systems, faults do not always result in failure. Transient, malicious errors occur due to noise, voltage levels, and broken components. Therefore, it is imperative to model the ability for fault detection to provide coverage against such faults to protect mission-critical data. Fault detection includes distributed fault detection (DFD) algorithms that identify faulty sensor readings to diagnose a faulty sensor. DFD algorithms do not incur additional transmission cost as they use existing network traffic to identify sensor failures. The accuracy of a fault detection algorithm signifies the algorithm's ability to accurately identify faults. We analyze and study two well-cited fault detection algorithms and compare their performance under realistic conditions. We have implemented and simulated the DFD algorithms incorporating the protocol stack and a realistic WSN topology as opposed to only algorithm simulation that was done in most previous works [90, 91].

Although fault detection helps in isolating faulty sensors, FT incorporation in WSNs is imperative to reliably accomplish application tasks. A prominent FT technique is to add redundant hardware and/or software [92]. Stringent design constraints (e.g., power, cost) differentiate WSNs from other systems, and consequently the added redundancy for FT must justify the additional cost. Since sensors (e.g., temperature, light, motion) attached to sensor nodes have comparatively higher fault rates than other components (e.g., processor, transceiver) [88, 93, 94], sensor redundancy would be the most effective to enhance FT capability of sensor nodes. Fortunately, sensors are cheap and adding redundant spare sensors would add little to the individual sensor node's cost.

Although FT is a well-studied research field [95–100], fault diagnosis and FT for WSNs are relatively unstudied. Varying FT requirements across different applications increase the complexity of fault detection and FT for WSNs. For instance, mission-critical applications have relatively high reliability requirements as compared to non-mission-critical applications such as ambient conditions monitoring. To the best of our knowledge, there exists no sensor node model to provide better reliability for mission-critical applications. Since applications are typically designed to operate reliably for a certain period of time (i.e., lifetime requirement), FT metrics such as reliability and mean time to failure (MTTF) need to be considered in WSN design. WSN designers require a model to estimate FT metrics and determine necessary redundancy during design time. Unfortunately, the literature provides no rigorous mathematical model that provides insights into WSN reliability and MTTF. Previous works study fault detection and FT in isolation, and their synergistic relationship has not been investigated in the context of WSNs.

This chapter's highlights are as follows:

  • Fault diagnosis in WSNs through various fault detection algorithms.
  • Simulation of fault detection algorithms using nsc04-math-00012 to determine the algorithms' accuracy and false alarm rates. nsc04-math-00022 is a discrete event simulator that provides support for the simulation of networks and routing protocols over wired and wireless networks [101]. We further analyze the effectiveness of fault detection algorithms under conditions modeled from the real-world data.
  • Proposal of an FT sensor node model consisting of duplex sensors (i.e., one active sensor and one inactive spare sensor) that exploits the synergy of fault detection and FT. Although our model can be extended for sensors with N-modular redundancy (NMR) [92], we suggest duplex sensor nodes to minimize the additional cost.
  • Characterization of FT parameters, such as coverage factor, by exploiting the synergy between fault detection and FT.
  • Proposal of hierarchical Markov models to characterize WSN reliability and MTTF. For the first time, we delineate reliability and MTTF hierarchically at the sensor node, WSN cluster, and WSN levels (Fig. 4.1).
  • Determination of iso-MTTF (isoreliability) for WSN clusters.1 We define iso-MTTF (isoreliability) as how many redundant sensor nodes a non-fault tolerance (NFT) WSN cluster requires over an FT WSN cluster to achieve an equal MTTF (reliability).
  • Research challenges and future research directions for the design and deployment of reliable and trustworthy WSNs.

In our duplex sensor model, we assume that the redundant sensor is in a cold standby mode, which ideally consumes no power. Although we focus on sensor failures within the sensor node in this study due to higher sensor failure rates as compared to other sensor node components [88, 93], our model can be extended to include failures for other components within the sensor node such as the processor and transceiver. Our FT WSN modeling serves as a first step toward FT WSN modeling and can potentially facilitate further research in FT WSN modeling and evaluation.

Our Markov models are comprehensive and characterize sensor nodes, WSN clusters, and overall WSN reliability and MTTF. The proposed Markov modeling enables WSN designers to investigate the effects of different types of FT sensor nodes (e.g., duplex, NMR), number of WSN clusters, and the number of sensor nodes in the cluster on the FT of the overall WSN. Our hierarchical WSN Markov models enable designers to select an efficient WSN topology to better meet application requirements.

The remainder of this chapter is organized as follows. Section 4.1 gives a review of related work. Section 4.2 elaborates on fault diagnosis in WSNs. Section 4.3 discusses two fault detection algorithms to elucidate the fault diagnosis mechanism in WSNs. Section 4.4 describes our Markov models for characterizing WSN reliability and MTTF. Implementation and simulation of fault detection algorithms using nsc04-math-00032 are presented in Section 4.5. Numerical results from our Markov modeling are presented in Section 4.6. Section 4.7 highlights research challenges and future research directions for the design and deployment of reliable and trustworthy WSNs. Finally, Section 4.8 concludes the chapter.

4.1 Related Work

Despite fault detection and FT being well-studied research fields [95, 98, 100], little work exists in WSN fault detection and FT. This section summarizes some of the previous works in the literature related to fault detection and FT.

4.1.1 Fault Detection

Jiang [102] proposed a DFD scheme that detected faulty sensor nodes by exchanging data and mutually testing neighboring nodes. Jian-Liang et al. [103] proposed a weighted median fault detection scheme that used spatial correlations among the sensor measurements (e.g., temperature, humidity). Lee and Choi [104] presented a DFD algorithm that identified faulty sensor nodes based on comparisons between neighboring sensor nodes' data. The DFD algorithm used time redundancy to tolerate transient faults in sensing and communication. Khilar and Mahapatra [105] proposed a probabilistic approach to diagnose intermittent WSN faults. The simulation results indicated that the DFD algorithm's accuracy increased as the number of diagnostic rounds increased where neighboring sensor nodes exchanged measurements in each round.

Ding et al. [91] proposed algorithms for faulty sensor identification and FT event boundary detection. The algorithms considered that both the faulty sensors and normal sensors in an event region could generate abnormal readings (readings that deviate from a typical application-specific range). Krishnamachari and Iyengar [106] proposed a distributed Bayesian algorithm for sensor fault detection and correction. The algorithm considered that measurement errors due to faulty equipment are likely to be uncorrelated. Wu et al. [107] presented a fault detection scheme in which the fusion center (the node that aggregated data from different nodes) attempted to identify faulty sensor nodes through temporal sequences of received local decisions using a majority voting technique. Lo et al. [108] presented a distributed, reference-free fault detection algorithm based on local pairwise verification between sensor nodes monitoring the same physical phenomenon. The authors observed that a linear relationship existed between the outputs of a pair of sensor nodes that could be exploited to detect faulty sensor nodes. Since the fault detection was done pairwise, the algorithm was able to conserve energy. Results revealed that the proposed algorithm could achieve a detection accuracy of 84% and a false alarm rate of 0.04%.

Miao et al. [109] proposed agnostic diagnosis for detecting silent failures (i.e., failures with unknown types and symptoms). The proposed detection technique exploited the fact that a sensor node's metrics (e.g., radio on-time, number of packets transmitted in a time interval) exhibited certain correlation patterns, violation of which indicated potential silent failures. The detection accuracy of the proposed scheme was close to 100% for small WSNs, whereas the detection accuracy decreased sharply and false alarm rate increased as the WSN size increased. The proposed technique required a sink node to collect data from all the sensor nodes in the WSN, which resulted in rapid energy depletion of the sink node and sensor nodes near the sink node. Furthermore, the fault detection latency of the proposed scheme was high due to its centralized nature.

There exists some work related to anomaly detection in WSNs. Bhargava and Raghuvanshi [110] proposed a method for anomaly detection in WSNs based on S-transform (an extension of continuous wavelet transform). The S-transform extracted the features from sensor nodes' data set, which were used for training of support vector machine (SVM). The SVM was then used for the classification of normal and anomalous data. Results revealed that the proposed scheme's accuracy for data classification ranged from 78% to 94%. Salem et al. [111] proposed an anomaly detection algorithm for medical WSNs. The proposed algorithm first classified instances of sensed patient attributes as normal and abnormal. The algorithm then used regression prediction to discern between a faulty sensor reading and a patient entering into a critical state. The results demonstrated the algorithm's ability to achieve a relatively low false alarm rate (c04-math-00041%) and a good detection accuracy (the attained accuracy was not specified) for the medical WSN application.

4.1.2 Fault Tolerance

In work related to FT for WSNs, Koushanfar et al. [94] proposed an FT scheme that provided backup for one type of sensor using another type of sensor, but they did not propose any FT model. Clouqueur et al. [112] presented algorithms for collaborative target detection in the presence of faulty sensors. Chiang et al. [113] designed system-level test interfaces for remote testing, repair, and software upgrade for sensor nodes. The authors evaluated the proposed test interfaces on Texas Instrument's MSP430 microcontroller-based sensor nodes. Experimental results revealed that the test interfaces with double, triple, and quadruple redundancies increased the WSN's availability.

Krasniewski et al. [114] proposed a protocol to diagnose and mask arbitrary sensor node failures in an event-driven WSN. The protocol considered a clustered WSN with rotating cluster heads. The protocol assigned each sensor node a trust index to indicate the sensor node's track record in reporting past events correctly. The cluster head analyzed the event reports using the trust index and made event decisions. The authors simulated the protocol using nsc04-math-00052 and observed that the protocol was able to detect events correctly even if more than 50% of the sensor nodes were compromised. Sun et al. [115] introduced a trust mechanism to evaluate the trustworthiness of a sensor node and the sensor node's data. The authors defined memory depth to characterize the temporal correlation and aggregation bandwidth to characterize spatial correlation. A sensor node used memory depth to calculate self-trustworthiness of the sensor node's data based on stored historical/past data and the current data. When a sensor node reported its sensed data to the aggregator, the aggregator's trustworthiness to the reported result depended both on the trustworthiness of the sensor node reporting data and the trustworthiness of the sensor node to its sensed data. Subsequently, each aggregator reported the aggregated result and the aggregator's self-trust opinion to the upper layer aggregator progressively, and so on until the aggregated result reached the sink node. Results demonstrated the effectiveness of the proposed mechanism for continuous media streaming and discrete data in wireless multimedia sensor networks (WMSNs).

There exists some work on providing FT in WSNs by deploying relay nodes and considering connectivity as an FT metric. Relay nodes communicate with sensor nodes, other relay nodes, and sink nodes to prolong WSN lifetime. Zhang et al. [116] developed approximation algorithms for determining a minimum number of relay nodes along with the relay nodes' placement to achieve certain connectivity requirements. Han et al. [117] considered the problem of deploying relay nodes to provide FT in heterogeneous WSNs where sensor nodes had different transmission radii. They developed approximation algorithms for full and partial FT relay node placement. Full FT relay node placement deployed a minimum number of relay nodes to establish disjoint paths between every sensor and/or relay node pair, whereas partial FT relay node placement only considered sensor node pairs. Baldi et al. [118] evaluated gossip algorithms, which are distributed algorithms that distribute computational burden across all nodes and considered connectivity as an FT metric.

Sen et al. [119] introduced region-based connectivity: an FT metric defined as the minimum number of nodes within a region whose failure would disconnect the network. Alwan and Agarwal [120] provided a survey of FT routing techniques in WSNs. Souza [121] presented a framework for failure management in WSNs, focusing on fault diagnosis and recovery techniques. The presented FT framework mitigated the failure propagation in a business (enterprise) environment by implementing different FT techniques.

4.1.3 WSN Reliability Modeling

There exists some work on FT WSN modeling. Cai et al. [122] presented a reliability model to prolong the network lifetime and availability based on connectivity and coverage constraints. Zhu and Papavassiliou [123] presented a model that characterized sensor connectivity and investigated the trade-offs among sensor node connectivity, power consumption, and data rate. They also discussed the impact of sensor connectivity on system reliability. Vasar et al. [124] presented Markov models for WSN reliability analysis. They presented a reliability comparison for various numbers of defective components' replacements with hot-standby redundant components. Xing and Michel [125] presented WSN reliability and security modeling in an integrated manner. Their modeling technique differentiated two types of WSN failures: security failures due to malicious intrusions and traditional failures due to malfunctioning components.

Moustapha and Selmic [88] used recurrent neural networks to model sensor node dynamics for sensor fault detection. Their network model corresponded to the WSN topology such that the recurrent neural network input was taken from the modeled sensor node and neighboring sensor nodes. Kannan and Iyengar [126] developed a game-theoretic model of reliable length- and energy-constrained routing in WSNs. They showed that optimal length-constrained paths could be computed in polynomial time in a distributed manner using geographic routing. Mukhopadhyay et al. [127] presented a method that used sensor data properties to enable reliable data collection. The method consisted of predictive models based on temporal correlation in sensor data. They demonstrated that their method could handle multiple sources of errors simultaneously and could correct transient errors arising in sensor node hardware and wireless communication channels.

Although DFD algorithms were proposed in the literature for detecting sensor faults, the fault detection was not leveraged to provide FT. There exists work in the literature regarding FT in WSNs [94, 116, 117, 119, 122]; however, FT metrics such as reliability and MTTF were not investigated rigorously in the context of WSNs. Specifically, reliability and MTTF were not considered hierarchically at the sensor node, WSN cluster, and WSN level. This hierarchical characterization of FT metrics facilitates a WSN designer to determine an appropriate WSN topology (i.e., number of clusters and sensor nodes with required FT capability in each cluster).

4.2 Fault Diagnosis in WSNs

The sensor nodes in WSNs are often deployed in unattended and possibly hostile environments that make sensor nodes susceptible to failures. Faults in sensor nodes can occur at various levels, such as processing unit, transceiver, storage unit, power unit, sensors, and actuators. Since faults are inevitable, it is imperative to determine faulty and fault-free sensor nodes in the WSN. Traditional fault diagnosis techniques that are developed for multiprocessor systems are not directly applicable to WSNs due to the sensor nodes' stringent resource constraints. This section describes sensor faults and then elaborates on fault diagnosis in WSNs.

4.2.1 Sensor Faults

The faults in sensor nodes include permanent faults where a node becomes dead and Byzantine faults where the node behaves arbitrarily or maliciously. Figure 4.2 shows an example of Byzantine faults where sensor nodes A–E are equipped with temperature sensors and process the measured reading to give temperature in Fahrenheit. The sensor nodes send the sensor node's sensed temperature readings to other neighboring nodes. The temperature sensor in the sensor node D behaves maliciously and sends inconsistent and arbitrary temperature values to other nodes (60, 100, 150, and 10 to A, B, C, and E, respectively). The rest of the sensor nodes send consistent values to other neighboring nodes. As a consequence, nonfaulty sensor nodes obtain different global information about the temperature in the region and may conclude on a false value of the overall temperature of the region. The fault detection algorithm needs to be robust to such inconsistent behavior that can jeopardize the collaboration in the WSN [112].

c04f002

Figure 4.2 Byzantine faulty behavior in WSNs

Several efforts have been made to classify malicious sensor behavior [93, 128]. While different sensor fault taxonomies have been proposed, three particular behaviors are targeted by DFD methods. The outlier faults occur when a sensor signal spikes in value (Fig. 4.3(a)) and can be detected by comparing sensor readings to previous readings or neighboring sensor nodes' readings. The stuck-at faults occur when the output remains at a constant level (Fig. 4.3(b)). The noisy faults occur when the signal-to-noise ratio of the sensor's output is low, resulting in random data (Fig. 4.3(c)). An effective fault detection algorithm must be able to identify the broadest possible range of malicious output while minimizing fault positives. Because the broad range of sensor failures result in a mixed detection rate for different algorithms, accurate simulation is important for studying sensor failures [93].

c04f003

Figure 4.3 Various types of sensor faults [93]: (a) outlier faults; (b) stuck-at faults; (c) noisy faults

4.2.2 Taxonomy for Fault Diagnosis Techniques

A fault diagnosis system is a monitoring system that detects faulty sensor nodes and their location in the WSN. Fault diagnosis techniques are classified based on the nature of tests, correlation between sensor readings, and characteristics of sensor nodes [129]. The key component of a fault diagnosis system is a fault detection algorithm, and the key terminology includes correctness, latency, detection accuracy, and false alarm rate. A fault diagnosis is correct if no fault-free sensor nodes are mistakenly diagnosed as faulty. Latency is defined as the time elapsed since the appearance of the fault to the isolation of the faulty sensor node. Detection accuracy is defined as the ratio of the number of faulty sensor nodes detected to the actual number of faulty sensor nodes in the network. The accuracy of the fault detection algorithm is a crucial factor in maintaining reliable WSN operation. The ratio of the number of fault-free sensor nodes diagnosed as faulty to the actual number of fault-free sensor nodes is the false alarm rate. Fault detection techniques for WSN can be broadly categorized into centralized and distributed approaches [129].

4.2.2.1 Centralized Approaches

In a centralized approach, a geographically or logically centralized arbiter (i.e., sink node or base station) with higher computational power, larger memory size, and a greater energy supply than ordinary sensor nodes is responsible for fault detection and diagnosis of the overall WSN. The sink node periodically sends diagnostic queries into the network to obtain the state of the individual sensor nodes. The sink node then analyzes the received diagnostic response messages to determine faulty and fault-free sensor nodes. Centralized approaches are accurate; however, these approaches cannot be scaled for large-scale WSNs. The scalability of centralized approaches is limited because it is expensive for the sink node to accumulate and analyze diagnostic information from all the sensor nodes in the WSN. Furthermore, centralized approaches lead to rapid energy depletion in certain regions of the network, in particular, nodes closer to the sink node. The energy depletion in sensor nodes closer to the sink node causes network partitions, which results in unmonitored areas in the WSN. Moreover, the detection latency of centralized approaches is large due to multihop communication. Owing to these limitations of centralized approaches, distributed approaches are highly preferable in WSNs.

4.2.2.2 Distributed Approaches

In distributed approaches, each sensor node executes the fault detection algorithm and generates a localized fault view. The localized fault view is a sensor node's view regarding the fault states of the sensor node's one-hop neighbors. This localized fault view is then disseminated in the network such that each fault-free sensor node correctly diagnoses the state of all the sensor nodes in the network. Distributed approaches conserve the sensor nodes' energy and consequently prolong the WSN lifetime. Distributed approaches can be classified into various types [129]:

Test-Based Approaches: In test-based approaches, different tests (tasks) are assigned to sensor nodes and faulty sensor nodes are identified based on test results. Test-based approaches are further classified into invalidation-based and comparison-based approaches.

In invalidation-based approaches, every sensor node tests a set of sensor nodes. Each sensor node then passes the test results to other sensor nodes based on which a consensus is made on the faulty sensor nodes. Each sensor node must be tested by at least t one-hop neighbors to achieve a t-diagnosability. The greater the t, the greater the message overhead. In most of invalidation-based approaches, the number of faulty sensor nodes that can be detected is upper bounded by t. The detection accuracy of these approaches is close to 100% and the false alarm rate is close to zero; however, the detection latency exhibited by most of these approaches is high.

In comparison-based approaches, different tests (tasks) are assigned to a pair of sensor nodes and the results of these tasks are compared. The agreement and disagreement between these results form the basis for diagnosing faulty sensor nodes. Comparison-based approaches have comparatively less message and time complexity overhead than the invalidation-based approaches.

Neighbor Coordination Approaches: In neighbor coordination approaches, one-hop neighbors coordinate with each other to determine whether or not to disregard their own sensor readings based on the neighboring sensors readings or on the weights based on physical distances from the event and trustworthiness of the sensor node's measurements. Neighbor coordination approaches are further categorized into majority voting and weighted majority voting approaches.

Majority voting approaches exploit the fact that the faulty measurements are uncorrelated while nonfaulty measurements are spatially correlated. For example, assume that a sensor node c04-math-0006 is a neighbor of c04-math-0007, and c04-math-0008 and c04-math-0009 are the readings for c04-math-0010 and c04-math-0011, respectively. Sensor readings c04-math-0012 and c04-math-0013 are similar when c04-math-0014 where c04-math-0015 is application dependent and typically a small number. Majority voting techniques can provide good detection accuracy and low false alarm rates. However, majority voting techniques are topology dependent as the effectiveness of these approaches depends on the node degree in the WSN.

Weighted majority voting approaches weigh the sensing measurements based on properties such as physical distance from the event and trustworthiness of the measurements. Unlike simple majority voting, these weighted measurements are used to decide the state of a sensor node (i.e., faulty or nonfaulty) in a WSN. Weighted majority voting approaches are computationally more expensive than simple majority voting approaches. Weighted majority voting approaches are also topology dependent such as simple majority voting approaches (i.e., majority voting approaches accuracy degrades drastically in sparse networks).

Hierarchical Detection: In hierarchical detection, a spanning tree with the sink node as the root node is constructed that spans all fault-free sensor nodes in the WSN. This spanning tree is then used to determine faults at each level of the tree. The fault results are disseminated across the spanning tree such that each node in the network correctly diagnoses the fault states of all the sensor nodes in the WSN. The detection latency in a hierarchical detection approach is high because the diagnosis process is started either by the sink node or the leaf nodes and requires multihop communication of diagnostic messages. Furthermore, similar to the centralized approach, the hierarchical detection approach leads to rapid energy depletion in certain regions of the WSN, in particular, in sensor nodes close to the sink.

Node Self-Detection: In sensor node self-detection, the sensor node detects its own status (faulty or nonfaulty) with the help of additional hardware incorporated in the sensor node. Since the sensor node self-detection approach requires additional hardware, the approach increases hardware complexity, weight, and energy consumption.

Clustering-Based Approaches: Clustering-based approaches divide the overall WSN into clusters, which are groups of sensor nodes located geographically close to each other. Each cluster head executes a fault detection algorithm in the cluster head's group using a centralized or distributed approach. Clustering-based diagnostic approaches are communication-efficient; however, the cluster head requires more energy for leading the diagnostic process.

Watchdog Approaches: In watchdog approaches, a sensor node monitors whether the sensor node's packets are forwarded by the sensor node's one-hop neighbor by overhearing the communication channel. If a sensor node's neighboring node does not forward a packet within a certain time, the neighboring node is viewed as misbehaving. When the misbehaving rate exceeds a threshold, the misbehaving node is diagnosed as faulty. The source node then sends packets along other routes avoiding the diagnosed faulty node.

Probabilistic Approaches: Probabilistic approaches exploit the fact that sensor failure probability for different sensor nodes is not identical in a time interval that is not infinitesimally small. Probabilistic approaches normally leverage Bayesian or other statistical methods for fault diagnosis.

Event-Driven Diagnosis: In event-based WSNs as opposed to data-driven or query-driven WSNs, sensor nodes only report events of interest to the sink node in a timely manner. Fault diagnosis in event-driven WSNs requires special consideration for fault-event disambiguation since an event also causes abnormal data to be sensed by the sensor nodes that could be interpreted as a faulty measurement. Event detection approaches work effectively in dense networks with relatively low fault probability of sensor nodes. Most of the event detection approaches fail in distinguishing between events and faults if faults are located at the event boundary.

4.3 Distributed Fault Detection Algorithms

DFD algorithms are those in which each sensor node executes a fault detection algorithm to diagnose the status of the sensor node's neighboring sensor nodes and to obtain a localized fault view (Section 4.2.2.2). The localized fault view is then shared with neighboring sensor nodes to reach a consensus on faulty sensor nodes. We analyze different fault detection algorithms [90, 91, 102, 103, 106]; however, we present two fault detection algorithms for explaining the fault detection algorithms' functionalities:

4.3.1 Fault Detection Algorithm 1: The Chen Algorithm

The first algorithm, which we refer to as the Chen algorithm [90], relies on simple majority counts to determine faults. The approach is practical because sensor nodes have limited computing ability; hence, lightweight calculations are preferred over intensive calculations. Each sensor node c04-math-0016 can exist in four states: good (GD), faulty (FT), likely good (LG), or likely faulty (LF). A sensor node in likely states is unsure if the sensor node is faulty and uses the neighbors' states of sensor nodes to finalize the sensor node's decision. The algorithm's implementation complexity is low and the probability of correct diagnosis is high.

algorithm

Table 4.1 lists the notations used in the Chen algorithm [90]. The neighboring sensor nodes in a WSN are those that are within the transmission range of each other. Each sensor node sends the sensor node's sensed values to the sensor node's neighboring sensor nodes. A sensor c04-math-0044 generates test result c04-math-0045 based on the sensor node's neighbor c04-math-0046's measurements using variables c04-math-0047, c04-math-0048, and threshold values c04-math-0049 and c04-math-0050. The sensors c04-math-0051 and c04-math-0052 are both good or both faulty if c04-math-0053; however, the sensor nodes have a different good or faulty status if c04-math-0054. Each sensor node sends the sensor node's tendency values (i.e., estimated state) to all the sensor node's neighboring sensor nodes. The test values from neighboring sensors determine the tendency value of sensors to be LG or LF, and the number of LG sensors with the same results determines whether the sensors are GD or FT. Precisely, c04-math-0055 and c04-math-0056, c04-math-0057 must be greater than or equal to c04-math-0058 for c04-math-0059 to be identified as good. The algorithm diagnoses a good sensor c04-math-0060 as GD in the first round if the sensor node has less than c04-math-0061 bad neighbors. The Chen algorithm's important steps are depicted in Algorithm 1.

Table 4.1 Summary of notations used in the DFD algorithms

Notation Description
c04-math-0017 Set of all the sensors involved in the DFD algorithm
c04-math-0018 Set of neighbors of sensor node c04-math-0019
c04-math-0020 Measurement of c04-math-0021
c04-math-0022 Measurement difference between c04-math-0023 and c04-math-0024
at time c04-math-0025 (i.e., c04-math-0026)
c04-math-0027 Measurement time difference, c04-math-0028
c04-math-0029 Measurement difference between c04-math-0030 and c04-math-0031
from time c04-math-0032 to c04-math-0033 (i.e., c04-math-0034)
c04-math-0035 Test between c04-math-0036 and c04-math-0037, c04-math-0038, c04-math-0039
c04-math-0040 and c04-math-0041 Two predefined thresholds
c04-math-0042 Tendency value of a sensor: c04-math-0043

4.3.2 Fault Detection Algorithm 2: The Ding Algorithm

The second algorithm, which we refer to as the Ding algorithm [91], is a WSN fault detection algorithm with a low computational overhead. The results indicate that the algorithm can identify faulty sensors with relatively high accuracy even when the sensor failure probability is as high as 20%. The Ding algorithm's important steps are shown in Algorithm 2.

algorithm

The algorithm compares the sensor node c04-math-0062's reading with the sensor node's c04-math-0063 neighbors c04-math-0064 with measurements c04-math-0065. The comparison is done by comparing c04-math-0066 with median c04-math-0067 of c04-math-0068, that is,

4.1 equation

If there are c04-math-0070 sensors in total, the algorithm computes c04-math-0071 (i.e., c04-math-0072). The mean c04-math-0073 of c04-math-0074 is given by Ding et al. [91]:

4.2 equation

The standard deviation c04-math-0076 of c04-math-0077 is given as [91]:

4.3 equation

Standardizing the dataset c04-math-0079 yields c04-math-0080, that is:

4.4 equation

If c04-math-0082, then the Ding algorithm detects c04-math-0083 as faulty where c04-math-0084 is a predefined threshold value. Mathematically, c04-math-0085 if c04-math-0086 where c04-math-0087 denotes the set of sensors claimed as faulty.

Section 4.5 discusses the implementation and effectiveness, in particular the detection accuracy, which is used implicitly in our Markov model as the coverage factor c04-math-0088, of these fault detection algorithms.

4.4 Fault-Tolerant Markov Models

In this section, we present our Markov models for FT WSNs. Our Markov models are comprehensive and encompass hierarchically the sensor node, WSN cluster, and WSN levels (Fig. 4.1). We adopt a bottom-up paradigm in our modeling by first developing a sensor node model and determining the model's reliability and MTTF [130]. The sensor node MTTF gives the average sensor node failure rate, which is utilized in the WSN cluster model. The WSN cluster model gives the WSN cluster reliability and MTTF, which determines the WSN cluster average failure rate. The WSN cluster average failure rate is then utilized in the WSN model to determine the WSN reliability and MTTF. Our bottom-up approach enables WSN reliability and MTTF characterization by leveraging sensor node and WSN cluster models. We note that some WSN architectures do not employ cluster-based hierarchy due to the additional energy cost associated with cluster formation and cluster head election. However, our modeling approach is equally applicable for WSN architectures without a cluster-based hierarchy by assuming that the WSN is composed of just one cluster where the cluster head is the sink node. For clarity, Table 4.2 summarizes important notations used in our Markov models.

Table 4.2 Summary of notations used in our Markov models

Notation Description
c04-math-0089 Total number of sensors in a WSN
c04-math-0090 Coverage factor
c04-math-0091 Cumulative sensor failure probability
c04-math-0092 Sensor failure rate
c04-math-0093 Time over which sensor failure rate is specified
c04-math-0094 Probability of being in state c04-math-0095 at time c04-math-0096
c04-math-0097 Reliability of FT (duplex) sensor node
c04-math-0098 Mean time to failure of an FT sensor node
c04-math-0099 Average number of neighbor sensor nodes
c04-math-0100 FT sensor node failure rate with c04-math-0101 neighbors
c04-math-0102 WSN cluster reliability
c04-math-0103 WSN cluster failure rate with c04-math-0104 sensor nodes
c04-math-0105 Number of clusters in the WSN
c04-math-0106 Wsn reliability

4.4.1 Fault-Tolerance Parameters

Fault tolerance in WSNs is highly dependent on fault diagnosis in WSNs (Section 4.2). In this section, we characterize the FT parameters by exploiting the synergy between fault detection and FT in WSNs. The FT parameters leveraged in our Markov model are the coverage factor and sensor failure rate.

4.4.1.1 Coverage Factor

The coverage factor c04-math-0107 is defined as the probability that the faulty active sensor is correctly diagnosed, disconnected, and replaced by a good inactive spare sensor. The c04-math-0108 estimation is critical in an FT WSN model and can be determined by

where c04-math-0110 denotes the accuracy of the fault detection algorithm in diagnosing faulty sensors and c04-math-0111 denotes the probability of an unsuccessful replacement of the identified faulty sensor with the good spare sensor. c04-math-0112 depends on the sensor switching circuitry and is usually a constant, whereas c04-math-0113's estimation is challenging as different fault detection algorithms have different accuracies. We analyze different fault detection algorithms [91, 102, 103, 106] and observe that the accuracy of a fault detection algorithm depends on the average number of sensor node neighbors c04-math-0114 and the cumulative probability of sensor failure c04-math-0115. We model c04-math-0116 with the empirical relation:

where c04-math-0118 is a function of c04-math-0119 and denotes an adjustment parameter that may correspond loosely to the desired average number of neighboring sensor nodes required to achieve a good fault detection accuracy for a given c04-math-0120. We have derived Eq. (4.6) by experimenting with the relationship between the fault detection algorithms' parameters (i.e., the average number of sensor node neighbors), sensor failure probability, and the fault detection algorithms' accuracy. We point out that Eq. (4.6) provides a good estimate of c04-math-0121 in general for any fault detection algorithm, whereas exact c04-math-0122 for a particular fault detection algorithm can be derived from the algorithm's mathematical model. Equation (4.6) approximates the relationship between a fault detection algorithm's parameters to obtain the fault detection algorithm's accuracy (in case the accuracy is unknown). In practice, a fault detection algorithm's accuracy should be determined accurately by the algorithm's designer that would alleviate the need of using an approximation (as in Eq. (4.6)). To clarify further, we point out that our Markov models are independent of c04-math-0123's determination methodology and are equally applicable to any c04-math-0124 value.

4.4.1.2 Sensor Failure Rate

The sensor failure rate can be represented by exponential distribution with a failure rate of c04-math-0125 over the period c04-math-0126 [131]. The exponential distribution, which has a property of constant failure rate, is a good model for the long, flat intrinsic failure portion of the bathtub curve. The exponential distribution is applicable to a variety of practical situations since many embedded components and systems spend most of their lifetimes in the flat portion of the bathtub curve. Furthermore, any failure rate curve can be approximated by piecewise exponential distribution segments patched together. Each exponential distribution segment in the overall approximation specifies a constant failure rate over a small time unit (e.g., daily, weekly, or monthly) that is the average of the failure rates during the respective time duration. The constant failure rate assignment for each exponential distribution segment is justifiable as many natural phenomena have a constant failure rate (or occurrence rate) property (e.g., the arrival rate of cosmic ray alpha particles). The failure rate curve approximation by piecewise exponential distributions is analogous to a curve approximation by piecewise straight line segments.

The exponential model works well for the interarrival times where the total number of events in a given time period is given by the Poisson distribution. When these events trigger failures, the exponential lifetime distribution model naturally applies [132]. The cumulative distribution function (CDF) for the sensors with exponentially distributed failure rate is

where c04-math-0128 denotes the cumulative probability of sensor failure (for simplicity) and c04-math-0129 signifies the time over which c04-math-0130 is specified. Solving Eq. (4.7) for c04-math-0131 gives

4.4.2 Fault-Tolerant Sensor Node Model

As a base case, we describe an NFT sensor node Markov model (Fig. 4.4) containing one sensor (temperature sensor in this case, but the sensor type is arbitrary). The NFT sensor node model consists of two states: state 1 (good state) and state 0 (failed state). The NFT sensor node fails when the node transitions from state 1 to state 0 due to a sensor failure. The differential equations describing the NFT sensor node Markov model are

where c04-math-0134 denotes the probability that the sensor node will be in state c04-math-0135 at time c04-math-0136, and c04-math-0137 represents the first-order derivative of c04-math-0138. c04-math-0139 represents the failure rate of an active temperature sensor.

c04f004

Figure 4.4 A non-FT (NFT) sensor node Markov model

Solving Eq. (4.9) with the initial conditions c04-math-0140 and c04-math-0141 yields

4.10 equation

The reliability of the NFT sensor node is

4.11 equation

The MTTF of the NFT sensor node is

4.12 equation

The average failure rate of the NFT sensor node is

4.13 equation

Since sensors have comparatively higher fault rates than other components [88, 93, 94], we propose an FT duplex sensor node model consisting of one active sensor and one inactive spare sensor. Using TMR [92] for FT is a possible scenario, but we consider a duplex sensor node model to minimize the additional cost as the additive cost of spare sensors can be prohibitive for large WSNs. In addition, a duplex model limits the increase in sensor node size. We point out that our modeling methodology can be extended to sensor nodes with NMR.

In our duplex sensor node, the inactive sensor becomes active only once the active sensor is declared faulty by the fault detection algorithm. We refer to our duplex sensor node as an FT sensor node, whose Markov model is depicted in Fig 4.5. The states in the Markov model represent the number of good sensors. The differential equations describing the FT sensor node Markov model are

where c04-math-0147 denotes the probability that the sensor node will be in state c04-math-0148 at time c04-math-0149 and c04-math-0150 represents the first-order derivative of c04-math-0151. c04-math-0152 represents the failure rate of an active temperature sensor and c04-math-0153 is the rate at which recoverable failure occurs. The probability that the sensor failure cannot be recovered is c04-math-0154 and the rate at which unrecoverable failure occurs is c04-math-0155.

c04f005

Figure 4.5 FT sensor node Markov model [130]

Solving Eq. (4.14) with the initial conditions c04-math-0156, c04-math-0157, and c04-math-0158 yields

4.15 equation

The reliability of the FT sensor node is

where subscript c04-math-0161 in c04-math-0162 stands for duplex. The MTTF of the FT sensor node is

4.17 equation

The average failure rate of the FT sensor node depends on c04-math-0164 (since the fault detection algorithm's accuracy depends on c04-math-0165 (Section 4.4.1)):

where c04-math-0167 and c04-math-0168 denote the failure rate and MTTF of an FT sensor node with c04-math-0169 neighbors.

4.4.3 Fault-Tolerant WSN Cluster Model

A typical WSN consists of many clusters (Fig. 4.1), and we assume for our model that all nodes in a cluster are neighbors to each other (could be one-hop or two-hop neighbors depending on the topology). We note that our model does not impose any specific topology on a sensor network cluster. If the average number of nodes in a cluster is c04-math-0170, then the average number of neighboring nodes per sensor node is c04-math-0171. Figure 4.6 depicts our Markov model for a WSN cluster. We assume that a cluster fails (i.e., fails to perform its assigned application task) if the number of alive (nonfaulty) sensor nodes in the cluster reduces to c04-math-0172. The differential equations describing the WSN cluster Markov model are

4.19 equation

where c04-math-0174, c04-math-0175, and c04-math-0176 represent the FT (duplex) sensor node failure rate (Eq. (4.18)) when the average number of neighboring sensor nodes are c04-math-0177, c04-math-0178, and c04-math-0179, respectively. For mathematical tractability and closed-form solution, we analyze a special (simple) case when c04-math-0180, which reduces the Markov model to three states (Fig. 4.7). The differential equations describing the WSN cluster Markov model when c04-math-0181 are

c04f006

Figure 4.6 WSN cluster Markov model [130]

c04f007

Figure 4.7 WSN cluster Markov model with three states [130]

Solving Eq. (4.20) with the initial conditions c04-math-0183, c04-math-0184, and c04-math-0185 yields

where c04-math-0187 denotes c04-math-0188 in Eq. (4.21) for conciseness. The reliability of the WSN cluster is

4.22 equation

The MTTF of the WSN cluster is

where c04-math-0191 denotes c04-math-0192 in Eq. (4.23) for conciseness. The average failure rate of the cluster c04-math-0193 depends on the average number of nodes in the cluster c04-math-0194 at deployment time:

where c04-math-0196 denotes the MTTF of a WSN cluster of c04-math-0197 sensor nodes.

4.4.4 Fault-Tolerant WSN Model

A typical WSN consists of c04-math-0198 clusters where c04-math-0199 denotes the total number of sensor nodes in the WSN and c04-math-0200 denotes the average number of nodes in a cluster. Figure 4.8 depicts our WSN Markov model. We assume that the WSN fails to perform the WSN's assigned task when the number of alive clusters reduces to c04-math-0201. The differential equations describing the WSN Markov model are

4.25 equation

where c04-math-0203 represents the average cluster failure rate (Eq. (4.24)) when the cluster contains c04-math-0204 sensor nodes at deployment time. For mathematical tractability, we analyze a special (simple) case when c04-math-0205, which reduces the Markov model to three states. The differential equations describing the WSN Markov model when c04-math-0206 are

Solving Eq. (4.26) with the initial conditions c04-math-0208, c04-math-0209, and c04-math-0210 yields

4.27 equation

The WSN reliability is

4.28 equation

The WSN MTTF when c04-math-0213 is

4.29 equation

Discussion: Our hierarchical Markov modeling exploits the relationship between failure rates at the sensor component, sensor node, WSN cluster, and WSN levels. Markov modeling of sensor nodes (Section 4.4.2) takes the failure rate of the sensor node's components c04-math-0215 (e.g., temperature sensor) as input and then derives the failure rate of the sensor node c04-math-0216 (for simplicity, the failure rate of one component is considered for illustration; however, the failure rates of different components can be considered or combined into a single representative failure rate).

c04f008

Figure 4.8 WSN Markov model [130]

For an FT sensor node model (e.g., duplex sensor node or triple modular redundant sensor node), the average failure rate of the FT sensor node implicitly depends on c04-math-0217 since the fault detection algorithm's accuracy depends on c04-math-0218 (Sections 4.3 and 4.4.1.1). Hence, the failure rate of an FT sensor node is calculated for different c04-math-0219 values in our numerical results (Section 4.6). The WSN cluster Markov model relies on the failure rates calculated from the FT sensor node Markov model for different values of c04-math-0220. For example, for a cluster with c04-math-0221 sensor nodes (Fig. 4.6) and assuming each sensor node has all other sensor nodes in the cluster as neighbors, the number of neighboring sensor nodes for a given sensor node is c04-math-0222. Thus, the failure rate of an FT sensor node is calculated from the sensor node Markov model with c04-math-0223, which is then used in the WSN cluster Markov model (Fig. 4.6). When a sensor node fails, the remaining operational (healthy) neighboring nodes of a given sensor node in a cluster become c04-math-0224; thus, the failure rate of an FT sensor node is calculated from the sensor node's Markov model with c04-math-0225 and so on. The cluster becomes nonoperational when the number of operational sensor nodes in the cluster reaches a minimum c04-math-0226 and consequently the overall failure rate of a WSN cluster with c04-math-0227 sensor nodes is calculated for that c04-math-0228.

The WSN cluster failure rate with c04-math-0229 sensor nodes obtained from the WSN cluster Markov model is then used in the WSN Markov model (Section 4.4.4). The WSN is assumed to be consisting of c04-math-0230 clusters and WSN becomes nonoperational when the number of alive (operational) clusters reduces to c04-math-0231. The WSN Markov model is then used to calculate the MTTF and reliability of a given WSN using the cluster failure rate calculated from the WSN cluster Markov model.

4.5 Simulation of Distributed Fault Detection Algorithms

In this section, we discuss our implementation of two DFD algorithms: the Chen and Ding algorithms (Section 4.3). Our analysis is challenging since it is difficult to compare fault detection algorithms without a standard benchmark (Section 4.7.2). Since different DFD algorithms can use different fault models, a comparison between DFD algorithms can be made by observing the performance of the algorithms under similar fault models and similar topologies.

4.5.1 Using nsc04-math-02322 to Simulate Faulty Sensors

In prior work, the Chen and Ding algorithms were simulated strictly algorithmically using simplified fault models that approximated a Poisson process. To compare these two models as realistically as possible, we use nsc04-math-02332 [101], which is a widely used discrete event simulator for both wired and wireless networks, and a custom fault model. nsc04-math-02342 allows us to see how these algorithms perform atop a detailed network stack over a wireless channel. We have leveraged nsc04-math-02352 instead of nsc04-math-02363 as nsc04-math-02372 provides better support for WSN and mobile ad hoc network (MANET) protocols [133].

Wireless entities in nsc04-math-02382 are called nodes and have modules called agents representing network layers stacked atop the nodes. Agents are typically used to generate application traffic, but we use agents as a fault detection unit in a sensor node (Fig. 4.9). The agent first receives scheduled, scripted commands to read values from a SensorNodeData module. The SensorNodeData module generates random sensor readings based on our implemented fault models.

c04f009

Figure 4.9 The nsc04-math-02392-based simulation architecture

After recording a data sample, the sensor node broadcasts the data as a packet and builds a record of each broadcasted packet that the sensor node receives from other sensor nodes. After a listening period, a separately scheduled event commands the detection agent to apply the two fault detection algorithms (i.e., the Ding and Chen algorithms). A fault tracker unit records statistics of detected and undetected errors. In the case of the Chen algorithm, each round of tendency broadcasting occurs in two-second intervals. The sensor readings are also broadcasted similarly in two-second intervals.

A duplex configuration was implemented for permanent errors. When a sensor node passes a threshold of detected errors, the sensor “fails” and is replaced by a spare sensor. For the simplified fault model, this replacement helps little because Poisson events are memoryless, and a sensor node with a new sensor and a sensor node with an old sensor will have the same error rate. However, for realistic data, constant errors are permanent and are repaired when the sensor is replaced. The realistic data usage also results in a penalty of false positives and causes more sensor nodes to be identified as faulty. The following sections describe our experiments with both simulated and real-world data.

4.5.2 Experimental Setup for Simulated Data

To compare the performance of each algorithm, it is useful to observe each algorithm using a simplified fault model; thus, we consider the scenario used by Chen et al. [90]. Each sensor records a uniformly random temperature between 70c04-math-0240 and 75c04-math-0241. Errors occur as a Poisson process, with an arbitrary, scalable mean, and corrupt a single sensor on a single sensor node during the event. A corrupted sample is randomly distributed between 100c04-math-0242 and 105c04-math-0243. We simulate each sensor node for 4 h and take samples every 30 s.

Fig. 4.10(a) and (b) depicts the detection performance of the Chen algorithm across different parameters and error rates. We observe that for the Chen algorithm, the error rate affects detection performance significantly; however, the c04-math-0244 value has a relatively minor impact. Fig. 4.11(a) and (b) depicts the detection performance of the Ding algorithm across different parameters and error rates. For the Ding algorithm, as c04-math-0245 is scaled, the number of false positives scales proportionately. For both algorithms, low error rates improve detection accuracy as low error rates make errors more distinguishable. In the simulated data scenario, the Chen algorithm performs better than the Ding algorithm in terms of the percentage of false positives, and the detection accuracy is relatively less dependent on c04-math-0246. However, this simulated scenario is very specific and consistent, which may not always be the case.

c04f010

Figure 4.10 Effectiveness and false positive rate of the Chen algorithm: (a) error detection accuracy for the Chen algorithm; (b) false positive rate of Chen algorithm

c04f011

Figure 4.11 Effectiveness and false positive rate for the Ding algorithm: (a) error detection accuracy for the Ding algorithm; (b) false positive rate for the Ding algorithm

4.5.3 Experiments Using Real-World Data

Since transient faults are manifested in different ways, a more realistic error model that closely follows a real-world scenario is desirable for analysis. For our real-world data case study, we leverage publicly available sensor data from the Intel Berkeley Research Laboratory [134]. The data comes from an indoor sensor network deployment that measures temperature, humidity, and other metrics. Using temperature data from 53 sensor nodes, we have produced a variety of empirical observations on which we have based our simulation.

We note a few empirical observations from the Intel Berkeley data. One observation is that the real-world data includes noise. Noise in sensor readings is correlated with past values (autocorrelated) and can have high power levels for certain time periods. To measure the effect of noise on sensor data, we apply a denoising filter available in the MATLAB Wavelet toolkit to the Intel Berkeley data to approximate a reference “noise-free” signal. We measure the power level of noise in the data (Fig. 4.12) and record the values in a text file. We play back the recorded noise power level values during simulation, and multiply the noise values with simulated white Gaussian noise to imitate the bursty behavior of noisy faults.

c04f012

Figure 4.12 Noise power levels in the Intel Berkeley sample

Another observation from the Intel Berkeley data is that for nearly all sensor nodes, sensor samples level off as they approach their end-of-life, resulting in a “constant” fault. A histogram of the times at which this occurs (Fig. 4.13) reveals that the distribution of sensor node lifetimes fits well into a Student's c04-math-0247-distribution, which is supported by MATLAB's distribution fitting tools. For our experiments, we have recorded a set of several thousand random numbers fitting the Student's c04-math-0248-distribution. The sensor nodes then randomly select a value from the recorded c04-math-0249-distribution values to select a time for the fault occurrence.

c04f013

Figure 4.13 Distribution of constant error occurrences

Finally, we observe that voltage spikes are not frequent in the Intel Berkeley data set, occurring in only 6 out of 53 nodes. Because of the rare occurrence of voltage spikes, our simulations inject a spike once during the lifetime of each sensor node. The time of the spike occurrence is modeled using a uniform process. The base signal is a sine wave with a period of 2 h between 20 and 30 c04-math-0250C. To allow minor variations in temperature readings sensed by sensor nodes, the temperature values are treated as being generated from a radiating heat source. The nodes furthest from the center have a lower temperature proportional to the inverse of the distance squared as in the free space path loss model. The noise and spike errors are added to the signal, and the constant fault causes the signal to hang at the signal's last valid value during an error. A signal is considered erroneous if the variation is more than half a degree from the signal's true value. In our experiments, the faulty signals are never repaired unless faults accumulate, and ultimately the sensor is replaced if diagnosed as faulty.

Figure 4.14 shows the performance of the Chen algorithm for the real-world data. Despite detecting over 90% of the faults in the simplified model, the algorithm fails to perform well with the real-world data, detecting fewer than 50% of erroneous readings. The poor performance of the Chen algorithm on the real-world data is due to the nature of the algorithm. For an error to be detected, the error must be not only disruptive, causing a node to have different values from the node's neighbors, but also transient and have a sharp transition boundary. Since noise events last for multiple samples, the Chen algorithm is not equipped to handle these errors.

c04f014

Figure 4.14 Error detection and false positive rate for the Chen algorithm using real-world data

The Ding algorithm's performance with real-world data is shown in Fig. 4.15. Similar to the Chen algorithm, the Ding algorithm's performance degrades as compared to the previous simplified model environment. However, the Ding algorithm still detects a majority of occurring faults. As with the previous simplified model experiment, the Ding algorithm is highly dependent on the algorithm's parameters. The higher the detection rate is, the higher the rate of false positives is for the Ding algorithm; however, the false positive rate remains low, peaking at 20%.

c04f015

Figure 4.15 Error detection and false positive rate for the Ding algorithm using real-world data

4.6 Numerical Results

In this section, we present reliability and MTTF results for our Markov models (Section 4.4) implemented in the SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator) Software Package [135, 136]. SHARPE provides MTTF results directly based on our models' implementations; however, SHARPE does not provide reliability results directly, but reliability results can be calculated from state probabilities. We present example reliability calculations as well as detailed reliability and MTTF results for an NFT as well as an FT sensor node, WSN cluster, and WSN using our bottom-up Markov modeling paradigm.

4.6.1 Experimental Setup

In this section, we describe our FT sensor node, WSN cluster, and WSN model implementation in the SHARPE Software Package [135, 136]. We also present a typical fault detection algorithm's accuracy for different cumulative sensor failure probability c04-math-0251 values. Owing to SHARPE limitations that take only exponential polynomials, we assume our sensor failure rate to be exponentially distributed (Section 4.4.1.2).

SHARPE is a software tool for performance, reliability, and performability model specification and analysis. The SHARPE toolkit provides a specification language and solution methods for commonly used performance, reliability, and performability model types. Supported models of SHARPE include combinatorial models, state-space (e.g., Markov and semi-Markov reward models), and stochastic Petri nets. SHARPE allows computation of steady-state, transient, and interval measures. SHARPE allows output measures of a model to be used as parameters of other models to facilitate the hierarchical combination of different model types.

Our Markov model exploits the synergy of fault detection and fault tolerance for WSNs. Table 4.3 depicts c04-math-0252 values estimated for a DFD algorithm using Eq. (4.6) for different c04-math-0253 and c04-math-0254 values. These estimated c04-math-0255 values approximate the accuracy of the DFD algorithms proposed in [91, 102, 103, 106]. We note that any inaccuracy in the c04-math-0256 estimation does not affect our calculations of results because these calculations leverage c04-math-0257 values but are independent of whether these values reflect accurately a particular DFD algorithm's accuracy. We assume c04-math-0258 in Eq. (4.5) (i.e., once a faulty sensor is identified, the faulty sensor is replaced by a good spare sensor perfectly), which gives c04-math-0259 in Eq. (4.5).

Table 4.3 Estimated values for a fault detection algorithm's accuracy

c04-math-0260 c04-math-0261 c04-math-0262
5 6 7 10 15
0.05 25 0.979 1 1 1 1
0.1 50 0.858 0.895 0.921 0.957 0.96
0.2 56 0.755 0.717 0.81 0.845 0.851
0.3 65 0.652 0.679 0.699 0.732 0.742
0.4 66 0.558 0.5813 0.599 0.627 0.636
0.5 67 0.464 0.484 0.498 0.522 0.53
0.6 68 0.371 0.386 0.398 0.417 0.424
0.7 69 0.278 0.29 0.298 0.313 0.318
0.8 70 0.185 0.193 0.198 0.208 0.212
0.9 71 0.092 0.096 0.099 0.104 0.106
0.99 72 0.0092 0.00962 0.0099 0.0104 0.0106

4.6.2 Reliability and MTTF for an NFT and an FT Sensor Node

In this section, we present the reliability and MTTF results for an NFT and an FT sensor node for the two cases: c04-math-0263 and c04-math-0264, when c04-math-0265 and c04-math-0266. The case c04-math-0267 corresponds to a real-world fault detection algorithm (typical values are shown in Table 4.3), whereas the case c04-math-0268 corresponds to an ideal fault detection algorithm, which detects faulty sensors perfectly for all c04-math-0269 values. We calculate reliability at c04-math-0270 days since we assume c04-math-0271 days [137] in Eq. (4.8); however, we point out that reliability can be calculated for any other time value using our Markov models.

For an NFT sensor node reliability calculation, we require the sensor failure rate c04-math-0272, which can be calculated using Eq. (4.8) (i.e., c04-math-0273 failures/day). SHARPE gives c04-math-0274 and sensor node reliability c04-math-0275. Evaluating c04-math-0276 at c04-math-0277 gives c04-math-0278 = 0.94999.

For an FT sensor node when c04-math-0279, different reliability results are obtained for different c04-math-0280 because the fault detection algorithm's accuracy and coverage factor c04-math-0281 depend on c04-math-0282. For c04-math-0283, c04-math-0284 (Table 4.3), SHARPE gives c04-math-0285 = c04-math-0286 and c04-math-0287 = c04-math-0288. The reliability c04-math-0289 = c04-math-0290, which when evaluated at c04-math-0291 = 100 gives c04-math-0292 = 0.99770. For an FT sensor node when c04-math-0293 for all c04-math-0294, SHARPE gives c04-math-0295 and c04-math-0296. Using Eq. (4.16), the reliability c04-math-0297 = 0.99872.

Table 4.4 shows the reliability for an NFT and an FT sensor node evaluated at c04-math-0298 days for c04-math-0299 values of 5, 10, and 15. As expected, the results show that reliability decreases for both NFT and FT sensor nodes as c04-math-0300 increases, and the reliability attained by an FT sensor node (both for c04-math-0301 and c04-math-0302) is always better than an NFT sensor node. For example, the percentage reliability improvement achieved by an FT sensor node with c04-math-0303, c04-math-0304 over an NFT sensor node is 18.98% when c04-math-0305. However, an FT sensor node with c04-math-0306 outperforms both an FT sensor node with c04-math-0307 and an NFT sensor node for all c04-math-0308 values. For example, the percentage improvement in reliability for an FT sensor node with c04-math-0309 over an NFT sensor node is 69.09%, and the percentage improvements in reliability for an FT sensor node with c04-math-0310 over an FT sensor node (c04-math-0311) with c04-math-0312 equal to 5, 10, and 15 are 28.11%, 24.32%, and 23.82%, respectively, for c04-math-0313. The percentage improvement in reliability attained by an FT sensor node with c04-math-0314 over an NFT sensor node is 230%, and the percentage improvements in reliability for an FT sensor node with c04-math-0315 over an FT sensor node (c04-math-0316) with c04-math-0317 equal to 5, 10, and 15 are 172.36%, 166.31%, and 165.32%, respectively, for c04-math-0318. These results reveal that the percentage improvement in reliability attained by an FT sensor node with c04-math-0319 increases as c04-math-0320 because for an FT sensor node with c04-math-0321, c04-math-0322 decreases as c04-math-0323 (Table 4.3). The sensor node reliability analysis reveals that a robust fault detection algorithm with c04-math-0324 for all c04-math-0325 and c04-math-0326 values is necessary to attain good reliability for an FT sensor node.

Table 4.4 Reliability for an NFT and an FT sensor node

Probability NFT FT FT FT FT
c04-math-0327 c04-math-0328 c04-math-0329 c04-math-0330 c04-math-0331 c04-math-0332
0.05 0.94999 0.99770 0.99872 0.99872 0.99872
0.1 0.89996 0.98135 0.99074 0.99102 0.99482
0.2 0.80011 0.93481 0.95088 0.95195 0.97853
0.3 0.69977 0.86265 0.88263 0.88513 0.94959
0.4 0.59989 0.77094 0.79209 0.79485 0.90643
0.5 0.5007 0.66086 0.68097 0.68374 0.84662
0.6 0.40012 0.53610 0.55295 0.55552 0.76663
0.7 0.30119 0.40167 0.41432 0.41612 0.66262
0.8 0.19989 0.25943 0.26683 0.26812 0.52171
0.9 0.10026 0.12148 0.12424 0.12470 0.33086
0.99 0.01005 0.01048 0.01053 0.01056 0.05628

Figure 4.16 depicts the MTTF for an NFT and an FT sensor node for c04-math-0333 values of 5, 10, and 15 versus the sensor failure probability c04-math-0334 when c04-math-0335 in Eq. (4.7) is 100 days [91, 106]. The results show that the MTTF for an FT sensor node improves with increasing c04-math-0336. However, the MTTF shows negligible improvement when c04-math-0337 over c04-math-0338 as the fault detection algorithm's accuracy improvement gradient (slope) decreases for large c04-math-0339 values.

c04f016

Figure 4.16 MTTF in days for an NFT and an FT sensor node [130]

Figure 4.16 also compares the MTTF for an FT sensor node when c04-math-0340. When c04-math-0341 for existing fault detection algorithms, comparison with c04-math-0342 provides insight into how the fault detection algorithm's accuracy affects the sensor node's MTTF. Figure 4.16 shows that the MTTF for an FT sensor node with c04-math-0343 is always greater than an FT sensor node with c04-math-0344.

Figure 4.16 shows that the MTTF for an NFT and an FT sensor node decreases as c04-math-0345 increases; however, the FT sensor node maintains better MTTF than the NFT sensor node for all c04-math-0346 values. This observation reveals that a sensor with a lower failure probability achieves better MTTF for both the NFT and FT sensor nodes. Figure 4.16 also shows that the MTTF for both the NFT and the FT sensor nodes approaches zero as c04-math-0347 approaches 1 (i.e., MTTF c04-math-0348 0 c04-math-0349 1). This observation is intuitive because a faulty sensor (with c04-math-0350) is unreliable and leads to a failed sensor node with zero MTTF and suggests that depending on the application's reliability and MTTF requirements, the faulty sensor should be replaced before c04-math-0351 approaches 1.

Table 4.5 depicts the percentage MTTF improvement gained by an FT sensor node over an NFT sensor node for different values of c04-math-0352. We calculate the percentage improvement as c04-math-0353. The table shows that the MTTF percentage improvement for an FT sensor node decreases as c04-math-0354 increases when c04-math-0355. The percentage MTTF improvement for an FT sensor node with c04-math-0356 and c04-math-0357 are 85.75% and 95.65%, respectively, for c04-math-0358 and drops to 0.875% and 1.34%, respectively, for c04-math-0359. This observation reveals that having more neighboring sensor nodes (a higher c04-math-0360 value) improves MTTF because, in general, a fault detection algorithm's accuracy and c04-math-0361 improve with increasing c04-math-0362. Table 4.5 also shows that the MTTF percentage improvement for an FT sensor node over an NFT sensor node is 100% on average when c04-math-0363, thus highlighting the importance of a robust fault detection algorithm.

Table 4.5 Percentage MTTF improvement for an FT sensor node as compared to an NFT sensor node

Probability FT FT FT
c04-math-0364 c04-math-0365 c04-math-0366 c04-math-0367
0.1 85.75 95.65 100
0.2 75.61 84.63 100
0.3 65.14 72.99 100
0.5 46.25 52.49 100
0.7 27.62 31.23 100
0.9 9.37 10.52 100
0.99 0.875 1.34 100

Model Validation: We compare the MTTF results for an NFT sensor node obtained using our Markov model with other existing models in the literature as well as practical sensor node implementations to provide insights into the accuracy of our models. We note that a detailed comparison and verification of our model with existing models is not possible because there are no similar/related models that leverage FT constructs. Jung et al. [71] modeled the lifetime of trigger-driven sensor nodes with different event arrival rates (trigger-driven sensor nodes perform processing based on events as opposed to duty-cycle-based sensor nodes that perform processing based on a duty cycle), which can be interpreted as sensing events that depend on the sensor node's sensing frequency. Results were obtained for three trigger-driven sensor node platforms: XSM, MICAz, and Telos. XSM and MICAz consisted of an ATmega128L 8-bit processor with a maximum processor frequency of 16 MHz and a maximum performance of 16 million instructions per second (MIPS). Telos consisted of a TI MSP430 16-bit processor with a maximum processor frequency of 8 MHz and a maximum performance of 16 MIPS. Results revealed that Telos, XSM, and MICAz delivered lifetimes of 1500, 1400, and 600 days, respectively, with an event arrival rate of 0.1 events/h, which dropped to 350, 100, and 0 days, respectively, with an event arrival rate of 100 events/h. The lifetime values for an NFT sensor node obtained from our Markov model ranges from 949 to 22 days for different sensor failure probabilities. Depending on the sensor node energy consumption, the lifetime model proposed in [70] estimated that the sensor node lifetime varied from 25 to 215 days. These comparisons indicate that our Markov model yields sensor node lifetime values in the range obtained from other existing models.

We also compare the NFT sensor node lifetime estimation from our model with an experimental study on sensor node lifetimes [74]. This comparison verifies conformity of our model with real measurements. For example, with a sensor node battery capacity of 2500 mA h, experiments indicate a sensor node lifetime ranging from 72 to 95 h for a 100% duty cycle for different battery brands (e.g., Ansmann, Panasonic Industrial, Varta High Energy, Panasonic Extreme Power) [74]. The sensor node lifetime for a duty cycle of 18% can be estimated as 95/0.18 = 528 h c04-math-0368 22 days and for a duty cycle of 0.42% as 95/0.0042 = 22,619 h c04-math-0369 942 days. This comparison reveals that our NFT sensor node lifetime model estimates lifetime values in the range obtained by experimental sensor node implementations.

4.6.3 Reliability and MTTF for an NFT and an FT WSN Cluster

In this section, we present the reliability and MTTF results for two WSN clusters that contain on average c04-math-0370 and c04-math-0371 sensor nodes at deployment time (c04-math-0372). The selection of two different c04-math-0373 values provides insight into how the size of the cluster affects reliability and MTTF (other c04-math-0374 and c04-math-0375 values depicted similar trends). The NFT WSN cluster consists of NFT sensor nodes and the FT WSN cluster consists of FT sensor nodes. For brevity, we present example WSN cluster reliability calculations for c04-math-0376 (c04-math-0377) and c04-math-0378.

Leveraging our bottom-up approach model, the NFT WSN cluster uses the failure rate c04-math-0379 of an NFT sensor node. By using Eq. (4.8), c04-math-0380 = c04-math-0381 failures/day. Since the coverage factor c04-math-0382 does not appear in the reliability (and MTTF) calculation for an NFT sensor node, the failure rate c04-math-0383 = c04-math-0384 = c04-math-0385 = c04-math-0386 failures/day. In other words, for an NFT sensor node, c04-math-0387 = c04-math-0388, c04-math-0389, and c04-math-0390 = c04-math-0391, which denote the sensor node failure rate when the average number of neighboring sensor nodes are 4 and 5, respectively. The NFT WSN cluster reliability calculation utilizes c04-math-0392 and c04-math-0393, which gives c04-math-0394 = c04-math-0395 and c04-math-0396 = c04-math-0397. The reliability c04-math-0398 = c04-math-0399 + c04-math-0400, which when evaluated at c04-math-0401 days gives c04-math-0402.

For an FT WSN cluster when c04-math-0403, the reliability calculation requires c04-math-0404 and c04-math-0405, which depends on c04-math-0406 as the fault detection algorithm's accuracy and c04-math-0407 depends on c04-math-0408, yielding different reliability results for different c04-math-0409 values. Using Table 4.3 and Eq. (4.18), c04-math-0410 = c04-math-0411 = c04-math-0412 = c04-math-0413 = c04-math-0414 failures/day and c04-math-0415 = c04-math-0416 = c04-math-0417 = c04-math-0418 = c04-math-0419 failures/day. SHARPE gives c04-math-0420 = c04-math-0421 and c04-math-0422. The reliability of the FT WSN cluster is calculated as c04-math-0423 = 0.95969.

For an FT WSN cluster when c04-math-0424, the reliability calculation does not depend on c04-math-0425, which gives c04-math-0426 failures/day and c04-math-0427 failures/day. SHARPE gives c04-math-0428 and c04-math-0429, from which we calculate c04-math-0430.

Table 4.6 shows the reliability for an NFT and an FT WSN cluster (for c04-math-0431 and c04-math-0432) evaluated at c04-math-0433 days when c04-math-0434 (c04-math-0435). We observe similar trends as with sensor node reliability (Table 4.4) where the reliability of both NFT and FT WSN clusters decreases as c04-math-0436 increases (i.e., reliability c04-math-0437) because decreased individual sensor node reliability decreases the overall WSN cluster reliability. Table 4.6 depicts that an FT WSN cluster with c04-math-0438 outperforms an FT WSN cluster with c04-math-0439 and an NFT WSN cluster for all c04-math-0440 values. For example, the percentage improvement in reliability for an FT WSN cluster with c04-math-0441 over an NFT WSN cluster and an FT WSN cluster with c04-math-0442 is 77.33% and 12.02% for c04-math-0443 and 601.12% and 142.73% for c04-math-0444, respectively. These results show that the percentage improvement in reliability attained by an FT WSN cluster increases as c04-math-0445 increases because a fault detection algorithm's accuracy and c04-math-0446 decrease as c04-math-0447 increases (Table 4.3). This trend is similar to the percentage improvement in reliability for an FT sensor node (Section 4.6.2). The results show that an FT WSN cluster always performs better than an NFT WSN cluster. For example, the percentage improvement in reliability for an FT WSN cluster with c04-math-0448 over an NFT WSN cluster is 58.31% for c04-math-0449.

Table 4.6 Reliability for an NFT and an FT WSN cluster when c04-math-0450 (c04-math-0451)

c04-math-0452 NFT FT c04-math-0453 FT c04-math-0454
0.05 0.96720 0.99058 0.99098
0.1 0.88562 0.95969 0.96552
0.2 0.65567 0.84304 0.87474
0.3 0.41968 0.66438 0.74422
0.4 0.23309 0.45925 0.59388
0.5 0.10942 0.26595 0.43653
0.6 0.04098 0.11837 0.28732
0.7 0.01114 0.03548 0.16279
0.8 c04-math-0455 c04-math-0456 0.06723
0.9 c04-math-0457 c04-math-0458 0.01772
0.99 c04-math-0459 c04-math-0460 c04-math-0461

Figure 4.17 depicts the MTTF for an NFT WSN cluster and an FT WSN cluster versus c04-math-0462 when c04-math-0463 for average cluster sizes of c04-math-0464 and c04-math-0465 sensor nodes, respectively, at deployment time. The figure reveals that the FT WSN cluster's MTTF is considerably greater than the NFT WSN cluster's MTTF for both cluster sizes. Figure 4.17 also compares the MTTF for an FT WSN cluster when c04-math-0466 with c04-math-0467 and shows that the MTTF for an FT WSN cluster with c04-math-0468 is always better than an FT WSN cluster with c04-math-0469. This observation again verifies the significance of a robust (ideal) fault detection algorithm, which can ideally provide c04-math-0470 values close to 1 (i.e., c04-math-0471 for all c04-math-0472 values). We note that both an NFT and an FT WSN clusters with c04-math-0473 have redundant sensor nodes and can inherently tolerate c04-math-0474 sensor node failures. The WSN cluster with c04-math-0475 has more redundant sensor nodes than the WSN cluster with c04-math-0476, resulting in a comparatively greater MTTF. Figure 4.17 shows that the MTTF for both an NFT and an FT WSN clusters approaches zero as c04-math-0477 approaches 1 and further solidifies the necessity of low failure probability sensors to achieve better MTTF in WSN clusters. The MTTF variation for a WSN cluster with varying c04-math-0478 is similar to the MTTF variation for a sensor node (Fig. 4.16) because a WSN cluster is composed of sensor nodes and reflects the MTTF variation of WSN cluster's constituent sensor nodes.

c04f017

Figure 4.17 MTTF in days for an NFT WSN cluster and an FT WSN cluster with c04-math-0479 [130]

Table 4.7 shows the percentage MTTF improvement for an FT WSN cluster as compared to an NFT WSN cluster for cluster sizes of c04-math-0485 and c04-math-0486 sensor nodes. The percentage improvements are calculated separately for the two cluster sizes and we compare the MTTFs of clusters of the same size. We observe that the MTTF improvement for an FT cluster is slightly greater when c04-math-0487 as compared to when c04-math-0488; however, the MTTF improvement decreases with increasing c04-math-0489 when c04-math-0490. This observation reveals that a WSN cluster consisting of more sensor nodes can achieve better MTTF improvements when compared to a WSN cluster containing fewer sensor nodes. The MTTF percentage improvement for an FT WSN cluster with c04-math-0491, c04-math-0492, is 83.04% for c04-math-0493 and drops to 2.26% for c04-math-0494. Similarly, the percentage MTTF improvement for an FT WSN cluster with c04-math-0495, c04-math-0496, is 87.55% for c04-math-0497 and drops to 2.47% for c04-math-0498. The percentage MTTF improvement for both cluster sizes is 100% on average when c04-math-0499 (results are not shown for c04-math-0500 when c04-math-0501 in Table 4.7 for brevity). We observed that the MTTF percentage improvement for an FT WSN cluster with c04-math-0502 over c04-math-0503 is 103.35% on average. Intuitively, these results confirm that clusters with more redundant sensor nodes are more reliable and have a better MTTF than clusters with fewer redundant sensor nodes.

Table 4.7 Percentage MTTF improvement for an FT WSN cluster as compared to an NFT WSN cluster (c04-math-0480)

Probability FT FT FT
c04-math-0481 c04-math-0482 c04-math-0483 c04-math-0484
0.1 83.04 87.55 100
0.2 73.84 77.24 100
0.3 63.58 66.51 100
0.5 44.97 47.22 100
0.7 26.4 28.71 100
0.9 9.81 9.57 100
0.99 2.26 2.47 100

Since SHARPE provides information on MTTF directly from Markov models whereas reliability calculation requires manual substitution in state transition probabilities, we present iso-MTTF results for WSN clusters. Table 4.8 depicts the MTTFs for an FT WSN cluster and an NFT WSN cluster for c04-math-0504 varying from 0.1 to 0.99 and when c04-math-0505 = 4. Results indicate that an NFT WSN cluster requires three redundant NFT sensor nodes to achieve a comparable MTTF as that of an FT WSN cluster.

Table 4.8 Iso-MTTF for WSN clusters (c04-math-0506). c04-math-0507 denotes the redundant sensor nodes required by an NFT WSN cluster to achieve a comparable MTTF as that of an FT WSN cluster

Probability MTTF FT (days) MTTF NFT (days) NFT
c04-math-0508 c04-math-0509 c04-math-0510 c04-math-0511
0.1 695.89 707.43 3
0.3 205.3 208.86 3
0.5 105.97 107.6 3
0.7 61.213 62.136 3
0.99 15.942 16.209 3

4.6.4 Reliability and MTTF for an NFT and an FT WSN

This section presents the reliability and MTTF results for an NFT WSN and an FT WSN containing c04-math-0512 and c04-math-0513 clusters at deployment time (c04-math-0514). We consider WSNs containing different number of clusters to provide an insight into how the number of clusters affects WSN reliability and MTTF (other c04-math-0515 and c04-math-0516 values depicted similar trends). The FT WSN contains FT sensor nodes, and the NFT WSN contains NFT sensor nodes. We present example WSN reliability calculations for c04-math-0517 (c04-math-0518, i.e., each WSN fails when there are no more active clusters) and c04-math-0519 = 0.2. We assume that both WSNs contain clusters with c04-math-0520 (c04-math-0521) = 9 sensor nodes on average with a cluster failure rate c04-math-0522.

The reliability calculation for an NFT WSN requires the NFT WSN cluster failure rate c04-math-0523, which can be calculated using Eq. (4.24) with c04-math-0524 (i.e., c04-math-0525 = c04-math-0526 failures/day). Using c04-math-0527, SHARPE gives c04-math-0528 and c04-math-0529. The WSN reliability c04-math-0530 when evaluated at c04-math-0531 days gives c04-math-0532.

For an FT WSN when c04-math-0533, the reliability calculation requires the FT WSN cluster failure rate c04-math-0534 (for c04-math-0535) (i.e., c04-math-0536 failures/day). Using c04-math-0537, SHARPE gives c04-math-0538, c04-math-0539. The WSN reliability c04-math-0540.

For an FT WSN cluster when c04-math-0541, the reliability calculation requires the FT WSN cluster failure rate c04-math-0542 (for c04-math-0543) (i.e., c04-math-0544 failures/day). Using c04-math-0545, SHARPE gives c04-math-0546, c04-math-0547, which gives c04-math-0548.

Table 4.9 shows the reliability for an NFT WSN and an FT WSN evaluated at c04-math-0556 days when c04-math-0557 (c04-math-0558) for clusters with nine sensor nodes on average. We observe similar trends as with sensor node reliability (Table 4.4) and WSN cluster reliability (Table 4.6) where reliability for both an NFT WSN and an FT WSN decreases as c04-math-0559 increases (i.e., reliability c04-math-0560) because a WSN contains clusters of sensor nodes and decreased individual sensor node reliability, with increasing c04-math-0561 decreases both WSN cluster and WSN reliability. Table 4.9 shows that an FT WSN with c04-math-0562 outperforms an FT WSN with c04-math-0563 and an NFT WSN for all c04-math-0564 values. For example, the percentage improvement in reliability for an FT WSN with c04-math-0565 over an NFT WSN and an FT WSN with c04-math-0566 is 5.1% and 0.51% for c04-math-0567 and 350.18% and 236.22% for c04-math-0568, respectively. These results show that the percentage improvement in reliability attained by an FT WSN increases as c04-math-0569 increases because the fault detection algorithm's accuracy and c04-math-0570 decreases as c04-math-0571 increases (Table 4.3). This trend is similar to the percentage improvement in reliability for an FT sensor node (Section 4.6.2) and an FT WSN cluster (Section 4.6.3). The results show that an FT WSN always performs better than NFT WSN. For example, the percentage improvement in reliability for an FT WSN with c04-math-0572 over an NFT WSN is 4.57% for c04-math-0573.

Table 4.9 Reliability for an NFT WSN and an FT WSN when c04-math-0549 (c04-math-0550)

c04-math-0551 NFT FT c04-math-0552 FT c04-math-0553
0.05 0.99557 0.99883 0.99885
0.1 0.98261 0.99474 0.99534
0.2 0.93321 0.97583 0.98084
0.3 0.85557 0.93775 0.95482
0.4 0.75408 0.87466 0.91611
0.5 0.63536 0.78202 0.86218
0.6 0.51166 0.65121 0.78948
0.7 0.36303 0.49093 0.69527
0.8 0.20933 0.30328 0.55494
0.9 0.08807 0.11792 0.39647
0.99 c04-math-0554 c04-math-0555 0.08807

Figure 4.18 depicts the MTTF for an NFT WSN and an FT WSN containing on average c04-math-0574 and c04-math-0575 clusters (c04-math-0576) at deployment time. The figure reveals that an FT WSN improves the MTTF considerably over an NFT WSN for both number of clusters. Figure 4.18 also shows that the MTTF for an FT WSN when c04-math-0577 is always greater than the MTTF for an FT WSN when c04-math-0578. We observe that since the MTTF for an FT WSN drops nearly to that of an NFT WSN as c04-math-0579, building a more reliable FT WSN requires low failure probability sensors. This WSN MTTF variation with c04-math-0580 follows similar trends as observed in WSN clusters (Fig. 4.17) and sensor nodes (Fig. 4.16). Similar MTTF variations for a WSN, WSN cluster, and a sensor node results from our bottom-up modeling approach (Section 4.4), where each hierarchical level captures the MTTF variation trends of lower levels. We observe that the MTTF for a WSN with c04-math-0581 is always greater than the MTTF for a WSN with c04-math-0582. This observation is intuitive because WSNs with c04-math-0583 have more redundant WSN clusters (and sensor nodes) and can survive more cluster failures before reaching the failed state (c04-math-0584) as compared to a WSN with c04-math-0585.

c04f018

Figure 4.18 MTTF in days for an NFT WSN and an FT WSN with c04-math-0586 [130]

We observe the percentage MTTF improvement for an FT WSN over an NFT WSN containing on average c04-math-0587 and c04-math-0588 clusters. We observe that the MTTF improvement for both number of clusters decreases with increasing c04-math-0589 when c04-math-0590 (similar trend as for a WSN cluster and for a sensor node). The MTTF percentage improvement for an FT WSN with c04-math-0591, c04-math-0592, is 87.56% for c04-math-0593 and drops to 3.3% for c04-math-0594. Similarly, the MTTF percentage improvement for an FT WSN with c04-math-0595, c04-math-0596, is 88.2% for c04-math-0597 and drops to 3.26% for c04-math-0598. We observe that the MTTF improvement for an FT WSN with c04-math-0599 is 100% on average for all c04-math-0600 values and is greater than an FT WSN with c04-math-0601. The MTTF percentage improvement for an FT WSN with c04-math-0602 over an FT WSN with c04-math-0603 is 52.22% on average.

We also investigate iso-MTTF for WSNs. Table 4.10 depicts the MTTFs for an FT WSN and an NFT WSN for c04-math-0604 varying from 0.1 to 0.99 and when c04-math-0605 = 0. Results indicate that an NFT WSN requires nine redundant NFT WSN clusters to achieve a comparable MTTF as that of an FT WSN.

Table 4.10 Iso-MTTF for WSNs (c04-math-0606). c04-math-0607 denotes the redundant WSN clusters required by an NFT WSN to achieve a comparable MTTF as that of an FT WSN

Probability MTTF FT (days) MTTF NFT (days) NFT
c04-math-0608 c04-math-0609 c04-math-0610 c04-math-0611
0.1 2121.6 2135.7 9
0.3 627.62 631.77 9
0.5 323.28 326.12 9
0.7 186.8 188.74 9
0.99 48.387 48.71 9

4.7 Research Challenges and Future Research Directions

WSNs are susceptible to a plethora of different fault types and external attacks after the WSN's deployment. Fault detection and tolerance in WSNs is paramount especially for safety-critical WSNs such as military target applications, biomedical applications (body sensor networks), and volcanic eruption detection systems. Although there has been research done in fault diagnosis and FT in WSNs, various issues remain to be addressed. This section highlights research challenges and future research directions for development and deployment of reliable and trustworthy WSNs.

4.7.1 Accurate Fault Detection

Our Markov modeling results in Section 4.6 reveal that the fault detection algorithm's accuracy plays a crucial role in realizing FT WSNs. An accuracy close or equal to 100% and a false alarm rate close or equal to 0% is desirable for fault detection algorithms. There is a need to develop novel fault detection algorithms and/or improve the existing fault detection algorithms with the objective of attaining c04-math-0612100% fault detection accuracy.

4.7.2 Benchmarks for Comparing Fault Detection Algorithms

There is a need to develop standard benchmarks for comparing different fault detection algorithms for various fault models and topologies. The development of these benchmarks would facilitate designers to select an appropriate fault detection algorithm for WSNs with given FT requirements.

4.7.3 Energy-Efficient Fault Detection and Tolerance

Since sensor nodes have stringent resource constraints, fault detection and FT approaches for WSNs need to be energy efficient. Although initiatives have been taken in developing DFD approaches to minimize energy overhead, the area requires further research. Moreover, FT techniques developed for WSNs need to be energy efficient. The impact of added redundancy in sensor nodes to provide FT not only has economic cost overhead but can also have power consumption overhead, which needs to be considered and minimized during the design. Energy efficiency in fault detection and tolerance can also be achieved by exploiting near-threshold computing (NTC) in sensor node design. NTC refers to using a supply voltage (c04-math-0613) that is close to a single transistor's threshold voltage c04-math-0614 (generally c04-math-0615 is slightly above c04-math-0616 in near-threshold operation, whereas c04-math-0617 is below c04-math-0618 for subthreshold operation). Lowering the supply voltage reduces power consumption and increases energy efficiency by lowering the energy consumed per operation. The widespread adoption of NTC in sensor node designs for reduced power consumption requires addressing NTC challenges such as increased process, voltage, and temperature variations, subthreshold leakage power, and soft error rates.

4.7.4 Machine-Learning-Inspired Fault Detection

Machine learning-based approaches can be leveraged to classify normal and anomalous data. Since sensor nodes are resource constrained, innovative approaches are required to extract features from the sensed data. The machine learning techniques for sensor nodes need to reduce the training data size for the classifier (e.g., SVM) based on extracted features. The goal of machine-learning-inspired fault detection is to classify normal and anomalous data in or near real time. There exist some initiatives in machine-learning-inspired fault detection [110]; however, further work is required in this domain for accurate and energy-efficient fault detection.

4.7.5 FT in Multimedia Sensor Networks

Fault detection and tolerance in WMSNs is a relatively new research field [115]. FT data aggregation in WMSNs is imperative to reduce the throughput of data transmission, energy conservation, accuracy of event detection, and alleviate the interference from compromised nodes. Another research challenge in WMSNs is to ensure the trustworthiness of aggregated results. Since multimedia data sensed by sensor nodes require compute-intensive processing, multicore-based design of sensor nodes could meet the performance, energy, and FT requirements of WMSNs. Accurate fault detection, FT, reliability, and security in WMSNs require further research endeavors.

4.7.6 Security

Security is critical for many WSNs, such as military target tracking and biomedical applications. Security aspects are crucial to consider in developing a reliable, robust, and trustworthy WSNs. Since sensor nodes use wireless communication, WSNs are susceptible to eavesdropping, message injection, replay, and other attacks. In many situations, an adversary is capable of deploying malicious nodes in the WSN or compromising some legitimate sensor nodes. Since the sink node is a gateway between a WSN and the outside world, compromising the sink node could render the entire WSN useless. Hence, security integration in sink nodes is imperative. The security integration objective in WSNs is to provide confidentiality, integrity, authenticity, and availability of all messages in the presence of adversaries. In a secure and reliable WSN, every eligible receiver should receive all messages intended for that receiver. Furthermore, the receiver should be able to verify the integrity of each message as well as the identity of the sender.

Integration of security and privacy in sensor nodes is challenging because of the sensor nodes' resource constraints in terms of computation, communication, memory/storage, and energy supply. The security mechanisms for prevalent end-to-end systems (e.g., secure socket layer (SSL)) are not directly applicable to WSNs. In traditional end-to-end systems, it is neither necessary nor desirable for the contents of the message (except the message headers) to be available to the intermediate routers. On the contrary, in WSNs, the dominant traffic pattern is many-to-one (i.e., from sensor nodes to the sink node) [138]. Each intermediate sensor node in a WSN also needs access to message contents to perform in-network processing, such as data aggregation and data compression, to conserve energy. Security incorporation in WSNs has several aspects, some of which are discussed below.

Time Synchronization: Due to the collaborative nature of sensor nodes, secure time synchronization is important for various WSN operations, such as coordinated sensing tasks, sensor scheduling, target tracking, data aggregation, and time division multiple access (TDMA) medium access control. The network time protocol (NTP), which is used for synchronization in the Internet, cannot be directly used by WSNs due to a sensor node's constrained resources. Sometimes, synchronization protocols for WSNs have been proposed in the literature such as reference-broadcast synchronization (RBS) [139] and timing-sync protocol for sensor networks (TPSN) [140]; however, security is not considered in the design of these synchronization protocols.

Sensor Location Discovery: Sensor node location is important for many WSN applications, such as target tracking and environmental monitoring. The geographical routing protocols developed for WSNs also require sensor node location information to make routing decisions. Sensor location discovery protocols make use of special nodes known as beacon nodes, which are assumed to know their own location via global positioning system (GPS) or manual configuration. Nonbeacon nodes receive radio signals (reference signals) from beacon nodes. These reference signals contain the location of beacon nodes. The nonbeacon nodes then determine their own location based on features of the reference messages, such as using received signal strength indicator (RSSI). Without any protection, the adversary has the potential to mislead the location estimation at sensor nodes. For example, an attacker may provide incorrect location references by replaying the beacon signals intercepted in different locations. A more serious attack launched by an adversary could be compromising a beacon node and distributing malicious location references by manipulating the beacon signals (e.g., changing the signal strength if RSSI is used to estimate the distance). In the presence of such attacks, nonbeacon nodes will determine their location incorrectly. There are a few methods to counter these attacks and to increase the reliability and trustworthiness of WSN. For instance, voting-based location references use voting of different received location references from different beacon nodes; however, this technique would increase the cost of the WSN as the approach would require more beacon nodes equipped with GPS.

Secure Routing: An adversary can also impact routing within a WSN, which consequently degrades the reliability of the WSN. Secure routing protocols for MANETs or wired networks are not suitable for WSNs because these protocols are computationally expensive and are meant for routing between any pair of nodes and not for the many-to-one traffic pattern prevalent in WSNs.

Intrusion Detection: Most of the existing WSNs are designed without considering security and intrusion detection. WSNs are more prone to malicious intruders as compared to traditional wired networks due to an open medium, a low degree of physical security, the dynamic topology, and a limited power supply. Security vulnerabilities in WSNs can be alleviated using an intrusion detection system (IDS). Intrusion detection is a security technology that attempts to identify those who are trying to break into and misuse a system without authorization as well as those who have legitimate access to the system but are abusing their privileges [141]. By modeling the behavior of normal sensor node activities and operations, an IDS can identify potential intruders and maintain trustworthy operation of a WSN. IDSs are of two types: misuse-based detection and anomaly-based detection. A misuse-based IDS encodes known attack signatures and vulnerabilities and stores this information in a database. An alarm is generated by the IDS if a discrepancy exists between the current activities and stored signatures. A misuse-based IDS cannot detect novel attacks because of the lack of corresponding signatures. An anomaly-based IDS creates normal profiles of sensor node behavior and compares them with current activities. If a significant deviation in a sensor node's behavior is observed, the anomaly-based IDS raises an alarm. An anomaly-based IDS can detect unknown attacks; however, developing profiles of a sensor node's normal behavior is challenging.

4.7.7 WSN Design and Tuning for Reliability

WSNs are more susceptible to failure than wired networks due to a WSN's deployment in unattended and often hostile environments. The vulnerability of sensor nodes to failures and attacks from malicious intruders requires reliability and security incorporation in the design of sensor nodes and WSNs. For instance, in the case of failures of some sensor nodes or some sensor nodes being compromised by an adversary, a WSN should be able to detect these failures and then adapt accordingly to remain operational. For example, if a few nodes are diagnosed as misbehaving or failed, routing protocol should be able to adapt to the situation and find alternative routes. This adaptation may require some sensor nodes to increase their transmission energy to get their sensed information to healthy nodes. Furthermore, the frequency of packet transmission can be reduced in such situations to conserve energy of sensor nodes to compensate for the increased transmission energy. This tuning of sensor nodes and the WSN would maintain the reliable operation of WSN although at a reduced efficiency. Moreover, the WSN manager should be alarmed of the situation to take necessary actions to increase the efficiency of the WSN, including removal of identified malicious nodes and placement of new healthy sensor nodes in the WSN.

4.7.8 Novel WSN Architectures

To maintain reliable and secure operation of WSNs in the presence of failures and malicious intruders, innovative WSN architectures are desired. For instance, a heterogeneous hierarchical multicore embedded wireless sensor network (MCEWSN) can better meet performance and reliability requirements of WSN [142]. The heterogeneity in the architecture subsumes the integration of numerous single-core embedded sensor nodes and several multicore embedded sensor nodes. A hierarchical architecture comprising of various clusters and a sink node is appropriate for large WSNs since in small WSNs, sensor nodes can send the sensed data directly to the sink node. Each cluster comprises several leaf sensor nodes and a cluster head. Leaf sensor nodes embed a single-core processor and are responsible for sensing, preprocessing, and transmitting the sensed data to the cluster head nodes. Cluster head nodes embed a multicore processor and are responsible for coalescing/fusing the data received from leaf sensor nodes for transmission to the sink node. Multicore embedded sensor nodes enable energy savings over conventional single-core embedded sensor nodes in two ways. First, multicore embedded sensor nodes reduce the energy expended in communication by performing in situ computation of sensed data and transmitting only processed information. Second, a multicore embedded sensor node permits the computations to be split across multiple cores while running each core at a lower processor voltage and frequency, as compared to a single-core system, which results in energy savings. Multicore cluster heads are also amenable for supervising a fault diagnosis algorithm within the cluster. Additionally, security primitives could be integrated in these multicore cluster heads to thwart attacks from malicious intruders or malicious sensor nodes.

4.8 Chapter Summary

This chapter provided a comprehensive research on modeling and analysis of fault detection and tolerance in WSNs. To elucidate fault detection in WSNs, we provided a taxonomy for fault diagnosis in WSNs. We simulated prominent fault detection algorithms using nsc04-math-06192 to determine the algorithms' accuracies and false alarm rates. We further analyzed the effectiveness of fault detection algorithms under conditions modeled from real-world data. We proposed an FT duplex sensor node model based on the novel concept of determining the coverage factor using a sensor node's fault detection algorithm's accuracy. We developed comprehensive Markov models that hierarchically encompass sensor nodes, WSN clusters, and the overall WSN. Our Markov models characterized WSN reliability and MTTF for different sensor failure probabilities. Our models could assist WSN designers to better meet application requirements by determining the reliability and MTTF in the predeployment phase.

Our nsc04-math-06202 simulation results of fault detection algorithms suggested that both the simulated and real-world data needed to be considered for rigorous evaluation of fault detection algorithms. Although some algorithms performed better than other algorithms on simulated data, these algorithms performed inferior to other algorithms on real-world data. We observed that the fault detection algorithm's accuracy played a crucial role in FT WSNs by comparing our results against a perfect fault detection algorithm (c04-math-0621). The results indicated that the percentage improvement in reliability for an FT sensor node with c04-math-0622 over an NFT sensor node is 230% and the percentage improvements in reliability for an FT sensor node with c04-math-0623 over an FT sensor node (c04-math-0624) with an average number of neighbors c04-math-0625 equal to 5, 10, and 15 was 172.36%, 166.31%, and 165.32%, respectively, for c04-math-0626. The percentage improvement in reliability for an FT WSN cluster with c04-math-0627 over an NFT WSN cluster and an FT WSN cluster with c04-math-0628 was 601.12% and 142.73%, respectively, for c04-math-0629. The percentage improvement in reliability for an FT WSN with c04-math-0630 over an NFT WSN and an FT WSN with c04-math-0631 was 350.18% and 236.22%, respectively, for c04-math-0632.

Results indicated that our FT model could provide on average a 100% MTTF improvement with a perfect fault detection algorithm, whereas the MTTF improvement varied from 95.95% to 1.34% due to a fault detection algorithm's typical poor performance at high sensor failure rates. We also observed that the redundancy in WSNs plays an important role in improving WSN reliability and MTTF. Results revealed that just three redundant sensor nodes in a WSN cluster resulted in an MTTF improvement of 103.35% on average. Similarly, redundancy in WSN clusters contributes to the reliability and MTTF improvement, and the results indicated that three redundant WSN clusters could improve the MTTF by 52.22% on average. The iso-MTTF results indicated that an NFT WSN cluster required three redundant sensor nodes to attain a comparable MTTF as that of an FT WSN cluster. Iso-MTTF results further indicated that an NFT WSN required nine redundant WSN clusters to achieve a comparable MTTF as that of an FT WSN.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset