Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 4
Modeling and Analysis of Fault Detection and Fault Tolerance in Embedded Wireless Sensor Networks^*

Wireless sensor networks (WSNs) consist of spatially distributed autonomous sensor nodes that collaborate with each other to perform an application task. A sensor node comprises a sensing unit (a sensing unit contains sensors such as temperature and humidity sensors), a processing unit, a storage unit, a communication unit, a power unit, an actuator unit, and an optional location finding unit [44].

Figure 4.1 shows the WSN architecture that we consider in this chapter [81–83]. The sensor nodes distributed in the sensor field gather information (sensed data or statistics) about an observed phenomenon (e.g., environment, target) using attached sensors. A group of sensor nodes, located geographically close to each other, is called a WSN cluster. Each WSN cluster has one cluster head (WSN cluster formation, and cluster head determination/maintenance is beyond the scope of this chapter). Sensed data within a WSN cluster is collected by the cluster head and is relayed to a sink node (or base station) via the sensor nodes' ad hoc network. The sink node transmits the received information back to the WSN designer and/or WSN manager via a gateway node connected to a computer network. The WSN designer is responsible for designing the WSN for a particular application to meet application requirements such as lifetime, reliability, and throughput. After WSN design and deployment, the WSN manager manages WSN operations, such as data analysis, monitoring alive and dead sensor nodes, and alarm conditions (e.g., forest fire, volcano eruption).

c04f001 — **Figure 4.1** Wireless sensor network architecture

WSN research and design have gained significance in recent years because of the increasing proliferation of WSNs in a wide variety of application domains. These application domains include, but are not limited to, mission-critical (e.g., security, defense, space, satellite) or safety-related (e.g., health care, active volcano monitoring) systems. The utility of WSNs for safety-related systems can be illustrated by a volcano monitoring example. Studying active volcanoes typically requires sensor arrays to collect seismic and infrasonic (low-frequency acoustic) signals. Distributed sensor nodes in a WSN provide a correlated spatial distribution of seismic and infrasonic events that greatly facilitate scientific studies of wave propagation phenomena and volcanic source mechanisms [84]. The analytics of seismic and infrasonic data obtained by a WSN can help the scientists in better prediction of volcanic events, such as volcanic eruption and earthquakes. Although WSNs can benefit numerous applications, the realization of WSNs for an application requires addressing various design challenges.

A crucial challenge in WSN design is to meet varying application requirements (e.g., lifetime, throughput, reliability), given the limited energy, computation, and storage available on sensor nodes. Sensor nodes are often deployed in hostile environments, such as forests, the ocean floor [85], animal habitats [86], and active volcanoes [84], that further exacerbates the design challenge of meeting application requirements. Unattended and/or hostile deployment environments make sensor nodes more susceptible to failures than other systems [84], and manual inspection of sensor nodes after deployment can be impractical. Although faults can occur in any of the sensor node components, sensors and actuators have significantly higher fault rates than other semiconductor-based systems. For example, NASA aborted the launch of space shuttle Discovery [87] because of a sensor failure in the shuttle's external tank [88]. Sensor failure can occur due to deployment impact, accidents (animal, vehicular), fire, extreme weather, or aging [89]. Failed sensor nodes may result in sensor network partitioning (i.e., sensor nodes become isolated and are disconnected from the sensor network), reduced WSN availability, and WSN failure. In order to meet application requirements in the presence of sensor failures, incorporation of fault detection and fault tolerance (FT) mechanisms in WSNs is imperative.

While traditional reliability models can be readily applied to FT systems, faults do not always result in failure. Transient, malicious errors occur due to noise, voltage levels, and broken components. Therefore, it is imperative to model the ability for fault detection to provide coverage against such faults to protect mission-critical data. Fault detection includes distributed fault detection (DFD) algorithms that identify faulty sensor readings to diagnose a faulty sensor. DFD algorithms do not incur additional transmission cost as they use existing network traffic to identify sensor failures. The accuracy of a fault detection algorithm signifies the algorithm's ability to accurately identify faults. We analyze and study two well-cited fault detection algorithms and compare their performance under realistic conditions. We have implemented and simulated the DFD algorithms incorporating the protocol stack and a realistic WSN topology as opposed to only algorithm simulation that was done in most previous works [90, 91].

Although fault detection helps in isolating faulty sensors, FT incorporation in WSNs is imperative to reliably accomplish application tasks. A prominent FT technique is to add redundant hardware and/or software [92]. Stringent design constraints (e.g., power, cost) differentiate WSNs from other systems, and consequently the added redundancy for FT must justify the additional cost. Since sensors (e.g., temperature, light, motion) attached to sensor nodes have comparatively higher fault rates than other components (e.g., processor, transceiver) [88, 93, 94], sensor redundancy would be the most effective to enhance FT capability of sensor nodes. Fortunately, sensors are cheap and adding redundant spare sensors would add little to the individual sensor node's cost.

Although FT is a well-studied research field [95–100], fault diagnosis and FT for WSNs are relatively unstudied. Varying FT requirements across different applications increase the complexity of fault detection and FT for WSNs. For instance, mission-critical applications have relatively high reliability requirements as compared to non-mission-critical applications such as ambient conditions monitoring. To the best of our knowledge, there exists no sensor node model to provide better reliability for mission-critical applications. Since applications are typically designed to operate reliably for a certain period of time (i.e., lifetime requirement), FT metrics such as reliability and mean time to failure (MTTF) need to be considered in WSN design. WSN designers require a model to estimate FT metrics and determine necessary redundancy during design time. Unfortunately, the literature provides no rigorous mathematical model that provides insights into WSN reliability and MTTF. Previous works study fault detection and FT in isolation, and their synergistic relationship has not been investigated in the context of WSNs.

This chapter's highlights are as follows:

Fault diagnosis in WSNs through various fault detection algorithms.
Simulation of fault detection algorithms using ns $c04-math-0001$ 2 to determine the algorithms' accuracy and false alarm rates. ns $c04-math-0002$ 2 is a discrete event simulator that provides support for the simulation of networks and routing protocols over wired and wireless networks [101]. We further analyze the effectiveness of fault detection algorithms under conditions modeled from the real-world data.
Proposal of an FT sensor node model consisting of duplex sensors (i.e., one active sensor and one inactive spare sensor) that exploits the synergy of fault detection and FT. Although our model can be extended for sensors with N-modular redundancy (NMR) [92], we suggest duplex sensor nodes to minimize the additional cost.
Characterization of FT parameters, such as coverage factor, by exploiting the synergy between fault detection and FT.
Proposal of hierarchical Markov models to characterize WSN reliability and MTTF. For the first time, we delineate reliability and MTTF hierarchically at the sensor node, WSN cluster, and WSN levels (Fig. 4.1).
Determination of iso-MTTF (isoreliability) for WSN clusters.¹ We define iso-MTTF (isoreliability) as how many redundant sensor nodes a non-fault tolerance (NFT) WSN cluster requires over an FT WSN cluster to achieve an equal MTTF (reliability).
Research challenges and future research directions for the design and deployment of reliable and trustworthy WSNs.

In our duplex sensor model, we assume that the redundant sensor is in a cold standby mode, which ideally consumes no power. Although we focus on sensor failures within the sensor node in this study due to higher sensor failure rates as compared to other sensor node components [88, 93], our model can be extended to include failures for other components within the sensor node such as the processor and transceiver. Our FT WSN modeling serves as a first step toward FT WSN modeling and can potentially facilitate further research in FT WSN modeling and evaluation.

Our Markov models are comprehensive and characterize sensor nodes, WSN clusters, and overall WSN reliability and MTTF. The proposed Markov modeling enables WSN designers to investigate the effects of different types of FT sensor nodes (e.g., duplex, NMR), number of WSN clusters, and the number of sensor nodes in the cluster on the FT of the overall WSN. Our hierarchical WSN Markov models enable designers to select an efficient WSN topology to better meet application requirements.

The remainder of this chapter is organized as follows. Section 4.1 gives a review of related work. Section 4.2 elaborates on fault diagnosis in WSNs. Section 4.3 discusses two fault detection algorithms to elucidate the fault diagnosis mechanism in WSNs. Section 4.4 describes our Markov models for characterizing WSN reliability and MTTF. Implementation and simulation of fault detection algorithms using ns $c04-math-0003$ 2 are presented in Section 4.5. Numerical results from our Markov modeling are presented in Section 4.6. Section 4.7 highlights research challenges and future research directions for the design and deployment of reliable and trustworthy WSNs. Finally, Section 4.8 concludes the chapter.

4.1 Related Work

Despite fault detection and FT being well-studied research fields [95, 98, 100], little work exists in WSN fault detection and FT. This section summarizes some of the previous works in the literature related to fault detection and FT.

4.1.1 Fault Detection

Jiang [102] proposed a DFD scheme that detected faulty sensor nodes by exchanging data and mutually testing neighboring nodes. Jian-Liang et al. [103] proposed a weighted median fault detection scheme that used spatial correlations among the sensor measurements (e.g., temperature, humidity). Lee and Choi [104] presented a DFD algorithm that identified faulty sensor nodes based on comparisons between neighboring sensor nodes' data. The DFD algorithm used time redundancy to tolerate transient faults in sensing and communication. Khilar and Mahapatra [105] proposed a probabilistic approach to diagnose intermittent WSN faults. The simulation results indicated that the DFD algorithm's accuracy increased as the number of diagnostic rounds increased where neighboring sensor nodes exchanged measurements in each round.

Ding et al. [91] proposed algorithms for faulty sensor identification and FT event boundary detection. The algorithms considered that both the faulty sensors and normal sensors in an event region could generate abnormal readings (readings that deviate from a typical application-specific range). Krishnamachari and Iyengar [106] proposed a distributed Bayesian algorithm for sensor fault detection and correction. The algorithm considered that measurement errors due to faulty equipment are likely to be uncorrelated. Wu et al. [107] presented a fault detection scheme in which the fusion center (the node that aggregated data from different nodes) attempted to identify faulty sensor nodes through temporal sequences of received local decisions using a majority voting technique. Lo et al. [108] presented a distributed, reference-free fault detection algorithm based on local pairwise verification between sensor nodes monitoring the same physical phenomenon. The authors observed that a linear relationship existed between the outputs of a pair of sensor nodes that could be exploited to detect faulty sensor nodes. Since the fault detection was done pairwise, the algorithm was able to conserve energy. Results revealed that the proposed algorithm could achieve a detection accuracy of 84% and a false alarm rate of 0.04%.

Miao et al. [109] proposed agnostic diagnosis for detecting silent failures (i.e., failures with unknown types and symptoms). The proposed detection technique exploited the fact that a sensor node's metrics (e.g., radio on-time, number of packets transmitted in a time interval) exhibited certain correlation patterns, violation of which indicated potential silent failures. The detection accuracy of the proposed scheme was close to 100% for small WSNs, whereas the detection accuracy decreased sharply and false alarm rate increased as the WSN size increased. The proposed technique required a sink node to collect data from all the sensor nodes in the WSN, which resulted in rapid energy depletion of the sink node and sensor nodes near the sink node. Furthermore, the fault detection latency of the proposed scheme was high due to its centralized nature.

There exists some work related to anomaly detection in WSNs. Bhargava and Raghuvanshi [110] proposed a method for anomaly detection in WSNs based on S-transform (an extension of continuous wavelet transform). The S-transform extracted the features from sensor nodes' data set, which were used for training of support vector machine (SVM). The SVM was then used for the classification of normal and anomalous data. Results revealed that the proposed scheme's accuracy for data classification ranged from 78% to 94%. Salem et al. [111] proposed an anomaly detection algorithm for medical WSNs. The proposed algorithm first classified instances of sensed patient attributes as normal and abnormal. The algorithm then used regression prediction to discern between a faulty sensor reading and a patient entering into a critical state. The results demonstrated the algorithm's ability to achieve a relatively low false alarm rate ( $c04-math-0004$ 1%) and a good detection accuracy (the attained accuracy was not specified) for the medical WSN application.

4.1.2 Fault Tolerance

In work related to FT for WSNs, Koushanfar et al. [94] proposed an FT scheme that provided backup for one type of sensor using another type of sensor, but they did not propose any FT model. Clouqueur et al. [112] presented algorithms for collaborative target detection in the presence of faulty sensors. Chiang et al. [113] designed system-level test interfaces for remote testing, repair, and software upgrade for sensor nodes. The authors evaluated the proposed test interfaces on Texas Instrument's MSP430 microcontroller-based sensor nodes. Experimental results revealed that the test interfaces with double, triple, and quadruple redundancies increased the WSN's availability.

Krasniewski et al. [114] proposed a protocol to diagnose and mask arbitrary sensor node failures in an event-driven WSN. The protocol considered a clustered WSN with rotating cluster heads. The protocol assigned each sensor node a trust index to indicate the sensor node's track record in reporting past events correctly. The cluster head analyzed the event reports using the trust index and made event decisions. The authors simulated the protocol using ns $c04-math-0005$ 2 and observed that the protocol was able to detect events correctly even if more than 50% of the sensor nodes were compromised. Sun et al. [115] introduced a trust mechanism to evaluate the trustworthiness of a sensor node and the sensor node's data. The authors defined memory depth to characterize the temporal correlation and aggregation bandwidth to characterize spatial correlation. A sensor node used memory depth to calculate self-trustworthiness of the sensor node's data based on stored historical/past data and the current data. When a sensor node reported its sensed data to the aggregator, the aggregator's trustworthiness to the reported result depended both on the trustworthiness of the sensor node reporting data and the trustworthiness of the sensor node to its sensed data. Subsequently, each aggregator reported the aggregated result and the aggregator's self-trust opinion to the upper layer aggregator progressively, and so on until the aggregated result reached the sink node. Results demonstrated the effectiveness of the proposed mechanism for continuous media streaming and discrete data in wireless multimedia sensor networks (WMSNs).

There exists some work on providing FT in WSNs by deploying relay nodes and considering connectivity as an FT metric. Relay nodes communicate with sensor nodes, other relay nodes, and sink nodes to prolong WSN lifetime. Zhang et al. [116] developed approximation algorithms for determining a minimum number of relay nodes along with the relay nodes' placement to achieve certain connectivity requirements. Han et al. [117] considered the problem of deploying relay nodes to provide FT in heterogeneous WSNs where sensor nodes had different transmission radii. They developed approximation algorithms for full and partial FT relay node placement. Full FT relay node placement deployed a minimum number of relay nodes to establish disjoint paths between every sensor and/or relay node pair, whereas partial FT relay node placement only considered sensor node pairs. Baldi et al. [118] evaluated gossip algorithms, which are distributed algorithms that distribute computational burden across all nodes and considered connectivity as an FT metric.

Sen et al. [119] introduced region-based connectivity: an FT metric defined as the minimum number of nodes within a region whose failure would disconnect the network. Alwan and Agarwal [120] provided a survey of FT routing techniques in WSNs. Souza [121] presented a framework for failure management in WSNs, focusing on fault diagnosis and recovery techniques. The presented FT framework mitigated the failure propagation in a business (enterprise) environment by implementing different FT techniques.

4.1.3 WSN Reliability Modeling

There exists some work on FT WSN modeling. Cai et al. [122] presented a reliability model to prolong the network lifetime and availability based on connectivity and coverage constraints. Zhu and Papavassiliou [123] presented a model that characterized sensor connectivity and investigated the trade-offs among sensor node connectivity, power consumption, and data rate. They also discussed the impact of sensor connectivity on system reliability. Vasar et al. [124] presented Markov models for WSN reliability analysis. They presented a reliability comparison for various numbers of defective components' replacements with hot-standby redundant components. Xing and Michel [125] presented WSN reliability and security modeling in an integrated manner. Their modeling technique differentiated two types of WSN failures: security failures due to malicious intrusions and traditional failures due to malfunctioning components.

Moustapha and Selmic [88] used recurrent neural networks to model sensor node dynamics for sensor fault detection. Their network model corresponded to the WSN topology such that the recurrent neural network input was taken from the modeled sensor node and neighboring sensor nodes. Kannan and Iyengar [126] developed a game-theoretic model of reliable length- and energy-constrained routing in WSNs. They showed that optimal length-constrained paths could be computed in polynomial time in a distributed manner using geographic routing. Mukhopadhyay et al. [127] presented a method that used sensor data properties to enable reliable data collection. The method consisted of predictive models based on temporal correlation in sensor data. They demonstrated that their method could handle multiple sources of errors simultaneously and could correct transient errors arising in sensor node hardware and wireless communication channels.

Although DFD algorithms were proposed in the literature for detecting sensor faults, the fault detection was not leveraged to provide FT. There exists work in the literature regarding FT in WSNs [94, 116, 117, 119, 122]; however, FT metrics such as reliability and MTTF were not investigated rigorously in the context of WSNs. Specifically, reliability and MTTF were not considered hierarchically at the sensor node, WSN cluster, and WSN level. This hierarchical characterization of FT metrics facilitates a WSN designer to determine an appropriate WSN topology (i.e., number of clusters and sensor nodes with required FT capability in each cluster).

4.2 Fault Diagnosis in WSNs

The sensor nodes in WSNs are often deployed in unattended and possibly hostile environments that make sensor nodes susceptible to failures. Faults in sensor nodes can occur at various levels, such as processing unit, transceiver, storage unit, power unit, sensors, and actuators. Since faults are inevitable, it is imperative to determine faulty and fault-free sensor nodes in the WSN. Traditional fault diagnosis techniques that are developed for multiprocessor systems are not directly applicable to WSNs due to the sensor nodes' stringent resource constraints. This section describes sensor faults and then elaborates on fault diagnosis in WSNs.

4.2.1 Sensor Faults

The faults in sensor nodes include permanent faults where a node becomes dead and Byzantine faults where the node behaves arbitrarily or maliciously. Figure 4.2 shows an example of Byzantine faults where sensor nodes A–E are equipped with temperature sensors and process the measured reading to give temperature in Fahrenheit. The sensor nodes send the sensor node's sensed temperature readings to other neighboring nodes. The temperature sensor in the sensor node D behaves maliciously and sends inconsistent and arbitrary temperature values to other nodes (60, 100, 150, and 10 to A, B, C, and E, respectively). The rest of the sensor nodes send consistent values to other neighboring nodes. As a consequence, nonfaulty sensor nodes obtain different global information about the temperature in the region and may conclude on a false value of the overall temperature of the region. The fault detection algorithm needs to be robust to such inconsistent behavior that can jeopardize the collaboration in the WSN [112].

c04f002 — **Figure 4.2** Byzantine faulty behavior in WSNs

Several efforts have been made to classify malicious sensor behavior [93, 128]. While different sensor fault taxonomies have been proposed, three particular behaviors are targeted by DFD methods. The outlier faults occur when a sensor signal spikes in value (Fig. 4.3(a)) and can be detected by comparing sensor readings to previous readings or neighboring sensor nodes' readings. The stuck-at faults occur when the output remains at a constant level (Fig. 4.3(b)). The noisy faults occur when the signal-to-noise ratio of the sensor's output is low, resulting in random data (Fig. 4.3(c)). An effective fault detection algorithm must be able to identify the broadest possible range of malicious output while minimizing fault positives. Because the broad range of sensor failures result in a mixed detection rate for different algorithms, accurate simulation is important for studying sensor failures [93].

c04f003 — **Figure 4.3** Various types of sensor faults [93]: (a) outlier faults; (b) stuck-at faults; (c) noisy faults

4.2.2 Taxonomy for Fault Diagnosis Techniques

A fault diagnosis system is a monitoring system that detects faulty sensor nodes and their location in the WSN. Fault diagnosis techniques are classified based on the nature of tests, correlation between sensor readings, and characteristics of sensor nodes [129]. The key component of a fault diagnosis system is a fault detection algorithm, and the key terminology includes correctness, latency, detection accuracy, and false alarm rate. A fault diagnosis is correct if no fault-free sensor nodes are mistakenly diagnosed as faulty. Latency is defined as the time elapsed since the appearance of the fault to the isolation of the faulty sensor node. Detection accuracy is defined as the ratio of the number of faulty sensor nodes detected to the actual number of faulty sensor nodes in the network. The accuracy of the fault detection algorithm is a crucial factor in maintaining reliable WSN operation. The ratio of the number of fault-free sensor nodes diagnosed as faulty to the actual number of fault-free sensor nodes is the false alarm rate. Fault detection techniques for WSN can be broadly categorized into centralized and distributed approaches [129].

4.2.2.1 Centralized Approaches

In a centralized approach, a geographically or logically centralized arbiter (i.e., sink node or base station) with higher computational power, larger memory size, and a greater energy supply than ordinary sensor nodes is responsible for fault detection and diagnosis of the overall WSN. The sink node periodically sends diagnostic queries into the network to obtain the state of the individual sensor nodes. The sink node then analyzes the received diagnostic response messages to determine faulty and fault-free sensor nodes. Centralized approaches are accurate; however, these approaches cannot be scaled for large-scale WSNs. The scalability of centralized approaches is limited because it is expensive for the sink node to accumulate and analyze diagnostic information from all the sensor nodes in the WSN. Furthermore, centralized approaches lead to rapid energy depletion in certain regions of the network, in particular, nodes closer to the sink node. The energy depletion in sensor nodes closer to the sink node causes network partitions, which results in unmonitored areas in the WSN. Moreover, the detection latency of centralized approaches is large due to multihop communication. Owing to these limitations of centralized approaches, distributed approaches are highly preferable in WSNs.

4.2.2.2 Distributed Approaches

In distributed approaches, each sensor node executes the fault detection algorithm and generates a localized fault view. The localized fault view is a sensor node's view regarding the fault states of the sensor node's one-hop neighbors. This localized fault view is then disseminated in the network such that each fault-free sensor node correctly diagnoses the state of all the sensor nodes in the network. Distributed approaches conserve the sensor nodes' energy and consequently prolong the WSN lifetime. Distributed approaches can be classified into various types [129]:

Test-Based Approaches: In test-based approaches, different tests (tasks) are assigned to sensor nodes and faulty sensor nodes are identified based on test results. Test-based approaches are further classified into invalidation-based and comparison-based approaches.

In invalidation-based approaches, every sensor node tests a set of sensor nodes. Each sensor node then passes the test results to other sensor nodes based on which a consensus is made on the faulty sensor nodes. Each sensor node must be tested by at least t one-hop neighbors to achieve a t-diagnosability. The greater the t, the greater the message overhead. In most of invalidation-based approaches, the number of faulty sensor nodes that can be detected is upper bounded by t. The detection accuracy of these approaches is close to 100% and the false alarm rate is close to zero; however, the detection latency exhibited by most of these approaches is high.

In comparison-based approaches, different tests (tasks) are assigned to a pair of sensor nodes and the results of these tasks are compared. The agreement and disagreement between these results form the basis for diagnosing faulty sensor nodes. Comparison-based approaches have comparatively less message and time complexity overhead than the invalidation-based approaches.

Neighbor Coordination Approaches: In neighbor coordination approaches, one-hop neighbors coordinate with each other to determine whether or not to disregard their own sensor readings based on the neighboring sensors readings or on the weights based on physical distances from the event and trustworthiness of the sensor node's measurements. Neighbor coordination approaches are further categorized into majority voting and weighted majority voting approaches.

Majority voting approaches exploit the fact that the faulty measurements are uncorrelated while nonfaulty measurements are spatially correlated. For example, assume that a sensor node $c04-math-0006$ is a neighbor of $c04-math-0007$ , and $c04-math-0008$ and $c04-math-0009$ are the readings for $c04-math-0010$ and $c04-math-0011$ , respectively. Sensor readings $c04-math-0012$ and $c04-math-0013$ are similar when $c04-math-0014$ where $c04-math-0015$ is application dependent and typically a small number. Majority voting techniques can provide good detection accuracy and low false alarm rates. However, majority voting techniques are topology dependent as the effectiveness of these approaches depends on the node degree in the WSN.

Weighted majority voting approaches weigh the sensing measurements based on properties such as physical distance from the event and trustworthiness of the measurements. Unlike simple majority voting, these weighted measurements are used to decide the state of a sensor node (i.e., faulty or nonfaulty) in a WSN. Weighted majority voting approaches are computationally more expensive than simple majority voting approaches. Weighted majority voting approaches are also topology dependent such as simple majority voting approaches (i.e., majority voting approaches accuracy degrades drastically in sparse networks).

Hierarchical Detection: In hierarchical detection, a spanning tree with the sink node as the root node is constructed that spans all fault-free sensor nodes in the WSN. This spanning tree is then used to determine faults at each level of the tree. The fault results are disseminated across the spanning tree such that each node in the network correctly diagnoses the fault states of all the sensor nodes in the WSN. The detection latency in a hierarchical detection approach is high because the diagnosis process is started either by the sink node or the leaf nodes and requires multihop communication of diagnostic messages. Furthermore, similar to the centralized approach, the hierarchical detection approach leads to rapid energy depletion in certain regions of the WSN, in particular, in sensor nodes close to the sink.

Node Self-Detection: In sensor node self-detection, the sensor node detects its own status (faulty or nonfaulty) with the help of additional hardware incorporated in the sensor node. Since the sensor node self-detection approach requires additional hardware, the approach increases hardware complexity, weight, and energy consumption.

Clustering-Based Approaches: Clustering-based approaches divide the overall WSN into clusters, which are groups of sensor nodes located geographically close to each other. Each cluster head executes a fault detection algorithm in the cluster head's group using a centralized or distributed approach. Clustering-based diagnostic approaches are communication-efficient; however, the cluster head requires more energy for leading the diagnostic process.

Watchdog Approaches: In watchdog approaches, a sensor node monitors whether the sensor node's packets are forwarded by the sensor node's one-hop neighbor by overhearing the communication channel. If a sensor node's neighboring node does not forward a packet within a certain time, the neighboring node is viewed as misbehaving. When the misbehaving rate exceeds a threshold, the misbehaving node is diagnosed as faulty. The source node then sends packets along other routes avoiding the diagnosed faulty node.

Probabilistic Approaches: Probabilistic approaches exploit the fact that sensor failure probability for different sensor nodes is not identical in a time interval that is not infinitesimally small. Probabilistic approaches normally leverage Bayesian or other statistical methods for fault diagnosis.

Event-Driven Diagnosis: In event-based WSNs as opposed to data-driven or query-driven WSNs, sensor nodes only report events of interest to the sink node in a timely manner. Fault diagnosis in event-driven WSNs requires special consideration for fault-event disambiguation since an event also causes abnormal data to be sensed by the sensor nodes that could be interpreted as a faulty measurement. Event detection approaches work effectively in dense networks with relatively low fault probability of sensor nodes. Most of the event detection approaches fail in distinguishing between events and faults if faults are located at the event boundary.

4.3 Distributed Fault Detection Algorithms

DFD algorithms are those in which each sensor node executes a fault detection algorithm to diagnose the status of the sensor node's neighboring sensor nodes and to obtain a localized fault view (Section 4.2.2.2). The localized fault view is then shared with neighboring sensor nodes to reach a consensus on faulty sensor nodes. We analyze different fault detection algorithms [90, 91, 102, 103, 106]; however, we present two fault detection algorithms for explaining the fault detection algorithms' functionalities:

4.3.1 Fault Detection Algorithm 1: The Chen Algorithm

The first algorithm, which we refer to as the Chen algorithm [90], relies on simple majority counts to determine faults. The approach is practical because sensor nodes have limited computing ability; hence, lightweight calculations are preferred over intensive calculations. Each sensor node $c04-math-0016$ can exist in four states: good (GD), faulty (FT), likely good (LG), or likely faulty (LF). A sensor node in likely states is unsure if the sensor node is faulty and uses the neighbors' states of sensor nodes to finalize the sensor node's decision. The algorithm's implementation complexity is low and the probability of correct diagnosis is high.

Table 4.1 lists the notations used in the Chen algorithm [90]. The neighboring sensor nodes in a WSN are those that are within the transmission range of each other. Each sensor node sends the sensor node's sensed values to the sensor node's neighboring sensor nodes. A sensor $c04-math-0044$ generates test result $c04-math-0045$ based on the sensor node's neighbor $c04-math-0046$ 's measurements using variables $c04-math-0047$ , $c04-math-0048$ , and threshold values $c04-math-0049$ and $c04-math-0050$ . The sensors $c04-math-0051$ and $c04-math-0052$ are both good or both faulty if $c04-math-0053$ ; however, the sensor nodes have a different good or faulty status if $c04-math-0054$ . Each sensor node sends the sensor node's tendency values (i.e., estimated state) to all the sensor node's neighboring sensor nodes. The test values from neighboring sensors determine the tendency value of sensors to be LG or LF, and the number of LG sensors with the same results determines whether the sensors are GD or FT. Precisely, $c04-math-0055$ and $c04-math-0056$ , $c04-math-0057$ must be greater than or equal to $c04-math-0058$ for $c04-math-0059$ to be identified as good. The algorithm diagnoses a good sensor $c04-math-0060$ as GD in the first round if the sensor node has less than $c04-math-0061$ bad neighbors. The Chen algorithm's important steps are depicted in Algorithm 1.

Table 4.1 Summary of notations used in the DFD algorithms

Notation	Description
$c04-math-0017$	Set of all the sensors involved in the DFD algorithm
$c04-math-0018$	Set of neighbors of sensor node $c04-math-0019$
$c04-math-0020$	Measurement of $c04-math-0021$
$c04-math-0022$	Measurement difference between $c04-math-0023$ and $c04-math-0024$
at time $c04-math-0025$ (i.e., $c04-math-0026$ )
$c04-math-0027$	Measurement time difference, $c04-math-0028$
$c04-math-0029$	Measurement difference between $c04-math-0030$ and $c04-math-0031$
from time $c04-math-0032$ to $c04-math-0033$ (i.e., $c04-math-0034$ )
$c04-math-0035$	Test between $c04-math-0036$ and $c04-math-0037$ , $c04-math-0038$ , $c04-math-0039$
$c04-math-0040$ and $c04-math-0041$	Two predefined thresholds
$c04-math-0042$	Tendency value of a sensor: $c04-math-0043$

4.3.2 Fault Detection Algorithm 2: The Ding Algorithm

The second algorithm, which we refer to as the Ding algorithm [91], is a WSN fault detection algorithm with a low computational overhead. The results indicate that the algorithm can identify faulty sensors with relatively high accuracy even when the sensor failure probability is as high as 20%. The Ding algorithm's important steps are shown in Algorithm 2.

The algorithm compares the sensor node $c04-math-0062$ 's reading with the sensor node's $c04-math-0063$ neighbors $c04-math-0064$ with measurements $c04-math-0065$ . The comparison is done by comparing $c04-math-0066$ with median $c04-math-0067$ of $c04-math-0068$ , that is,

4.1

If there are $c04-math-0070$ sensors in total, the algorithm computes $c04-math-0071$ (i.e., $c04-math-0072$ ). The mean $c04-math-0073$ of $c04-math-0074$ is given by Ding et al. [91]:

4.2

The standard deviation $c04-math-0076$ of $c04-math-0077$ is given as [91]:

4.3

Standardizing the dataset $c04-math-0079$ yields $c04-math-0080$ , that is:

4.4

If $c04-math-0082$ , then the Ding algorithm detects $c04-math-0083$ as faulty where $c04-math-0084$ is a predefined threshold value. Mathematically, $c04-math-0085$ if $c04-math-0086$ where $c04-math-0087$ denotes the set of sensors claimed as faulty.

Section 4.5 discusses the implementation and effectiveness, in particular the detection accuracy, which is used implicitly in our Markov model as the coverage factor $c04-math-0088$ , of these fault detection algorithms.

4.4 Fault-Tolerant Markov Models

In this section, we present our Markov models for FT WSNs. Our Markov models are comprehensive and encompass hierarchically the sensor node, WSN cluster, and WSN levels (Fig. 4.1). We adopt a bottom-up paradigm in our modeling by first developing a sensor node model and determining the model's reliability and MTTF [130]. The sensor node MTTF gives the average sensor node failure rate, which is utilized in the WSN cluster model. The WSN cluster model gives the WSN cluster reliability and MTTF, which determines the WSN cluster average failure rate. The WSN cluster average failure rate is then utilized in the WSN model to determine the WSN reliability and MTTF. Our bottom-up approach enables WSN reliability and MTTF characterization by leveraging sensor node and WSN cluster models. We note that some WSN architectures do not employ cluster-based hierarchy due to the additional energy cost associated with cluster formation and cluster head election. However, our modeling approach is equally applicable for WSN architectures without a cluster-based hierarchy by assuming that the WSN is composed of just one cluster where the cluster head is the sink node. For clarity, Table 4.2 summarizes important notations used in our Markov models.

Table 4.2 Summary of notations used in our Markov models

Notation	Description
$c04-math-0089$	Total number of sensors in a WSN
$c04-math-0090$	Coverage factor
$c04-math-0091$	Cumulative sensor failure probability
$c04-math-0092$	Sensor failure rate
$c04-math-0093$	Time over which sensor failure rate is specified
$c04-math-0094$	Probability of being in state $c04-math-0095$ at time $c04-math-0096$
$c04-math-0097$	Reliability of FT (duplex) sensor node
$c04-math-0098$	Mean time to failure of an FT sensor node
$c04-math-0099$	Average number of neighbor sensor nodes
$c04-math-0100$	FT sensor node failure rate with $c04-math-0101$ neighbors
$c04-math-0102$	WSN cluster reliability
$c04-math-0103$	WSN cluster failure rate with $c04-math-0104$ sensor nodes
$c04-math-0105$	Number of clusters in the WSN
$c04-math-0106$	Wsn reliability

4.4.1 Fault-Tolerance Parameters

Fault tolerance in WSNs is highly dependent on fault diagnosis in WSNs (Section 4.2). In this section, we characterize the FT parameters by exploiting the synergy between fault detection and FT in WSNs. The FT parameters leveraged in our Markov model are the coverage factor and sensor failure rate.

4.4.1.1 Coverage Factor

The coverage factor $c04-math-0107$ is defined as the probability that the faulty active sensor is correctly diagnosed, disconnected, and replaced by a good inactive spare sensor. The $c04-math-0108$ estimation is critical in an FT WSN model and can be determined by

4.5

where $c04-math-0110$ denotes the accuracy of the fault detection algorithm in diagnosing faulty sensors and $c04-math-0111$ denotes the probability of an unsuccessful replacement of the identified faulty sensor with the good spare sensor. $c04-math-0112$ depends on the sensor switching circuitry and is usually a constant, whereas $c04-math-0113$ 's estimation is challenging as different fault detection algorithms have different accuracies. We analyze different fault detection algorithms [91, 102, 103, 106] and observe that the accuracy of a fault detection algorithm depends on the average number of sensor node neighbors $c04-math-0114$ and the cumulative probability of sensor failure $c04-math-0115$ . We model $c04-math-0116$ with the empirical relation:

4.6

where $c04-math-0118$ is a function of $c04-math-0119$ and denotes an adjustment parameter that may correspond loosely to the desired average number of neighboring sensor nodes required to achieve a good fault detection accuracy for a given $c04-math-0120$ . We have derived Eq. (4.6) by experimenting with the relationship between the fault detection algorithms' parameters (i.e., the average number of sensor node neighbors), sensor failure probability, and the fault detection algorithms' accuracy. We point out that Eq. (4.6) provides a good estimate of $c04-math-0121$ in general for any fault detection algorithm, whereas exact $c04-math-0122$ for a particular fault detection algorithm can be derived from the algorithm's mathematical model. Equation (4.6) approximates the relationship between a fault detection algorithm's parameters to obtain the fault detection algorithm's accuracy (in case the accuracy is unknown). In practice, a fault detection algorithm's accuracy should be determined accurately by the algorithm's designer that would alleviate the need of using an approximation (as in Eq. (4.6)). To clarify further, we point out that our Markov models are independent of $c04-math-0123$ 's determination methodology and are equally applicable to any $c04-math-0124$ value.

4.4.1.2 Sensor Failure Rate

The sensor failure rate can be represented by exponential distribution with a failure rate of $c04-math-0125$ over the period $c04-math-0126$ [131]. The exponential distribution, which has a property of constant failure rate, is a good model for the long, flat intrinsic failure portion of the bathtub curve. The exponential distribution is applicable to a variety of practical situations since many embedded components and systems spend most of their lifetimes in the flat portion of the bathtub curve. Furthermore, any failure rate curve can be approximated by piecewise exponential distribution segments patched together. Each exponential distribution segment in the overall approximation specifies a constant failure rate over a small time unit (e.g., daily, weekly, or monthly) that is the average of the failure rates during the respective time duration. The constant failure rate assignment for each exponential distribution segment is justifiable as many natural phenomena have a constant failure rate (or occurrence rate) property (e.g., the arrival rate of cosmic ray alpha particles). The failure rate curve approximation by piecewise exponential distributions is analogous to a curve approximation by piecewise straight line segments.

The exponential model works well for the interarrival times where the total number of events in a given time period is given by the Poisson distribution. When these events trigger failures, the exponential lifetime distribution model naturally applies [132]. The cumulative distribution function (CDF) for the sensors with exponentially distributed failure rate is

4.7

where $c04-math-0128$ denotes the cumulative probability of sensor failure (for simplicity) and $c04-math-0129$ signifies the time over which $c04-math-0130$ is specified. Solving Eq. (4.7) for $c04-math-0131$ gives

4.8

4.4.2 Fault-Tolerant Sensor Node Model

As a base case, we describe an NFT sensor node Markov model (Fig. 4.4) containing one sensor (temperature sensor in this case, but the sensor type is arbitrary). The NFT sensor node model consists of two states: state 1 (good state) and state 0 (failed state). The NFT sensor node fails when the node transitions from state 1 to state 0 due to a sensor failure. The differential equations describing the NFT sensor node Markov model are

4.9

where $c04-math-0134$ denotes the probability that the sensor node will be in state $c04-math-0135$ at time $c04-math-0136$ , and $c04-math-0137$ represents the first-order derivative of $c04-math-0138$ . $c04-math-0139$ represents the failure rate of an active temperature sensor.

c04f004 — **Figure 4.4** A non-FT (NFT) sensor node Markov model

Solving Eq. (4.9) with the initial conditions $c04-math-0140$ and $c04-math-0141$ yields

4.10

The reliability of the NFT sensor node is

4.11

The MTTF of the NFT sensor node is

4.12

The average failure rate of the NFT sensor node is

4.13

Since sensors have comparatively higher fault rates than other components [88, 93, 94], we propose an FT duplex sensor node model consisting of one active sensor and one inactive spare sensor. Using TMR [92] for FT is a possible scenario, but we consider a duplex sensor node model to minimize the additional cost as the additive cost of spare sensors can be prohibitive for large WSNs. In addition, a duplex model limits the increase in sensor node size. We point out that our modeling methodology can be extended to sensor nodes with NMR.

In our duplex sensor node, the inactive sensor becomes active only once the active sensor is declared faulty by the fault detection algorithm. We refer to our duplex sensor node as an FT sensor node, whose Markov model is depicted in Fig 4.5. The states in the Markov model represent the number of good sensors. The differential equations describing the FT sensor node Markov model are

4.14

where $c04-math-0147$ denotes the probability that the sensor node will be in state $c04-math-0148$ at time $c04-math-0149$ and $c04-math-0150$ represents the first-order derivative of $c04-math-0151$ . $c04-math-0152$ represents the failure rate of an active temperature sensor and $c04-math-0153$ is the rate at which recoverable failure occurs. The probability that the sensor failure cannot be recovered is $c04-math-0154$ and the rate at which unrecoverable failure occurs is $c04-math-0155$ .

c04f005 — **Figure 4.5** FT sensor node Markov model [130]

Solving Eq. (4.14) with the initial conditions $c04-math-0156$ , $c04-math-0157$ , and $c04-math-0158$ yields

4.15

The reliability of the FT sensor node is

4.16

where subscript $c04-math-0161$ in $c04-math-0162$ stands for duplex. The MTTF of the FT sensor node is

4.17

The average failure rate of the FT sensor node depends on $c04-math-0164$ (since the fault detection algorithm's accuracy depends on $c04-math-0165$ (Section 4.4.1)):

4.18

where $c04-math-0167$ and $c04-math-0168$ denote the failure rate and MTTF of an FT sensor node with $c04-math-0169$ neighbors.

4.4.3 Fault-Tolerant WSN Cluster Model

A typical WSN consists of many clusters (Fig. 4.1), and we assume for our model that all nodes in a cluster are neighbors to each other (could be one-hop or two-hop neighbors depending on the topology). We note that our model does not impose any specific topology on a sensor network cluster. If the average number of nodes in a cluster is $c04-math-0170$ , then the average number of neighboring nodes per sensor node is $c04-math-0171$ . Figure 4.6 depicts our Markov model for a WSN cluster. We assume that a cluster fails (i.e., fails to perform its assigned application task) if the number of alive (nonfaulty) sensor nodes in the cluster reduces to $c04-math-0172$ . The differential equations describing the WSN cluster Markov model are

4.19

where $c04-math-0174$ , $c04-math-0175$ , and $c04-math-0176$ represent the FT (duplex) sensor node failure rate (Eq. (4.18)) when the average number of neighboring sensor nodes are $c04-math-0177$ , $c04-math-0178$ , and $c04-math-0179$ , respectively. For mathematical tractability and closed-form solution, we analyze a special (simple) case when $c04-math-0180$ , which reduces the Markov model to three states (Fig. 4.7). The differential equations describing the WSN cluster Markov model when $c04-math-0181$ are

4.20

c04f006 — **Figure 4.6** WSN cluster Markov model [130]

c04f007 — **Figure 4.7** WSN cluster Markov model with three states [130]

Solving Eq. (4.20) with the initial conditions $c04-math-0183$ , $c04-math-0184$ , and $c04-math-0185$ yields

4.21

where $c04-math-0187$ denotes $c04-math-0188$ in Eq. (4.21) for conciseness. The reliability of the WSN cluster is

4.22

The MTTF of the WSN cluster is

4.23

where $c04-math-0191$ denotes $c04-math-0192$ in Eq. (4.23) for conciseness. The average failure rate of the cluster $c04-math-0193$ depends on the average number of nodes in the cluster $c04-math-0194$ at deployment time:

4.24

where $c04-math-0196$ denotes the MTTF of a WSN cluster of $c04-math-0197$ sensor nodes.

4.4.4 Fault-Tolerant WSN Model

A typical WSN consists of $c04-math-0198$ clusters where $c04-math-0199$ denotes the total number of sensor nodes in the WSN and $c04-math-0200$ denotes the average number of nodes in a cluster. Figure 4.8 depicts our WSN Markov model. We assume that the WSN fails to perform the WSN's assigned task when the number of alive clusters reduces to $c04-math-0201$ . The differential equations describing the WSN Markov model are

4.25

where $c04-math-0203$ represents the average cluster failure rate (Eq. (4.24)) when the cluster contains $c04-math-0204$ sensor nodes at deployment time. For mathematical tractability, we analyze a special (simple) case when $c04-math-0205$ , which reduces the Markov model to three states. The differential equations describing the WSN Markov model when $c04-math-0206$ are

4.26

Solving Eq. (4.26) with the initial conditions $c04-math-0208$ , $c04-math-0209$ , and $c04-math-0210$ yields

4.27

The WSN reliability is

4.28

The WSN MTTF when $c04-math-0213$ is

4.29

Discussion: Our hierarchical Markov modeling exploits the relationship between failure rates at the sensor component, sensor node, WSN cluster, and WSN levels. Markov modeling of sensor nodes (Section 4.4.2) takes the failure rate of the sensor node's components $c04-math-0215$ (e.g., temperature sensor) as input and then derives the failure rate of the sensor node $c04-math-0216$ (for simplicity, the failure rate of one component is considered for illustration; however, the failure rates of different components can be considered or combined into a single representative failure rate).

c04f008 — **Figure 4.8** WSN Markov model [130]

For an FT sensor node model (e.g., duplex sensor node or triple modular redundant sensor node), the average failure rate of the FT sensor node implicitly depends on $c04-math-0217$ since the fault detection algorithm's accuracy depends on $c04-math-0218$ (Sections 4.3 and 4.4.1.1). Hence, the failure rate of an FT sensor node is calculated for different $c04-math-0219$ values in our numerical results (Section 4.6). The WSN cluster Markov model relies on the failure rates calculated from the FT sensor node Markov model for different values of $c04-math-0220$ . For example, for a cluster with $c04-math-0221$ sensor nodes (Fig. 4.6) and assuming each sensor node has all other sensor nodes in the cluster as neighbors, the number of neighboring sensor nodes for a given sensor node is $c04-math-0222$ . Thus, the failure rate of an FT sensor node is calculated from the sensor node Markov model with $c04-math-0223$ , which is then used in the WSN cluster Markov model (Fig. 4.6). When a sensor node fails, the remaining operational (healthy) neighboring nodes of a given sensor node in a cluster become $c04-math-0224$ ; thus, the failure rate of an FT sensor node is calculated from the sensor node's Markov model with $c04-math-0225$ and so on. The cluster becomes nonoperational when the number of operational sensor nodes in the cluster reaches a minimum $c04-math-0226$ and consequently the overall failure rate of a WSN cluster with $c04-math-0227$ sensor nodes is calculated for that $c04-math-0228$ .

The WSN cluster failure rate with $c04-math-0229$ sensor nodes obtained from the WSN cluster Markov model is then used in the WSN Markov model (Section 4.4.4). The WSN is assumed to be consisting of $c04-math-0230$ clusters and WSN becomes nonoperational when the number of alive (operational) clusters reduces to $c04-math-0231$ . The WSN Markov model is then used to calculate the MTTF and reliability of a given WSN using the cluster failure rate calculated from the WSN cluster Markov model.

4.5 Simulation of Distributed Fault Detection Algorithms

In this section, we discuss our implementation of two DFD algorithms: the Chen and Ding algorithms (Section 4.3). Our analysis is challenging since it is difficult to compare fault detection algorithms without a standard benchmark (Section 4.7.2). Since different DFD algorithms can use different fault models, a comparison between DFD algorithms can be made by observing the performance of the algorithms under similar fault models and similar topologies.

4.5.1 Using ns $c04-math-0232$ 2 to Simulate Faulty Sensors

In prior work, the Chen and Ding algorithms were simulated strictly algorithmically using simplified fault models that approximated a Poisson process. To compare these two models as realistically as possible, we use ns $c04-math-0233$ 2 [101], which is a widely used discrete event simulator for both wired and wireless networks, and a custom fault model. ns $c04-math-0234$ 2 allows us to see how these algorithms perform atop a detailed network stack over a wireless channel. We have leveraged ns $c04-math-0235$ 2 instead of ns $c04-math-0236$ 3 as ns $c04-math-0237$ 2 provides better support for WSN and mobile ad hoc network (MANET) protocols [133].

Wireless entities in ns $c04-math-0238$ 2 are called nodes and have modules called agents representing network layers stacked atop the nodes. Agents are typically used to generate application traffic, but we use agents as a fault detection unit in a sensor node (Fig. 4.9). The agent first receives scheduled, scripted commands to read values from a SensorNodeData module. The SensorNodeData module generates random sensor readings based on our implemented fault models.

c04f009 — **Figure 4.9** The ns $c04-math-0239$ 2-based simulation architecture

After recording a data sample, the sensor node broadcasts the data as a packet and builds a record of each broadcasted packet that the sensor node receives from other sensor nodes. After a listening period, a separately scheduled event commands the detection agent to apply the two fault detection algorithms (i.e., the Ding and Chen algorithms). A fault tracker unit records statistics of detected and undetected errors. In the case of the Chen algorithm, each round of tendency broadcasting occurs in two-second intervals. The sensor readings are also broadcasted similarly in two-second intervals.

A duplex configuration was implemented for permanent errors. When a sensor node passes a threshold of detected errors, the sensor “fails” and is replaced by a spare sensor. For the simplified fault model, this replacement helps little because Poisson events are memoryless, and a sensor node with a new sensor and a sensor node with an old sensor will have the same error rate. However, for realistic data, constant errors are permanent and are repaired when the sensor is replaced. The realistic data usage also results in a penalty of false positives and causes more sensor nodes to be identified as faulty. The following sections describe our experiments with both simulated and real-world data.

4.5.2 Experimental Setup for Simulated Data

To compare the performance of each algorithm, it is useful to observe each algorithm using a simplified fault model; thus, we consider the scenario used by Chen et al. [90]. Each sensor records a uniformly random temperature between 70 $c04-math-0240$ and 75 $c04-math-0241$ . Errors occur as a Poisson process, with an arbitrary, scalable mean, and corrupt a single sensor on a single sensor node during the event. A corrupted sample is randomly distributed between 100 $c04-math-0242$ and 105 $c04-math-0243$ . We simulate each sensor node for 4 h and take samples every 30 s.

Fig. 4.10(a) and (b) depicts the detection performance of the Chen algorithm across different parameters and error rates. We observe that for the Chen algorithm, the error rate affects detection performance significantly; however, the $c04-math-0244$ value has a relatively minor impact. Fig. 4.11(a) and (b) depicts the detection performance of the Ding algorithm across different parameters and error rates. For the Ding algorithm, as $c04-math-0245$ is scaled, the number of false positives scales proportionately. For both algorithms, low error rates improve detection accuracy as low error rates make errors more distinguishable. In the simulated data scenario, the Chen algorithm performs better than the Ding algorithm in terms of the percentage of false positives, and the detection accuracy is relatively less dependent on $c04-math-0246$ . However, this simulated scenario is very specific and consistent, which may not always be the case.

c04f010 — **Figure 4.10** Effectiveness and false positive rate of the Chen algorithm: (a) error detection accuracy for the Chen algorithm; (b) false positive rate of Chen algorithm

c04f011 — **Figure 4.11** Effectiveness and false positive rate for the Ding algorithm: (a) error detection accuracy for the Ding algorithm; (b) false positive rate for the Ding algorithm

4.5.3 Experiments Using Real-World Data

Since transient faults are manifested in different ways, a more realistic error model that closely follows a real-world scenario is desirable for analysis. For our real-world data case study, we leverage publicly available sensor data from the Intel Berkeley Research Laboratory [134]. The data comes from an indoor sensor network deployment that measures temperature, humidity, and other metrics. Using temperature data from 53 sensor nodes, we have produced a variety of empirical observations on which we have based our simulation.

We note a few empirical observations from the Intel Berkeley data. One observation is that the real-world data includes noise. Noise in sensor readings is correlated with past values (autocorrelated) and can have high power levels for certain time periods. To measure the effect of noise on sensor data, we apply a denoising filter available in the MATLAB Wavelet toolkit to the Intel Berkeley data to approximate a reference “noise-free” signal. We measure the power level of noise in the data (Fig. 4.12) and record the values in a text file. We play back the recorded noise power level values during simulation, and multiply the noise values with simulated white Gaussian noise to imitate the bursty behavior of noisy faults.

c04f012 — **Figure 4.12** Noise power levels in the Intel Berkeley sample

Another observation from the Intel Berkeley data is that for nearly all sensor nodes, sensor samples level off as they approach their end-of-life, resulting in a “constant” fault. A histogram of the times at which this occurs (Fig. 4.13) reveals that the distribution of sensor node lifetimes fits well into a Student's $c04-math-0247$ -distribution, which is supported by MATLAB's distribution fitting tools. For our experiments, we have recorded a set of several thousand random numbers fitting the Student's $c04-math-0248$ -distribution. The sensor nodes then randomly select a value from the recorded $c04-math-0249$ -distribution values to select a time for the fault occurrence.

c04f013 — **Figure 4.13** Distribution of constant error occurrences

Finally, we observe that voltage spikes are not frequent in the Intel Berkeley data set, occurring in only 6 out of 53 nodes. Because of the rare occurrence of voltage spikes, our simulations inject a spike once during the lifetime of each sensor node. The time of the spike occurrence is modeled using a uniform process. The base signal is a sine wave with a period of 2 h between 20 and 30 $c04-math-0250$ C. To allow minor variations in temperature readings sensed by sensor nodes, the temperature values are treated as being generated from a radiating heat source. The nodes furthest from the center have a lower temperature proportional to the inverse of the distance squared as in the free space path loss model. The noise and spike errors are added to the signal, and the constant fault causes the signal to hang at the signal's last valid value during an error. A signal is considered erroneous if the variation is more than half a degree from the signal's true value. In our experiments, the faulty signals are never repaired unless faults accumulate, and ultimately the sensor is replaced if diagnosed as faulty.

Figure 4.14 shows the performance of the Chen algorithm for the real-world data. Despite detecting over 90% of the faults in the simplified model, the algorithm fails to perform well with the real-world data, detecting fewer than 50% of erroneous readings. The poor performance of the Chen algorithm on the real-world data is due to the nature of the algorithm. For an error to be detected, the error must be not only disruptive, causing a node to have different values from the node's neighbors, but also transient and have a sharp transition boundary. Since noise events last for multiple samples, the Chen algorithm is not equipped to handle these errors.

c04f014 — **Figure 4.14** Error detection and false positive rate for the Chen algorithm using real-world data

The Ding algorithm's performance with real-world data is shown in Fig. 4.15. Similar to the Chen algorithm, the Ding algorithm's performance degrades as compared to the previous simplified model environment. However, the Ding algorithm still detects a majority of occurring faults. As with the previous simplified model experiment, the Ding algorithm is highly dependent on the algorithm's parameters. The higher the detection rate is, the higher the rate of false positives is for the Ding algorithm; however, the false positive rate remains low, peaking at 20%.

c04f015 — **Figure 4.15** Error detection and false positive rate for the Ding algorithm using real-world data

4.6 Numerical Results

In this section, we present reliability and MTTF results for our Markov models (Section 4.4) implemented in the SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator) Software Package [135, 136]. SHARPE provides MTTF results directly based on our models' implementations; however, SHARPE does not provide reliability results directly, but reliability results can be calculated from state probabilities. We present example reliability calculations as well as detailed reliability and MTTF results for an NFT as well as an FT sensor node, WSN cluster, and WSN using our bottom-up Markov modeling paradigm.

4.6.1 Experimental Setup

In this section, we describe our FT sensor node, WSN cluster, and WSN model implementation in the SHARPE Software Package [135, 136]. We also present a typical fault detection algorithm's accuracy for different cumulative sensor failure probability $c04-math-0251$ values. Owing to SHARPE limitations that take only exponential polynomials, we assume our sensor failure rate to be exponentially distributed (Section 4.4.1.2).

SHARPE is a software tool for performance, reliability, and performability model specification and analysis. The SHARPE toolkit provides a specification language and solution methods for commonly used performance, reliability, and performability model types. Supported models of SHARPE include combinatorial models, state-space (e.g., Markov and semi-Markov reward models), and stochastic Petri nets. SHARPE allows computation of steady-state, transient, and interval measures. SHARPE allows output measures of a model to be used as parameters of other models to facilitate the hierarchical combination of different model types.

Our Markov model exploits the synergy of fault detection and fault tolerance for WSNs. Table 4.3 depicts $c04-math-0252$ values estimated for a DFD algorithm using Eq. (4.6) for different $c04-math-0253$ and $c04-math-0254$ values. These estimated $c04-math-0255$ values approximate the accuracy of the DFD algorithms proposed in [91, 102, 103, 106]. We note that any inaccuracy in the $c04-math-0256$ estimation does not affect our calculations of results because these calculations leverage $c04-math-0257$ values but are independent of whether these values reflect accurately a particular DFD algorithm's accuracy. We assume $c04-math-0258$ in Eq. (4.5) (i.e., once a faulty sensor is identified, the faulty sensor is replaced by a good spare sensor perfectly), which gives $c04-math-0259$ in Eq. (4.5).

Table 4.3 Estimated values for a fault detection algorithm's accuracy

$c04-math-0260$	$c04-math-0261$	$c04-math-0262$
$c04-math-0260$	$c04-math-0261$	5	6	7	10	15
0.05	25	0.979	1	1	1	1
0.1	50	0.858	0.895	0.921	0.957	0.96
0.2	56	0.755	0.717	0.81	0.845	0.851
0.3	65	0.652	0.679	0.699	0.732	0.742
0.4	66	0.558	0.5813	0.599	0.627	0.636
0.5	67	0.464	0.484	0.498	0.522	0.53
0.6	68	0.371	0.386	0.398	0.417	0.424
0.7	69	0.278	0.29	0.298	0.313	0.318
0.8	70	0.185	0.193	0.198	0.208	0.212
0.9	71	0.092	0.096	0.099	0.104	0.106
0.99	72	0.0092	0.00962	0.0099	0.0104	0.0106

4.6.2 Reliability and MTTF for an NFT and an FT Sensor Node

In this section, we present the reliability and MTTF results for an NFT and an FT sensor node for the two cases: $c04-math-0263$ and $c04-math-0264$ , when $c04-math-0265$ and $c04-math-0266$ . The case $c04-math-0267$ corresponds to a real-world fault detection algorithm (typical values are shown in Table 4.3), whereas the case $c04-math-0268$ corresponds to an ideal fault detection algorithm, which detects faulty sensors perfectly for all $c04-math-0269$ values. We calculate reliability at $c04-math-0270$ days since we assume $c04-math-0271$ days [137] in Eq. (4.8); however, we point out that reliability can be calculated for any other time value using our Markov models.

For an NFT sensor node reliability calculation, we require the sensor failure rate $c04-math-0272$ , which can be calculated using Eq. (4.8) (i.e., $c04-math-0273$ failures/day). SHARPE gives $c04-math-0274$ and sensor node reliability $c04-math-0275$ . Evaluating $c04-math-0276$ at $c04-math-0277$ gives $c04-math-0278$ = 0.94999.

For an FT sensor node when $c04-math-0279$ , different reliability results are obtained for different $c04-math-0280$ because the fault detection algorithm's accuracy and coverage factor $c04-math-0281$ depend on $c04-math-0282$ . For $c04-math-0283$ , $c04-math-0284$ (Table 4.3), SHARPE gives $c04-math-0285$ = $c04-math-0286$ and $c04-math-0287$ = $c04-math-0288$ . The reliability $c04-math-0289$ = $c04-math-0290$ , which when evaluated at $c04-math-0291$ = 100 gives $c04-math-0292$ = 0.99770. For an FT sensor node when $c04-math-0293$ for all $c04-math-0294$ , SHARPE gives $c04-math-0295$ and $c04-math-0296$ . Using Eq. (4.16), the reliability $c04-math-0297$ = 0.99872.

Table 4.4 shows the reliability for an NFT and an FT sensor node evaluated at $c04-math-0298$ days for $c04-math-0299$ values of 5, 10, and 15. As expected, the results show that reliability decreases for both NFT and FT sensor nodes as $c04-math-0300$ increases, and the reliability attained by an FT sensor node (both for $c04-math-0301$ and $c04-math-0302$ ) is always better than an NFT sensor node. For example, the percentage reliability improvement achieved by an FT sensor node with $c04-math-0303$ , $c04-math-0304$ over an NFT sensor node is 18.98% when $c04-math-0305$ . However, an FT sensor node with $c04-math-0306$ outperforms both an FT sensor node with $c04-math-0307$ and an NFT sensor node for all $c04-math-0308$ values. For example, the percentage improvement in reliability for an FT sensor node with $c04-math-0309$ over an NFT sensor node is 69.09%, and the percentage improvements in reliability for an FT sensor node with $c04-math-0310$ over an FT sensor node ( $c04-math-0311$ ) with $c04-math-0312$ equal to 5, 10, and 15 are 28.11%, 24.32%, and 23.82%, respectively, for $c04-math-0313$ . The percentage improvement in reliability attained by an FT sensor node with $c04-math-0314$ over an NFT sensor node is 230%, and the percentage improvements in reliability for an FT sensor node with $c04-math-0315$ over an FT sensor node ( $c04-math-0316$ ) with $c04-math-0317$ equal to 5, 10, and 15 are 172.36%, 166.31%, and 165.32%, respectively, for $c04-math-0318$ . These results reveal that the percentage improvement in reliability attained by an FT sensor node with $c04-math-0319$ increases as $c04-math-0320$ because for an FT sensor node with $c04-math-0321$ , $c04-math-0322$ decreases as $c04-math-0323$ (Table 4.3). The sensor node reliability analysis reveals that a robust fault detection algorithm with $c04-math-0324$ for all $c04-math-0325$ and $c04-math-0326$ values is necessary to attain good reliability for an FT sensor node.

c04-math-0328 — **Table 4.4** Reliability for an NFT and an FT sensor node

Figure 4.16 depicts the MTTF for an NFT and an FT sensor node for $c04-math-0333$ values of 5, 10, and 15 versus the sensor failure probability $c04-math-0334$ when $c04-math-0335$ in Eq. (4.7) is 100 days [91, 106]. The results show that the MTTF for an FT sensor node improves with increasing $c04-math-0336$ . However, the MTTF shows negligible improvement when $c04-math-0337$ over $c04-math-0338$ as the fault detection algorithm's accuracy improvement gradient (slope) decreases for large $c04-math-0339$ values.

c04f016 — **Figure 4.16** MTTF in days for an NFT and an FT sensor node [130]

Figure 4.16 also compares the MTTF for an FT sensor node when $c04-math-0340$ . When $c04-math-0341$ for existing fault detection algorithms, comparison with $c04-math-0342$ provides insight into how the fault detection algorithm's accuracy affects the sensor node's MTTF. Figure 4.16 shows that the MTTF for an FT sensor node with $c04-math-0343$ is always greater than an FT sensor node with $c04-math-0344$ .

Figure 4.16 shows that the MTTF for an NFT and an FT sensor node decreases as $c04-math-0345$ increases; however, the FT sensor node maintains better MTTF than the NFT sensor node for all $c04-math-0346$ values. This observation reveals that a sensor with a lower failure probability achieves better MTTF for both the NFT and FT sensor nodes. Figure 4.16 also shows that the MTTF for both the NFT and the FT sensor nodes approaches zero as $c04-math-0347$ approaches 1 (i.e., MTTF $c04-math-0348$ 0 $c04-math-0349$ 1). This observation is intuitive because a faulty sensor (with $c04-math-0350$ ) is unreliable and leads to a failed sensor node with zero MTTF and suggests that depending on the application's reliability and MTTF requirements, the faulty sensor should be replaced before $c04-math-0351$ approaches 1.

Table 4.5 depicts the percentage MTTF improvement gained by an FT sensor node over an NFT sensor node for different values of $c04-math-0352$ . We calculate the percentage improvement as $c04-math-0353$ . The table shows that the MTTF percentage improvement for an FT sensor node decreases as $c04-math-0354$ increases when $c04-math-0355$ . The percentage MTTF improvement for an FT sensor node with $c04-math-0356$ and $c04-math-0357$ are 85.75% and 95.65%, respectively, for $c04-math-0358$ and drops to 0.875% and 1.34%, respectively, for $c04-math-0359$ . This observation reveals that having more neighboring sensor nodes (a higher $c04-math-0360$ value) improves MTTF because, in general, a fault detection algorithm's accuracy and $c04-math-0361$ improve with increasing $c04-math-0362$ . Table 4.5 also shows that the MTTF percentage improvement for an FT sensor node over an NFT sensor node is 100% on average when $c04-math-0363$ , thus highlighting the importance of a robust fault detection algorithm.

c04-math-0365 — **Table 4.5** Percentage MTTF improvement for an FT sensor node as compared to an NFT sensor node

Model Validation: We compare the MTTF results for an NFT sensor node obtained using our Markov model with other existing models in the literature as well as practical sensor node implementations to provide insights into the accuracy of our models. We note that a detailed comparison and verification of our model with existing models is not possible because there are no similar/related models that leverage FT constructs. Jung et al. [71] modeled the lifetime of trigger-driven sensor nodes with different event arrival rates (trigger-driven sensor nodes perform processing based on events as opposed to duty-cycle-based sensor nodes that perform processing based on a duty cycle), which can be interpreted as sensing events that depend on the sensor node's sensing frequency. Results were obtained for three trigger-driven sensor node platforms: XSM, MICAz, and Telos. XSM and MICAz consisted of an ATmega128L 8-bit processor with a maximum processor frequency of 16 MHz and a maximum performance of 16 million instructions per second (MIPS). Telos consisted of a TI MSP430 16-bit processor with a maximum processor frequency of 8 MHz and a maximum performance of 16 MIPS. Results revealed that Telos, XSM, and MICAz delivered lifetimes of 1500, 1400, and 600 days, respectively, with an event arrival rate of 0.1 events/h, which dropped to 350, 100, and 0 days, respectively, with an event arrival rate of 100 events/h. The lifetime values for an NFT sensor node obtained from our Markov model ranges from 949 to 22 days for different sensor failure probabilities. Depending on the sensor node energy consumption, the lifetime model proposed in [70] estimated that the sensor node lifetime varied from 25 to 215 days. These comparisons indicate that our Markov model yields sensor node lifetime values in the range obtained from other existing models.

We also compare the NFT sensor node lifetime estimation from our model with an experimental study on sensor node lifetimes [74]. This comparison verifies conformity of our model with real measurements. For example, with a sensor node battery capacity of 2500 mA h, experiments indicate a sensor node lifetime ranging from 72 to 95 h for a 100% duty cycle for different battery brands (e.g., Ansmann, Panasonic Industrial, Varta High Energy, Panasonic Extreme Power) [74]. The sensor node lifetime for a duty cycle of 18% can be estimated as 95/0.18 = 528 h $c04-math-0368$ 22 days and for a duty cycle of 0.42% as 95/0.0042 = 22,619 h $c04-math-0369$ 942 days. This comparison reveals that our NFT sensor node lifetime model estimates lifetime values in the range obtained by experimental sensor node implementations.

4.6.3 Reliability and MTTF for an NFT and an FT WSN Cluster

In this section, we present the reliability and MTTF results for two WSN clusters that contain on average $c04-math-0370$ and $c04-math-0371$ sensor nodes at deployment time ( $c04-math-0372$ ). The selection of two different $c04-math-0373$ values provides insight into how the size of the cluster affects reliability and MTTF (other $c04-math-0374$ and $c04-math-0375$ values depicted similar trends). The NFT WSN cluster consists of NFT sensor nodes and the FT WSN cluster consists of FT sensor nodes. For brevity, we present example WSN cluster reliability calculations for $c04-math-0376$ ( $c04-math-0377$ ) and $c04-math-0378$ .

Leveraging our bottom-up approach model, the NFT WSN cluster uses the failure rate $c04-math-0379$ of an NFT sensor node. By using Eq. (4.8), $c04-math-0380$ = $c04-math-0381$ failures/day. Since the coverage factor $c04-math-0382$ does not appear in the reliability (and MTTF) calculation for an NFT sensor node, the failure rate $c04-math-0383$ = $c04-math-0384$ = $c04-math-0385$ = $c04-math-0386$ failures/day. In other words, for an NFT sensor node, $c04-math-0387$ = $c04-math-0388$ , $c04-math-0389$ , and $c04-math-0390$ = $c04-math-0391$ , which denote the sensor node failure rate when the average number of neighboring sensor nodes are 4 and 5, respectively. The NFT WSN cluster reliability calculation utilizes $c04-math-0392$ and $c04-math-0393$ , which gives $c04-math-0394$ = $c04-math-0395$ and $c04-math-0396$ = $c04-math-0397$ . The reliability $c04-math-0398$ = $c04-math-0399$ + $c04-math-0400$ , which when evaluated at $c04-math-0401$ days gives $c04-math-0402$ .

For an FT WSN cluster when $c04-math-0403$ , the reliability calculation requires $c04-math-0404$ and $c04-math-0405$ , which depends on $c04-math-0406$ as the fault detection algorithm's accuracy and $c04-math-0407$ depends on $c04-math-0408$ , yielding different reliability results for different $c04-math-0409$ values. Using Table 4.3 and Eq. (4.18), $c04-math-0410$ = $c04-math-0411$ = $c04-math-0412$ = $c04-math-0413$ = $c04-math-0414$ failures/day and $c04-math-0415$ = $c04-math-0416$ = $c04-math-0417$ = $c04-math-0418$ = $c04-math-0419$ failures/day. SHARPE gives $c04-math-0420$ = $c04-math-0421$ and $c04-math-0422$ . The reliability of the FT WSN cluster is calculated as $c04-math-0423$ = 0.95969.

For an FT WSN cluster when $c04-math-0424$ , the reliability calculation does not depend on $c04-math-0425$ , which gives $c04-math-0426$ failures/day and $c04-math-0427$ failures/day. SHARPE gives $c04-math-0428$ and $c04-math-0429$ , from which we calculate $c04-math-0430$ .

Table 4.6 shows the reliability for an NFT and an FT WSN cluster (for $c04-math-0431$ and $c04-math-0432$ ) evaluated at $c04-math-0433$ days when $c04-math-0434$ ( $c04-math-0435$ ). We observe similar trends as with sensor node reliability (Table 4.4) where the reliability of both NFT and FT WSN clusters decreases as $c04-math-0436$ increases (i.e., reliability $c04-math-0437$ ) because decreased individual sensor node reliability decreases the overall WSN cluster reliability. Table 4.6 depicts that an FT WSN cluster with $c04-math-0438$ outperforms an FT WSN cluster with $c04-math-0439$ and an NFT WSN cluster for all $c04-math-0440$ values. For example, the percentage improvement in reliability for an FT WSN cluster with $c04-math-0441$ over an NFT WSN cluster and an FT WSN cluster with $c04-math-0442$ is 77.33% and 12.02% for $c04-math-0443$ and 601.12% and 142.73% for $c04-math-0444$ , respectively. These results show that the percentage improvement in reliability attained by an FT WSN cluster increases as $c04-math-0445$ increases because a fault detection algorithm's accuracy and $c04-math-0446$ decrease as $c04-math-0447$ increases (Table 4.3). This trend is similar to the percentage improvement in reliability for an FT sensor node (Section 4.6.2). The results show that an FT WSN cluster always performs better than an NFT WSN cluster. For example, the percentage improvement in reliability for an FT WSN cluster with $c04-math-0448$ over an NFT WSN cluster is 58.31% for $c04-math-0449$ .

**Table 4.6** Reliability for an NFT and an FT WSN cluster when $c04-math-0450$ ( $c04-math-0451$ )

Figure 4.17 depicts the MTTF for an NFT WSN cluster and an FT WSN cluster versus $c04-math-0462$ when $c04-math-0463$ for average cluster sizes of $c04-math-0464$ and $c04-math-0465$ sensor nodes, respectively, at deployment time. The figure reveals that the FT WSN cluster's MTTF is considerably greater than the NFT WSN cluster's MTTF for both cluster sizes. Figure 4.17 also compares the MTTF for an FT WSN cluster when $c04-math-0466$ with $c04-math-0467$ and shows that the MTTF for an FT WSN cluster with $c04-math-0468$ is always better than an FT WSN cluster with $c04-math-0469$ . This observation again verifies the significance of a robust (ideal) fault detection algorithm, which can ideally provide $c04-math-0470$ values close to 1 (i.e., $c04-math-0471$ for all $c04-math-0472$ values). We note that both an NFT and an FT WSN clusters with $c04-math-0473$ have redundant sensor nodes and can inherently tolerate $c04-math-0474$ sensor node failures. The WSN cluster with $c04-math-0475$ has more redundant sensor nodes than the WSN cluster with $c04-math-0476$ , resulting in a comparatively greater MTTF. Figure 4.17 shows that the MTTF for both an NFT and an FT WSN clusters approaches zero as $c04-math-0477$ approaches 1 and further solidifies the necessity of low failure probability sensors to achieve better MTTF in WSN clusters. The MTTF variation for a WSN cluster with varying $c04-math-0478$ is similar to the MTTF variation for a sensor node (Fig. 4.16) because a WSN cluster is composed of sensor nodes and reflects the MTTF variation of WSN cluster's constituent sensor nodes.

c04f017 — **Figure 4.17** MTTF in days for an NFT WSN cluster and an FT WSN cluster with $c04-math-0479$ [130]

Table 4.7 shows the percentage MTTF improvement for an FT WSN cluster as compared to an NFT WSN cluster for cluster sizes of $c04-math-0485$ and $c04-math-0486$ sensor nodes. The percentage improvements are calculated separately for the two cluster sizes and we compare the MTTFs of clusters of the same size. We observe that the MTTF improvement for an FT cluster is slightly greater when $c04-math-0487$ as compared to when $c04-math-0488$ ; however, the MTTF improvement decreases with increasing $c04-math-0489$ when $c04-math-0490$ . This observation reveals that a WSN cluster consisting of more sensor nodes can achieve better MTTF improvements when compared to a WSN cluster containing fewer sensor nodes. The MTTF percentage improvement for an FT WSN cluster with $c04-math-0491$ , $c04-math-0492$ , is 83.04% for $c04-math-0493$ and drops to 2.26% for $c04-math-0494$ . Similarly, the percentage MTTF improvement for an FT WSN cluster with $c04-math-0495$ , $c04-math-0496$ , is 87.55% for $c04-math-0497$ and drops to 2.47% for $c04-math-0498$ . The percentage MTTF improvement for both cluster sizes is 100% on average when $c04-math-0499$ (results are not shown for $c04-math-0500$ when $c04-math-0501$ in Table 4.7 for brevity). We observed that the MTTF percentage improvement for an FT WSN cluster with $c04-math-0502$ over $c04-math-0503$ is 103.35% on average. Intuitively, these results confirm that clusters with more redundant sensor nodes are more reliable and have a better MTTF than clusters with fewer redundant sensor nodes.

Table 4.7 Percentage MTTF improvement for an FT WSN cluster as compared to an NFT WSN cluster ( $c04-math-0480$ )

Probability	FT	FT	FT
$c04-math-0481$	$c04-math-0482$	$c04-math-0483$	$c04-math-0484$
0.1	83.04	87.55	100
0.2	73.84	77.24	100
0.3	63.58	66.51	100
0.5	44.97	47.22	100
0.7	26.4	28.71	100
0.9	9.81	9.57	100
0.99	2.26	2.47	100

Since SHARPE provides information on MTTF directly from Markov models whereas reliability calculation requires manual substitution in state transition probabilities, we present iso-MTTF results for WSN clusters. Table 4.8 depicts the MTTFs for an FT WSN cluster and an NFT WSN cluster for $c04-math-0504$ varying from 0.1 to 0.99 and when $c04-math-0505$ = 4. Results indicate that an NFT WSN cluster requires three redundant NFT sensor nodes to achieve a comparable MTTF as that of an FT WSN cluster.

Table 4.8 Iso-MTTF for WSN clusters ( $c04-math-0506$ ). $c04-math-0507$ denotes the redundant sensor nodes required by an NFT WSN cluster to achieve a comparable MTTF as that of an FT WSN cluster

Probability	MTTF FT (days)	MTTF NFT (days)	NFT
$c04-math-0508$	$c04-math-0509$	$c04-math-0510$	$c04-math-0511$
0.1	695.89	707.43	3
0.3	205.3	208.86	3
0.5	105.97	107.6	3
0.7	61.213	62.136	3
0.99	15.942	16.209	3

4.6.4 Reliability and MTTF for an NFT and an FT WSN

This section presents the reliability and MTTF results for an NFT WSN and an FT WSN containing $c04-math-0512$ and $c04-math-0513$ clusters at deployment time ( $c04-math-0514$ ). We consider WSNs containing different number of clusters to provide an insight into how the number of clusters affects WSN reliability and MTTF (other $c04-math-0515$ and $c04-math-0516$ values depicted similar trends). The FT WSN contains FT sensor nodes, and the NFT WSN contains NFT sensor nodes. We present example WSN reliability calculations for $c04-math-0517$ ( $c04-math-0518$ , i.e., each WSN fails when there are no more active clusters) and $c04-math-0519$ = 0.2. We assume that both WSNs contain clusters with $c04-math-0520$ ( $c04-math-0521$ ) = 9 sensor nodes on average with a cluster failure rate $c04-math-0522$ .

The reliability calculation for an NFT WSN requires the NFT WSN cluster failure rate $c04-math-0523$ , which can be calculated using Eq. (4.24) with $c04-math-0524$ (i.e., $c04-math-0525$ = $c04-math-0526$ failures/day). Using $c04-math-0527$ , SHARPE gives $c04-math-0528$ and $c04-math-0529$ . The WSN reliability $c04-math-0530$ when evaluated at $c04-math-0531$ days gives $c04-math-0532$ .

For an FT WSN when $c04-math-0533$ , the reliability calculation requires the FT WSN cluster failure rate $c04-math-0534$ (for $c04-math-0535$ ) (i.e., $c04-math-0536$ failures/day). Using $c04-math-0537$ , SHARPE gives $c04-math-0538$ , $c04-math-0539$ . The WSN reliability $c04-math-0540$ .

For an FT WSN cluster when $c04-math-0541$ , the reliability calculation requires the FT WSN cluster failure rate $c04-math-0542$ (for $c04-math-0543$ ) (i.e., $c04-math-0544$ failures/day). Using $c04-math-0545$ , SHARPE gives $c04-math-0546$ , $c04-math-0547$ , which gives $c04-math-0548$ .

Table 4.9 shows the reliability for an NFT WSN and an FT WSN evaluated at $c04-math-0556$ days when $c04-math-0557$ ( $c04-math-0558$ ) for clusters with nine sensor nodes on average. We observe similar trends as with sensor node reliability (Table 4.4) and WSN cluster reliability (Table 4.6) where reliability for both an NFT WSN and an FT WSN decreases as $c04-math-0559$ increases (i.e., reliability $c04-math-0560$ ) because a WSN contains clusters of sensor nodes and decreased individual sensor node reliability, with increasing $c04-math-0561$ decreases both WSN cluster and WSN reliability. Table 4.9 shows that an FT WSN with $c04-math-0562$ outperforms an FT WSN with $c04-math-0563$ and an NFT WSN for all $c04-math-0564$ values. For example, the percentage improvement in reliability for an FT WSN with $c04-math-0565$ over an NFT WSN and an FT WSN with $c04-math-0566$ is 5.1% and 0.51% for $c04-math-0567$ and 350.18% and 236.22% for $c04-math-0568$ , respectively. These results show that the percentage improvement in reliability attained by an FT WSN increases as $c04-math-0569$ increases because the fault detection algorithm's accuracy and $c04-math-0570$ decreases as $c04-math-0571$ increases (Table 4.3). This trend is similar to the percentage improvement in reliability for an FT sensor node (Section 4.6.2) and an FT WSN cluster (Section 4.6.3). The results show that an FT WSN always performs better than NFT WSN. For example, the percentage improvement in reliability for an FT WSN with $c04-math-0572$ over an NFT WSN is 4.57% for $c04-math-0573$ .

**Table 4.9** Reliability for an NFT WSN and an FT WSN when $c04-math-0549$ ( $c04-math-0550$ )

Figure 4.18 depicts the MTTF for an NFT WSN and an FT WSN containing on average $c04-math-0574$ and $c04-math-0575$ clusters ( $c04-math-0576$ ) at deployment time. The figure reveals that an FT WSN improves the MTTF considerably over an NFT WSN for both number of clusters. Figure 4.18 also shows that the MTTF for an FT WSN when $c04-math-0577$ is always greater than the MTTF for an FT WSN when $c04-math-0578$ . We observe that since the MTTF for an FT WSN drops nearly to that of an NFT WSN as $c04-math-0579$ , building a more reliable FT WSN requires low failure probability sensors. This WSN MTTF variation with $c04-math-0580$ follows similar trends as observed in WSN clusters (Fig. 4.17) and sensor nodes (Fig. 4.16). Similar MTTF variations for a WSN, WSN cluster, and a sensor node results from our bottom-up modeling approach (Section 4.4), where each hierarchical level captures the MTTF variation trends of lower levels. We observe that the MTTF for a WSN with $c04-math-0581$ is always greater than the MTTF for a WSN with $c04-math-0582$ . This observation is intuitive because WSNs with $c04-math-0583$ have more redundant WSN clusters (and sensor nodes) and can survive more cluster failures before reaching the failed state ( $c04-math-0584$ ) as compared to a WSN with $c04-math-0585$ .

c04f018 — **Figure 4.18** MTTF in days for an NFT WSN and an FT WSN with $c04-math-0586$ [130]

We observe the percentage MTTF improvement for an FT WSN over an NFT WSN containing on average $c04-math-0587$ and $c04-math-0588$ clusters. We observe that the MTTF improvement for both number of clusters decreases with increasing $c04-math-0589$ when $c04-math-0590$ (similar trend as for a WSN cluster and for a sensor node). The MTTF percentage improvement for an FT WSN with $c04-math-0591$ , $c04-math-0592$ , is 87.56% for $c04-math-0593$ and drops to 3.3% for $c04-math-0594$ . Similarly, the MTTF percentage improvement for an FT WSN with $c04-math-0595$ , $c04-math-0596$ , is 88.2% for $c04-math-0597$ and drops to 3.26% for $c04-math-0598$ . We observe that the MTTF improvement for an FT WSN with $c04-math-0599$ is 100% on average for all $c04-math-0600$ values and is greater than an FT WSN with $c04-math-0601$ . The MTTF percentage improvement for an FT WSN with $c04-math-0602$ over an FT WSN with $c04-math-0603$ is 52.22% on average.

We also investigate iso-MTTF for WSNs. Table 4.10 depicts the MTTFs for an FT WSN and an NFT WSN for $c04-math-0604$ varying from 0.1 to 0.99 and when $c04-math-0605$ = 0. Results indicate that an NFT WSN requires nine redundant NFT WSN clusters to achieve a comparable MTTF as that of an FT WSN.

Table 4.10 Iso-MTTF for WSNs ( $c04-math-0606$ ). $c04-math-0607$ denotes the redundant WSN clusters required by an NFT WSN to achieve a comparable MTTF as that of an FT WSN

Probability	MTTF FT (days)	MTTF NFT (days)	NFT
$c04-math-0608$	$c04-math-0609$	$c04-math-0610$	$c04-math-0611$
0.1	2121.6	2135.7	9
0.3	627.62	631.77	9
0.5	323.28	326.12	9
0.7	186.8	188.74	9
0.99	48.387	48.71	9

4.7 Research Challenges and Future Research Directions

WSNs are susceptible to a plethora of different fault types and external attacks after the WSN's deployment. Fault detection and tolerance in WSNs is paramount especially for safety-critical WSNs such as military target applications, biomedical applications (body sensor networks), and volcanic eruption detection systems. Although there has been research done in fault diagnosis and FT in WSNs, various issues remain to be addressed. This section highlights research challenges and future research directions for development and deployment of reliable and trustworthy WSNs.

4.7.1 Accurate Fault Detection

Our Markov modeling results in Section 4.6 reveal that the fault detection algorithm's accuracy plays a crucial role in realizing FT WSNs. An accuracy close or equal to 100% and a false alarm rate close or equal to 0% is desirable for fault detection algorithms. There is a need to develop novel fault detection algorithms and/or improve the existing fault detection algorithms with the objective of attaining $c04-math-0612$ 100% fault detection accuracy.

4.7.2 Benchmarks for Comparing Fault Detection Algorithms

There is a need to develop standard benchmarks for comparing different fault detection algorithms for various fault models and topologies. The development of these benchmarks would facilitate designers to select an appropriate fault detection algorithm for WSNs with given FT requirements.

4.7.3 Energy-Efficient Fault Detection and Tolerance

Since sensor nodes have stringent resource constraints, fault detection and FT approaches for WSNs need to be energy efficient. Although initiatives have been taken in developing DFD approaches to minimize energy overhead, the area requires further research. Moreover, FT techniques developed for WSNs need to be energy efficient. The impact of added redundancy in sensor nodes to provide FT not only has economic cost overhead but can also have power consumption overhead, which needs to be considered and minimized during the design. Energy efficiency in fault detection and tolerance can also be achieved by exploiting near-threshold computing (NTC) in sensor node design. NTC refers to using a supply voltage ( $c04-math-0613$ ) that is close to a single transistor's threshold voltage $c04-math-0614$ (generally $c04-math-0615$ is slightly above $c04-math-0616$ in near-threshold operation, whereas $c04-math-0617$ is below $c04-math-0618$ for subthreshold operation). Lowering the supply voltage reduces power consumption and increases energy efficiency by lowering the energy consumed per operation. The widespread adoption of NTC in sensor node designs for reduced power consumption requires addressing NTC challenges such as increased process, voltage, and temperature variations, subthreshold leakage power, and soft error rates.

4.7.4 Machine-Learning-Inspired Fault Detection

Machine learning-based approaches can be leveraged to classify normal and anomalous data. Since sensor nodes are resource constrained, innovative approaches are required to extract features from the sensed data. The machine learning techniques for sensor nodes need to reduce the training data size for the classifier (e.g., SVM) based on extracted features. The goal of machine-learning-inspired fault detection is to classify normal and anomalous data in or near real time. There exist some initiatives in machine-learning-inspired fault detection [110]; however, further work is required in this domain for accurate and energy-efficient fault detection.

4.7.5 FT in Multimedia Sensor Networks

Fault detection and tolerance in WMSNs is a relatively new research field [115]. FT data aggregation in WMSNs is imperative to reduce the throughput of data transmission, energy conservation, accuracy of event detection, and alleviate the interference from compromised nodes. Another research challenge in WMSNs is to ensure the trustworthiness of aggregated results. Since multimedia data sensed by sensor nodes require compute-intensive processing, multicore-based design of sensor nodes could meet the performance, energy, and FT requirements of WMSNs. Accurate fault detection, FT, reliability, and security in WMSNs require further research endeavors.

4.7.6 Security

Security is critical for many WSNs, such as military target tracking and biomedical applications. Security aspects are crucial to consider in developing a reliable, robust, and trustworthy WSNs. Since sensor nodes use wireless communication, WSNs are susceptible to eavesdropping, message injection, replay, and other attacks. In many situations, an adversary is capable of deploying malicious nodes in the WSN or compromising some legitimate sensor nodes. Since the sink node is a gateway between a WSN and the outside world, compromising the sink node could render the entire WSN useless. Hence, security integration in sink nodes is imperative. The security integration objective in WSNs is to provide confidentiality, integrity, authenticity, and availability of all messages in the presence of adversaries. In a secure and reliable WSN, every eligible receiver should receive all messages intended for that receiver. Furthermore, the receiver should be able to verify the integrity of each message as well as the identity of the sender.

Integration of security and privacy in sensor nodes is challenging because of the sensor nodes' resource constraints in terms of computation, communication, memory/storage, and energy supply. The security mechanisms for prevalent end-to-end systems (e.g., secure socket layer (SSL)) are not directly applicable to WSNs. In traditional end-to-end systems, it is neither necessary nor desirable for the contents of the message (except the message headers) to be available to the intermediate routers. On the contrary, in WSNs, the dominant traffic pattern is many-to-one (i.e., from sensor nodes to the sink node) [138]. Each intermediate sensor node in a WSN also needs access to message contents to perform in-network processing, such as data aggregation and data compression, to conserve energy. Security incorporation in WSNs has several aspects, some of which are discussed below.

Time Synchronization: Due to the collaborative nature of sensor nodes, secure time synchronization is important for various WSN operations, such as coordinated sensing tasks, sensor scheduling, target tracking, data aggregation, and time division multiple access (TDMA) medium access control. The network time protocol (NTP), which is used for synchronization in the Internet, cannot be directly used by WSNs due to a sensor node's constrained resources. Sometimes, synchronization protocols for WSNs have been proposed in the literature such as reference-broadcast synchronization (RBS) [139] and timing-sync protocol for sensor networks (TPSN) [140]; however, security is not considered in the design of these synchronization protocols.

Sensor Location Discovery: Sensor node location is important for many WSN applications, such as target tracking and environmental monitoring. The geographical routing protocols developed for WSNs also require sensor node location information to make routing decisions. Sensor location discovery protocols make use of special nodes known as beacon nodes, which are assumed to know their own location via global positioning system (GPS) or manual configuration. Nonbeacon nodes receive radio signals (reference signals) from beacon nodes. These reference signals contain the location of beacon nodes. The nonbeacon nodes then determine their own location based on features of the reference messages, such as using received signal strength indicator (RSSI). Without any protection, the adversary has the potential to mislead the location estimation at sensor nodes. For example, an attacker may provide incorrect location references by replaying the beacon signals intercepted in different locations. A more serious attack launched by an adversary could be compromising a beacon node and distributing malicious location references by manipulating the beacon signals (e.g., changing the signal strength if RSSI is used to estimate the distance). In the presence of such attacks, nonbeacon nodes will determine their location incorrectly. There are a few methods to counter these attacks and to increase the reliability and trustworthiness of WSN. For instance, voting-based location references use voting of different received location references from different beacon nodes; however, this technique would increase the cost of the WSN as the approach would require more beacon nodes equipped with GPS.

Secure Routing: An adversary can also impact routing within a WSN, which consequently degrades the reliability of the WSN. Secure routing protocols for MANETs or wired networks are not suitable for WSNs because these protocols are computationally expensive and are meant for routing between any pair of nodes and not for the many-to-one traffic pattern prevalent in WSNs.

Intrusion Detection: Most of the existing WSNs are designed without considering security and intrusion detection. WSNs are more prone to malicious intruders as compared to traditional wired networks due to an open medium, a low degree of physical security, the dynamic topology, and a limited power supply. Security vulnerabilities in WSNs can be alleviated using an intrusion detection system (IDS). Intrusion detection is a security technology that attempts to identify those who are trying to break into and misuse a system without authorization as well as those who have legitimate access to the system but are abusing their privileges [141]. By modeling the behavior of normal sensor node activities and operations, an IDS can identify potential intruders and maintain trustworthy operation of a WSN. IDSs are of two types: misuse-based detection and anomaly-based detection. A misuse-based IDS encodes known attack signatures and vulnerabilities and stores this information in a database. An alarm is generated by the IDS if a discrepancy exists between the current activities and stored signatures. A misuse-based IDS cannot detect novel attacks because of the lack of corresponding signatures. An anomaly-based IDS creates normal profiles of sensor node behavior and compares them with current activities. If a significant deviation in a sensor node's behavior is observed, the anomaly-based IDS raises an alarm. An anomaly-based IDS can detect unknown attacks; however, developing profiles of a sensor node's normal behavior is challenging.

4.7.7 WSN Design and Tuning for Reliability

WSNs are more susceptible to failure than wired networks due to a WSN's deployment in unattended and often hostile environments. The vulnerability of sensor nodes to failures and attacks from malicious intruders requires reliability and security incorporation in the design of sensor nodes and WSNs. For instance, in the case of failures of some sensor nodes or some sensor nodes being compromised by an adversary, a WSN should be able to detect these failures and then adapt accordingly to remain operational. For example, if a few nodes are diagnosed as misbehaving or failed, routing protocol should be able to adapt to the situation and find alternative routes. This adaptation may require some sensor nodes to increase their transmission energy to get their sensed information to healthy nodes. Furthermore, the frequency of packet transmission can be reduced in such situations to conserve energy of sensor nodes to compensate for the increased transmission energy. This tuning of sensor nodes and the WSN would maintain the reliable operation of WSN although at a reduced efficiency. Moreover, the WSN manager should be alarmed of the situation to take necessary actions to increase the efficiency of the WSN, including removal of identified malicious nodes and placement of new healthy sensor nodes in the WSN.

4.7.8 Novel WSN Architectures

To maintain reliable and secure operation of WSNs in the presence of failures and malicious intruders, innovative WSN architectures are desired. For instance, a heterogeneous hierarchical multicore embedded wireless sensor network (MCEWSN) can better meet performance and reliability requirements of WSN [142]. The heterogeneity in the architecture subsumes the integration of numerous single-core embedded sensor nodes and several multicore embedded sensor nodes. A hierarchical architecture comprising of various clusters and a sink node is appropriate for large WSNs since in small WSNs, sensor nodes can send the sensed data directly to the sink node. Each cluster comprises several leaf sensor nodes and a cluster head. Leaf sensor nodes embed a single-core processor and are responsible for sensing, preprocessing, and transmitting the sensed data to the cluster head nodes. Cluster head nodes embed a multicore processor and are responsible for coalescing/fusing the data received from leaf sensor nodes for transmission to the sink node. Multicore embedded sensor nodes enable energy savings over conventional single-core embedded sensor nodes in two ways. First, multicore embedded sensor nodes reduce the energy expended in communication by performing in situ computation of sensed data and transmitting only processed information. Second, a multicore embedded sensor node permits the computations to be split across multiple cores while running each core at a lower processor voltage and frequency, as compared to a single-core system, which results in energy savings. Multicore cluster heads are also amenable for supervising a fault diagnosis algorithm within the cluster. Additionally, security primitives could be integrated in these multicore cluster heads to thwart attacks from malicious intruders or malicious sensor nodes.

4.8 Chapter Summary

This chapter provided a comprehensive research on modeling and analysis of fault detection and tolerance in WSNs. To elucidate fault detection in WSNs, we provided a taxonomy for fault diagnosis in WSNs. We simulated prominent fault detection algorithms using ns $c04-math-0619$ 2 to determine the algorithms' accuracies and false alarm rates. We further analyzed the effectiveness of fault detection algorithms under conditions modeled from real-world data. We proposed an FT duplex sensor node model based on the novel concept of determining the coverage factor using a sensor node's fault detection algorithm's accuracy. We developed comprehensive Markov models that hierarchically encompass sensor nodes, WSN clusters, and the overall WSN. Our Markov models characterized WSN reliability and MTTF for different sensor failure probabilities. Our models could assist WSN designers to better meet application requirements by determining the reliability and MTTF in the predeployment phase.

Our ns $c04-math-0620$ 2 simulation results of fault detection algorithms suggested that both the simulated and real-world data needed to be considered for rigorous evaluation of fault detection algorithms. Although some algorithms performed better than other algorithms on simulated data, these algorithms performed inferior to other algorithms on real-world data. We observed that the fault detection algorithm's accuracy played a crucial role in FT WSNs by comparing our results against a perfect fault detection algorithm ( $c04-math-0621$ ). The results indicated that the percentage improvement in reliability for an FT sensor node with $c04-math-0622$ over an NFT sensor node is 230% and the percentage improvements in reliability for an FT sensor node with $c04-math-0623$ over an FT sensor node ( $c04-math-0624$ ) with an average number of neighbors $c04-math-0625$ equal to 5, 10, and 15 was 172.36%, 166.31%, and 165.32%, respectively, for $c04-math-0626$ . The percentage improvement in reliability for an FT WSN cluster with $c04-math-0627$ over an NFT WSN cluster and an FT WSN cluster with $c04-math-0628$ was 601.12% and 142.73%, respectively, for $c04-math-0629$ . The percentage improvement in reliability for an FT WSN with $c04-math-0630$ over an NFT WSN and an FT WSN with $c04-math-0631$ was 350.18% and 236.22%, respectively, for $c04-math-0632$ .

Results indicated that our FT model could provide on average a 100% MTTF improvement with a perfect fault detection algorithm, whereas the MTTF improvement varied from 95.95% to 1.34% due to a fault detection algorithm's typical poor performance at high sensor failure rates. We also observed that the redundancy in WSNs plays an important role in improving WSN reliability and MTTF. Results revealed that just three redundant sensor nodes in a WSN cluster resulted in an MTTF improvement of 103.35% on average. Similarly, redundancy in WSN clusters contributes to the reliability and MTTF improvement, and the results indicated that three redundant WSN clusters could improve the MTTF by 52.22% on average. The iso-MTTF results indicated that an NFT WSN cluster required three redundant sensor nodes to attain a comparable MTTF as that of an FT WSN cluster. Iso-MTTF results further indicated that an NFT WSN required nine redundant WSN clusters to achieve a comparable MTTF as that of an FT WSN.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Probability	NFT	FT	FT	FT	FT
$c04-math-0327$	$c04-math-0328$	$c04-math-0329$	$c04-math-0330$	$c04-math-0331$	$c04-math-0332$
0.05	0.94999	0.99770	0.99872	0.99872	0.99872
0.1	0.89996	0.98135	0.99074	0.99102	0.99482
0.2	0.80011	0.93481	0.95088	0.95195	0.97853
0.3	0.69977	0.86265	0.88263	0.88513	0.94959
0.4	0.59989	0.77094	0.79209	0.79485	0.90643
0.5	0.5007	0.66086	0.68097	0.68374	0.84662
0.6	0.40012	0.53610	0.55295	0.55552	0.76663
0.7	0.30119	0.40167	0.41432	0.41612	0.66262
0.8	0.19989	0.25943	0.26683	0.26812	0.52171
0.9	0.10026	0.12148	0.12424	0.12470	0.33086
0.99	0.01005	0.01048	0.01053	0.01056	0.05628

$c04-math-0452$	NFT	FT $c04-math-0453$	FT $c04-math-0454$
0.05	0.96720	0.99058	0.99098
0.1	0.88562	0.95969	0.96552
0.2	0.65567	0.84304	0.87474
0.3	0.41968	0.66438	0.74422
0.4	0.23309	0.45925	0.59388
0.5	0.10942	0.26595	0.43653
0.6	0.04098	0.11837	0.28732
0.7	0.01114	0.03548	0.16279
0.8	$c04-math-0455$	$c04-math-0456$	0.06723
0.9	$c04-math-0457$	$c04-math-0458$	0.01772
0.99	$c04-math-0459$	$c04-math-0460$	$c04-math-0461$

$c04-math-0551$	NFT	FT $c04-math-0552$	FT $c04-math-0553$
0.05	0.99557	0.99883	0.99885
0.1	0.98261	0.99474	0.99534
0.2	0.93321	0.97583	0.98084
0.3	0.85557	0.93775	0.95482
0.4	0.75408	0.87466	0.91611
0.5	0.63536	0.78202	0.86218
0.6	0.51166	0.65121	0.78948
0.7	0.36303	0.49093	0.69527
0.8	0.20933	0.30328	0.55494
0.9	0.08807	0.11792	0.39647
0.99	$c04-math-0554$	$c04-math-0555$	0.08807