5.1. Introduction and Background

There is no single agreed-upon definition of system survivability. Instead, one may use as a starting point the vague notion that a system has to be able to tolerate diverse faults. This includes the faults typically considered in the area of fault-tolerant system design, such as faults resulting from component failure as a consequence of aging, fatigue, or breakdown of materials. These faults may exhibit very predictable behavior and frequency. However, in the last decade much attention has been devoted to human-induced malicious faults (e.g., hacking, denial of service, viruses, Trojan horses, spoofing). These kinds of faults may be totally unpredictable. Before elaborating on the predictability of faults, or the lack thereof, let us investigate some definitions of system survivability.

Definitions of system survivability can be partitioned into qualitative and quantitative definitions [1]. Survivability definitions vary even within the same application domain. For example, the American Institute of Aeronautics and Astronautics (www.aiaa.org) defines aircraft combat survivability as “the capability of an aircraft to avoid or withstand a man-made hostile environment.” Whereas this definition is qualitative, its measurement is quantitative: it can be expressed as the probability that the aircraft survives an encounter (combat) with the environment. A more general definition of aircraft survivability is the capability of an aircraft to avoid or withstand hostile environments, including both man-made and naturally occurring ones, such as lightning strikes, mid-air collisions, and crashes. A qualitative definition that has been used extensively was introduced by Ellison et al. [2], defining “survivability as the capability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failure, or accidents.” This definition refines the mission-oriented notion of survivability that dates back to the 1960s in the context of mission reliability, as seen in MIL-STD-721 or DOD-D 5000.3. The Ellison definition has been the basis for several procedural approaches to enhancing survivability based on the concept of Survivable Network Analysis (SNA), introduced by Mead et al. [3]. Similarly, Neumann [4] states: “survivability is the ability of a computer-communication system-based application to satisfy and to continue to satisfy certain critical requirements (e.g., specific requirements for security, reliability, real-time responsiveness, and correctness) in the face of adverse conditions. Survivability must be defined with respect to the set of adversities that are supposed to be withstood.” Survivability of software systems has been defined by Deutsch [5] as “the degree to which essential functions are still available although some part of the system is down.”

Qualitative definitions of survivability, such as those indicated above, are useful for conveying the general notion of survivability; however, they are less suitable for measuring survivability, comparing the survivability of different systems, or measuring the impact of efforts to increase survivability. For example, the SNA described by Mead et al. [3] helps to identify which parts of a system are deemed essential and results in recommended actions to increase survivability, but there is no direct way to measure the benefit of an individual recommendation. This suggests the need for quantitative definitions.

A formal definition of system survivability was given by Knight [6, 7] where “a system is survivable if it complies with its survivability specification.” The survivability specification is then defined as an n-tuple. Specifically, in [6]:

The survivability specification is given as a four-tuple (E, R, P, M), where:

  • Environment E is a definition of the environment in which the survivable system has to operate. It includes details of the various hazards to which the system might be exposed together with all of the external operating parameters. To the extent possible, it must include any anticipated changes that might occur in the environment.

  • Specification set R is the set of specifications of tolerable forms of service for the system. This set will include one distinguished element that is the normal or preferred specification, i.e., the specification that provides the greatest value to the user and with which the system is expected to comply most of the time. It is worth noting that at the other extreme, a completely inert system, i.e., no functionality at all, might be a tolerable member of this set.

  • Probability distribution P associates a probability with each member of the set R, with the sum of these probabilities being one. The probability associated with the preferred specification defines the fraction of operating time during which the preferred specification must be operational. The probabilities associated with the other specifications are upper bounds and define the maximum fractions of operating time that the associated specifications can be operational.

  • Finite-state machine M is denoted by the four-tuple (S, s0, V, T), where S is a finite set of states, each labeled uniquely with one of the specifications in R; s0 is the initial (or preferred) state of the machine; V is the finite set of customer values; and T is the state transition matrix.

In this definition, probabilities are associated with the individual states, and system survivability is defined as the probability of being in a preferred state. The reader is referred to the articles by Knight [6, 7] for more details and examples.
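To make the structure of the four-tuple (E, R, P, M) concrete, the following is a minimal sketch of a survivability specification in Python. All class, field, and event names here are illustrative assumptions of ours, not notation from Knight's articles; the environment E is reduced to an informal description, and the transition matrix T is represented as a mapping from (state, event) pairs to next states.

```python
from dataclasses import dataclass

@dataclass
class SurvivabilitySpec:
    """Illustrative sketch of a survivability specification (E, R, P, M)."""
    environment: str        # E: hazards and operating parameters (informal here)
    specifications: list    # R: tolerable service specifications; index 0 is preferred
    probabilities: dict     # P: specification -> fraction of operating time (sums to one)
    values: dict            # V: specification -> relative value to the customer
    transitions: dict       # T: (state, event) -> next state
    initial: str = ""       # s0: initial (preferred) state

    def __post_init__(self):
        if not self.initial:
            self.initial = self.specifications[0]
        # The probabilities in P must sum to one over the members of R.
        assert abs(sum(self.probabilities.values()) - 1.0) < 1e-9

    def run(self, events):
        """Trace the finite-state machine M over a sequence of events;
        unknown (state, event) pairs leave the state unchanged."""
        state = self.initial
        for event in events:
            state = self.transitions.get((state, event), state)
        return state

# A hypothetical three-level specification, including the completely
# inert system mentioned above as a tolerable member of R.
spec = SurvivabilitySpec(
    environment="network subject to link failures and attacks",
    specifications=["full_service", "degraded", "inert"],
    probabilities={"full_service": 0.95, "degraded": 0.04, "inert": 0.01},
    values={"full_service": 1.0, "degraded": 0.5, "inert": 0.0},
    transitions={("full_service", "link_fail"): "degraded",
                 ("degraded", "repair"): "full_service",
                 ("degraded", "attack"): "inert"},
)
```

For instance, `spec.run(["link_fail", "repair"])` traces the machine from the preferred state through the degraded state and back.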

A definition that focuses on what happens after a fault occurs is given by the T1A1.2 group and is stated by Liu and Trivedi [1]: “Suppose a measure of interest M has the value m0 just before a failure occurs. The survivability behavior can be depicted by the following attributes: ma is the value of M just after the failure occurs, mu is the maximum difference between the value of M and ma after the failure, mr is the restored value of M after some time tr, and tR is the time for the system to restore the value of m0.” The interesting aspect of this definition is that the actual time of the fault is of no importance. Thus, the definition addresses the state of a system just before a fault and captures the impact of the fault (i.e., the behavior of the system after the fault occurred).

Whereas it is important to understand the differences between, and the expressive power of, the individual definitions above, we shall not be too pedantic about definitions and will instead focus on general strategies for understanding the impact of faults (i.e., fault models and the implications of these models for recovery and thus survivability).
