Risk Management and Disaster Recovery in the Cloud

Saman Zonouz

Rutgers University, New Brunswick, NJ, USA

7.1 Introduction

Keeping cloud infrastructures, systems, and networks secure is a continual race against attackers. The growing number of security incidents indicates that current approaches to building systems do not sufficiently address the increasing variety and sophistication of threats and do not block attacks before systems are compromised. Organizations must resort to trying to detect malicious activity that occurs, so efficient intrusion detection systems (IDSs) are deployed to monitor systems and identify misbehavior. However, IDSs alone are not sufficient to allow operators to understand the security state of their organization, because monitoring sensors usually report all potentially malicious traffic without regard to the actual network configuration, vulnerabilities, and mission impact. Moreover, given large volumes of network traffic, IDSs with even small error rates can overwhelm operators with false alarms. Even when true intrusions are detected, the actual mission threat is often unclear, and operators are unsure what actions they should take. Security administrators need to obtain updated estimate summaries regarding the security status of their mission‐critical assets precisely and continuously, based on alerts that occur, in order to respond effectively to system compromises and prioritize their response and recovery actions. This requirement is even stronger in the context of smart energy infrastructure where incorrect decision‐making related to the security of process controls can have dramatic consequences.

7.2 Background

Extensive research has been conducted over the past decade on the topics of system situational awareness and security metrics. Security metrics and evaluation techniques fall into two categories. First, with static solutions, an IDS alert scoring value is hard‐coded on each detection rule; the (alert, score) mappings are stored in a lookup table to be used later to prioritize alerts. The advantages of static techniques are their simplicity and speed. However, they lack flexibility, mainly because they completely ignore the system configuration, and they scale poorly, since it is infeasible to predict all the alert combinations from IDSs in a large‐scale network.

Second, there are dynamic methods, which are mostly based on attack‐graph analysis. The main idea is to capture potential system vulnerabilities and then extract all possible attack paths. The generated graph can be used to compute security metrics and assess the security strength of a network. These techniques can also be used predictively to rank IDS alerts. In particular, Topological Vulnerability Analysis (TVA) matches the network configuration with an attack simulation in order to optimize IDS sensor placement and to prioritize IDS alerts. The primary issue with attack‐graph‐based techniques is that they require important assumptions about attacker capabilities and vulnerabilities. While these approaches are important in planning for future attack scenarios, we take a different perspective by relying on past consequences, actual security requirements, and low‐level system characteristics, such as file and process dependencies, instead of hypothetical attack paths. As a result, our method is defense‐centric rather than attack‐centric and does not suffer from the issues of unknown vulnerabilities and incomplete attack coverage.

Defense‐centric approaches, on the other hand, use manually filled knowledge bases of alert applicability, system configuration, or target importance to associate a context with each alert and to provide situational awareness accordingly. Damage‐assessment capabilities have previously been explored via file‐tainting analysis for malware detection, for offline forensic analysis using backtracking or forward‐tracking techniques, and for online damage situational awareness.

The current techniques for the security‐state estimation problem generally fall short in two major respects. First, existing solutions rely heavily on human knowledge and involvement. The system administrator should observe the triggered IDS alerts (possibly in a visual manner) and manually evaluate their criticality, which can depend on the alerts' accuracy, the underlying system configuration, and high‐level security requirements. As the size of cloud infrastructures and their networks increases, the manual inspection of alerts usually becomes very tedious, if not impossible, in practice. Requiring extensive human knowledge, the current model‐based approaches try to compute security metrics based on a manually designed model and a strong set of assumptions about attackers' behaviors and the vulnerabilities within the system.

Second, previous techniques for IDS alert correlation and system security state estimation usually focus only on the attack paths and subsequent privilege escalations, without considering dependencies between system assets. In doing so, they define the security metric of a given system state to be the minimum number of vulnerability exploitations (i.e. privilege escalations) needed to get from that state to the goal state in which the attacker gains the privileges necessary to cause the final malicious consequence (e.g. a sensitive file modification). We call these attack‐centric metrics. Therefore, this type of metric is not defined independently for a non‐goal state; it only reflects how far that state is from the goal state. Equivalently, attack‐centric metric definitions are created with the assumption that all attackers will pursue exploitations until they get to the goal state, which is insecure by definition. However, in practice, there are often unsuccessful attacks that cause partial damage to systems, such as a web server crash as the result of an unsuccessful buffer‐overflow exploitation. Hence, it is important to consider not only future vulnerability exploitations, but also the damage already caused by the attacker.
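As an illustration of the attack‐centric metric described above, the following minimal sketch computes the fewest exploitation steps from a state to the goal state with a breadth‐first search. The attack graph, its state names, and the function name `exploits_to_goal` are hypothetical and serve only as an example; they are not taken from any particular tool.

```python
from collections import deque
from typing import Dict, List, Optional


def exploits_to_goal(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[int]:
    """Return the fewest exploit steps from `start` to `goal`, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        state, dist = queue.popleft()
        if state == goal:
            return dist
        for nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical attack graph: each edge is one vulnerability exploitation.
attack_graph = {
    "internet": ["web_user"],
    "web_user": ["web_root", "db_user"],
    "web_root": ["db_root"],
    "db_user": ["db_root"],
}

steps_left = exploits_to_goal(attack_graph, "web_user", "db_root")
```

Note that such a distance says nothing about damage already done in the state `web_user` (for example, a crashed web server process), which is the gap the text points out.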

7.3 Consequence‐Centric Security Assessment

To address various limitations of past solutions, we introduce an information flow‐based system security metric called Seclius. Seclius works by evaluating IDS alerts received in real time to assess how much attackers affect the system and network asset security. This online evaluation is performed using two components: (i) a dependency graph and (ii) a consequence tree. These two components are designed to identify the context required around each IDS alert to accurately assess the security state of the different assets in the organization.

Specifically, the dependency graph is a Bayesian network automatically learned during a training phase in which the system is behaving normally. The dependency graph captures the low‐level dependencies among all the files and processes used in the organization. The consequence tree is a simple tree structure defined manually by administrators to formally describe at a high level the most critical assets in the organization. When a new IDS alert is received, a belief propagation algorithm, Monte Carlo Gibbs sampling, combines the dependency graph and the consequence tree to calculate online the probability that the critical assets in the organization have not been affected and are still secure. Consequently, Seclius assesses organizational security using a bottom‐up logical propagation of the probabilities that assets are or are not compromised.

Seclius minimizes reliance on human expertise by clearly separating high‐level security requirements from the low‐level system characteristics of an IT infrastructure. We developed an algorithm and a set of instruments to automatically learn the dependency graph, which represents the system characteristics by capturing information flows between files and processes, within virtual machines (VMs) of the cloud infrastructure and across the network. As a result, administrators are not required to define such low‐level input, so they can focus on identifying high‐level organizational security requirements using the consequence tree. These requirements are most often subjective and cannot be automatically discovered. In practice, even in large organizations, the consequence tree contains very few assets, e.g. a web server and a database, and does not require detailed system‐level expertise. In addition, as a defense‐centric metric, Seclius assesses system security by focusing solely on past consequences, and hence it assumes nothing about system vulnerabilities and attackers' future behaviors, e.g. possible attack paths.

It is worth emphasizing that we do not provide an intrusion‐detection capability per se; instead, Seclius assesses organizational security based on the set of alerts triggered by underlying IDSs. Therefore, Seclius would not update the security measure if an attack were not detected by an IDS. Furthermore, it is not an intrusion‐response system and, as in our experiments, does not explicitly respond to attacks; instead, it helps administrators or response systems react by providing situational awareness capabilities.

7.3.1 High‐Level Overview of Cloud Risk Assessment and Disaster Recovery

We first describe how attack‐centric and defense‐centric security metrics are distinguished. In past related work on attack‐centric security metrics, e.g. attack graph‐based techniques, where there are one or more goal states, the security measure of a non‐goal state in the attack graph is not independently calculated, i.e. it is calculated as a function of “how close” the non‐goal state is to the goal state (from an attack‐centric viewpoint, how much is left to reach the destination). Therefore, from an attack‐centric viewpoint, if the attack stops in a non‐goal state, the attacker gains nothing (according to the model); however, from a defense‐centric viewpoint, the system may have already been affected through past exploitations on the attacker's way to get to the current non‐goal state. For instance, let us assume that the attacker's end goal is to read a sensitive file residing in a back‐end database server; on the attacker's way to the current non‐goal state, a buffer‐overflow vulnerability in a web server system is exploited, resulting in a server process crash (unavailable web server). According to the attacker's objective, the attacker has not gained anything yet; however, the defense‐centric system security has been affected because a network functionality is lost (web server crash). Consequently, the defense‐centric security measure of a non‐goal state, independent of the goal state and regardless of whether the attack will succeed, may not be the highest value (1 in this case).

Seclius's high‐level goal (Figure 7.1) is to assess the security of each possible system state with minimal human involvement. In particular, we define a system's security state as a binary vector in which each bit indicates whether a specific malicious event has occurred in the system. We consider two types of malicious events. First, there are vulnerability exploitations, which are carried out by an attacker to obtain specific privileges and improve the attacker's control over the system. Therefore, the first set of bits in a state denotes the attacker's privileges in that state, e.g. root access on the web server VM. Those bits are used to determine what further malicious damage the attacker can cause in that state. Second, there are attack consequences, which are caused by the attacker after they obtain new privileges. Specifically, we define consequences as the violations of the CIA criteria (confidentiality, integrity, and availability) applied to critical assets in the organization, such as specific files and processes. For example, the integrity of a file F2, denoted by I(F2), is compromised if F2 is either directly or indirectly modified by the attacker.

Diagram illustrating high-level architecture for security risk assessment and disaster recovery with arrows linking boxes labeled administration, instrumented systems, critical components, etc.

Figure 7.1 High‐level architecture for security risk assessment and disaster recovery.

The security of any given system is characterized by a set of its identifiable attributes, such as security criteria of the critical assets, e.g. integrity of a database file; the notion of a security metric is defined as a quantitative measure of how much of that attribute the system possesses. Our goal is to provide administrators with a framework to compute such measures in an online manner. We believe that there are two major barriers to achieving this goal. First, while critical assets are system‐specific and should be defined by administrators, a framework that requires too much human involvement is prone to errors and has limited usability. As a result, a formalism is needed so that administrators can define assets simply and unambiguously. Second, low‐level IDS alerts usually report on local consequences with respect to a specific domain. Consequently, we need a method that provides understanding of what the low‐level consequences represent in a larger context and quantifies how many of the security attributes the whole system currently possesses, given the set of triggered alerts.

The Seclius framework addresses those two challenges. Seclius works based on two inputs: a manually defined consequence tree and an automatically learned dependency graph. The hierarchical, formal structure of the consequence tree enables administrators to define critical assets easily and unambiguously with respect to the subjective mission of the organization. The dependency graph captures the dependencies between these assets and all the files and processes in the system during a training period. In production mode, Seclius receives low‐level alerts from intrusion‐detection sensors and uses a taint‐propagation analysis method to evaluate online the probabilities that the security attributes of the critical assets are affected.

Regarding how large a consequence tree can become in a real‐world setting, we highlight that the consequence tree needs to include only the primary critical assets. For instance, in a banking environment, the credit card database would be labeled as critical. However, the consequence tree does not have to include indirectly critical assets, which are defined as assets that become critical because of their (possible) interaction with other critical assets. For instance, a web server that interacts with a credit card database would also be partially considered as critical but would not need to be labeled as such. We emphasize that the number of primary critical assets is usually very low even for large target infrastructures. All inter‐asset dependencies and system‐level details are captured by the dependency graph that is generated and analyzed automatically. The small size of the manually constructed consequence tree and the automated generation of the dependency graph improve the scalability of Seclius remarkably, as shown in the experiments.

To further clarify the conceptual meaning of the system security measure in Seclius, here we describe in more detail what it represents mathematically. The security measure value that Seclius calculates represents the probability that the system is “insecure” according to the consequence tree designed by the administrator. For instance, if the consequence tree includes a single node “credit‐card‐numbers.db confidentiality,” the ultimate system security measure value will denote the probability that the confidentiality of “credit‐card‐numbers.db” has been compromised at any given state. Consequently, the calculated measures range continuously between 0 and 1, inclusive. In our implementations, we used discretization from [0,1] to the set {low, medium, high} to indicate when the system's security is high or when it needs to be taken care of. Seclius combines the consequence tree and the dependency graph to measure the security of the system as the probability that the critical assets become directly or indirectly affected by intruders. The critical assets, encoding the security requirements, are defined by system administrators. The probability value is evaluated using cross dependencies among the system assets. The dependencies are captured during an offline learning phase, but the probability evaluation works online and is triggered by security alerts received from sensors and IDSs.

To illustrate how Seclius works in practice, consider a scenario in which the IT infrastructure of a cloud provider is instrumented to enable comprehensive security monitoring of systems and networks. To use Seclius on top of the deployed IDSs, administrators would first define critical assets and organizational security requirements through the consequence tree. We emphasize that this task does not require deep‐knowledge expertise about the IT infrastructure. In our example, the critical assets would include end‐user VMs and a database server. In particular, the security requirements would consist of the availability of the VMs and the integrity and confidentiality of the database files.

The second step would be to run a training phase with no ongoing attack in order to collect data on intra‐ and inter‐VM dependencies between files and processes. After a few hours, the results of this training phase would be automatically stored in the dependency graph, and the instruments used to track the dependencies would be turned off.

The third and final step would be to run Seclius in production mode. Seclius starts processing alerts from IDSs by using the generated dependency graph and probabilistically determines whether the critical assets are compromised. If a vulnerability exploitation in the customer web server were detected by an IDS, Seclius would update not only the security measure of the corresponding web server, but also that of the set of systems that depend on the web server (e.g. the back‐end database), according to the learned dependency graph. By observing the real‐time IDS alerts and the learned inter‐asset dependencies, Seclius can precisely measure (i) the privileges gained by the attacker and which security domains the attacker was able to reach; and (ii) how the integrity, confidentiality, or availability of the assets was affected by the exploit directly or indirectly. In the next sections, we further describe the mathematical tools and formalism used by the various components of Seclius to provide that information.

7.3.2 Cloud Security Consequence Tree

In this section, we discuss the first manual input required by Seclius: the consequence tree (CT). The goal of the CT is to capture critical IT assets and organizational security requirements. The criticality level of individual assets within an organization is indeed an environment‐specific issue. In other words, the criticality levels heavily depend on organizational missions and/or business‐level objectives. For instance, consider a security‐critical cloud infrastructure network, whose mission is to provide millions of end users with crucial data‐storage and computational services. In such an environment, provision of high‐availability guarantees for a server VM sample, which happens to be a critical cyber asset of the enterprise, is often much more critical than for a logging database, which is used to store historical system incidents for later analyses. Hence, Seclius requires system administrators to manually provide the list of organizational critical assets.

The critical assets could be provided using any function: a simple list (meaning that all items are equally important), a weighted list, or a more complex combination of assets. In this chapter, we use a logical tree structure. We believe that it offers a good trade‐off between simplicity and expressiveness, and the fact that it can be represented visually makes it a particularly helpful resource for administrators. The formalism of the CT follows the traditional fault‐tree model; however, unlike fault trees, the leaf nodes of the CTs in Seclius address security requirements (confidentiality, integrity, and availability) of critical assets, rather than dependability criteria.

The CT formalism consists of two major types of logical nodes: AND and OR gates. To design an organizational CT, the administrator starts with the tree's root node, which identifies the primary high‐level security concern/requirement, e.g. “Organization is not secure.” The rest of the tree recursively defines how different combinations of the more concrete and lower‐level consequences can lead to the undesired status described by the tree's root node. The recursive decomposition procedure stops once a node explicitly refers to a consequence regarding a security criterion of a system asset, e.g. “availability of the Apache server is compromised.” These nodes are in fact the CT's leaf consequence nodes, each of which takes on a binary value indicating whether its corresponding consequence has happened (1) or not (0). Throughout the chapter, we use the C, I, and A function notations to refer to the CIA criteria of the assets. For instance, C(F2) and I(P6) denote the confidentiality of file F2 and integrity of process P6, respectively. The leaves' values can be updated by IDSs, e.g. Samhain. The CT is derived as a Boolean expression, and the root node's value is consequently updated to indicate whether organizational security is still being maintained.
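The AND/OR structure described above can be sketched as a small recursive evaluator. This is a minimal illustration, not the Seclius implementation: the class names `Leaf` and `Gate`, the function `evaluate`, and the example tree are assumptions made for this sketch. Leaves carry binary values (1 if the consequence has happened, 0 otherwise), as in the text.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """A leaf consequence such as C(F10), I(F2), or A(P6)."""
    name: str


@dataclass(frozen=True)
class Gate:
    """An AND or OR gate over child nodes."""
    op: str  # "AND" or "OR"
    children: Tuple["Node", ...]


Node = Union[Leaf, Gate]


def evaluate(node: Node, leaves: Dict[str, int]) -> int:
    """Return 1 if the consequence described by `node` has occurred, else 0."""
    if isinstance(node, Leaf):
        return leaves.get(node.name, 0)  # unreported leaves count as 0
    values = [evaluate(child, leaves) for child in node.children]
    if node.op == "AND":
        return int(all(values))
    if node.op == "OR":
        return int(any(values))
    raise ValueError(f"unknown gate type: {node.op}")


# Hypothetical CT: root = (C(F10) AND A(P6)) OR I(F2)
ct = Gate("OR", (Gate("AND", (Leaf("C(F10)"), Leaf("A(P6)"))), Leaf("I(F2)")))

root_value = evaluate(ct, {"C(F10)": 1, "A(P6)": 1})
```

With this tree, a confidentiality violation of F10 alone leaves the root at 0, whereas an integrity violation of F2 alone sets it to 1.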

A CT indeed formulates subjective aspects of the system. Its leaf nodes list security criteria of the organization's critical assets. Additionally, the CT implicitly encodes how critical each asset is, using the depth of its corresponding node in the tree; that is, the deeper the node, the less critical the asset is in the fulfillment of organizational security requirements. Furthermore, the CT formulates redundant assets using AND gates. Seclius requires administrators to explicitly mention redundancies because it is often infeasible to discover redundancies automatically over a short learning phase.

Although the CT formulation can be considered as a particular kind of the general attack‐tree formalism, its application in Seclius is different from how attack trees have been used typically in the past: CTs formulate how past consequences contribute to the overall security requirements, whereas attack trees usually address attackers' potential intents against the organization. In other words, at each instant, given the consequences already caused by the attackers in the system, Seclius employs the CT to estimate the current system security, whereas the system's attack tree is often used to probabilistically estimate how an attacker could or would penetrate the system in the future. Therefore, the consequence tree formalism mainly considers adversarial consequences against assets, such as a web server process, needed for the organization to fulfill its mission objectives, and may not address vulnerability exploitations that cause privilege escalations for the attackers without affecting the system's current performance.

7.3.3 Cloud‐Based Dependency Graph

As mentioned in the previous section, the CT captures only subjective security requirements and does not require deep‐knowledge expertise about the IT infrastructure, thanks to the dependency graph (DG). The goal of the DG is to free the administrator from providing low‐level details about the organization. Those details are automatically captured by Seclius through a learning phase, during which the interactions between files and processes are tracked in order to probabilistically identify direct or indirect dependencies among all the system assets. For instance, in a database server, the administrator only needs to list the sensitive database files, and Seclius later marks the process “mysqld” as critical because it is in charge of reading and modifying the databases. Such a design greatly reduces the resources and time spent by administrators in deploying Seclius.

Each vertex in the DG (Figure 7.2) represents an object (a file, a process, or a socket), and the direct dependency between two objects is established by any type of information flow between them. For instance, if data flows from object o_i to o_j, then object o_j becomes dependent on o_i; the dependency is represented by a directed edge in the DG, i.e. from o_i to o_j. To capture that information, Seclius intercepts system calls (syscalls) and logs them during the learning phase. In particular, we are interested in the syscalls that cause data dependencies among OS‐level objects. A dependency relationship is stored as three elements: a source object, a sink object, and their security contexts.

Concept map illustrating generated dependency graph through system-call interception, with arrows linking ellipses labeled gomme-screensav, dropbox, indicator-me-se, pulse audio, gnome-panel, etc.

Figure 7.2 Generated dependency graph through system‐call interception.

The security context can be object privileges, or, if SE‐Linux is deployed, a security type. We classify dependency‐causing events based on the source and sink objects for the dependency they induce: process‐process, process‐file, and process‐filename. The first category of events includes those for which one process directly affects the execution of another process. One process can affect another directly by creating it, sharing memory with it, or sending a signal to it. The second category of events includes those for which a process affects or is affected by data or attributes associated with a file. Syscalls like write and writev cause a process‐to‐file dependency, whereas syscalls like read and readv cause a file‐to‐process dependency. The third category of events is for processes affecting or being affected by a filename object. Any syscall that includes a filename argument, e.g. open, causes a filename‐to‐process dependency, as the return value of the syscall depends on the existence of that filename in the file system directory tree. When the learning phase is over, syscall logs are automatically parsed and analyzed line by line to generate the DG. Each dependence edge is tagged with a frequency label indicating how many times the corresponding syscalls were called during the execution.
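The log‐to‐graph step can be sketched as follows. The record format `(process, syscall, object)`, the `name:` prefix for filename objects, and the function name `build_dependency_graph` are assumptions made for this illustration; the actual Seclius log format is not specified here, and only a handful of syscalls from the text are handled.

```python
from collections import Counter
from typing import Iterable, Tuple

# Direction of the information flow each syscall induces (subset only).
PROCESS_TO_FILE = {"write", "writev"}
FILE_TO_PROCESS = {"read", "readv"}
FILENAME_TO_PROCESS = {"open"}


def build_dependency_graph(records: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count directed (source, sink) edges from (process, syscall, object) records."""
    edges: Counter = Counter()
    for process, syscall, obj in records:
        if syscall in PROCESS_TO_FILE:
            edges[(process, obj)] += 1
        elif syscall in FILE_TO_PROCESS:
            edges[(obj, process)] += 1
        elif syscall in FILENAME_TO_PROCESS:
            # Filename objects are kept distinct from file contents.
            edges[(f"name:{obj}", process)] += 1
        # Syscalls outside the three sets are ignored in this sketch.
    return edges


# Hypothetical log excerpt.
log = [
    ("P1", "write", "F4"),
    ("P1", "write", "F4"),
    ("P1", "read", "F7"),
    ("P1", "open", "F7"),
    ("P1", "kill", "P2"),
]
dg_edges = build_dependency_graph(log)
```

Each counter value plays the role of the frequency label on the corresponding edge. Process‐to‐process events (creation, shared memory, signals) would need additional cases.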

We make use of the Bayesian network formalism to store probabilistic dependencies in the DG; a conditional probability table (CPT) is generated and associated with each vertex. This CPT encodes how the information flows through that vertex from its parents (sources of incoming edges) to its children. For example, if some of the parent vertices of a vertex become tainted directly or indirectly by attacker data, the CPT in the vertex saves the probability that the vertex (specifically, the OS‐level object represented by the vertex) also is tainted. More specifically, each DG vertex is modeled as a binary random variable (representing a single information flow), equal to either 1 (true) or 0 (false) depending on whether the vertex has been tainted; the CPT in a vertex v stores the probability that the corresponding random variable will take the true value (v = 1), given the binary vector of values of the parent vertices P(v).

Figure 7.3 illustrates how a CPT for a single flow (with 1‐bit random variables) is produced for a sample vertex, i.e. the file F4. The labels on the edges are the probabilities that information flows along them. For instance, process P1 writes data to files F4 and F7 with probabilities 0.3 and 0.7, respectively. As shown in the figure, the file F4 cannot become tainted if none of its parents are tainted, i.e. Pr(F4 | !P1, !P9) = 0. If only process P1 is tainted, F4 can become tainted only when information flows from P1, i.e. Pr(F4 | P1, !P9) = 0.3. If both of the parents are already tainted, then F4 would be tainted when information flows from either of its parent vertices. In that case, the probability of F4 being tainted would be the complement of the probability that information flows from none of its parents. Therefore, Pr(F4 | P1, P9) = 1 − (1 − 0.3) × (1 − 0.8) = 1 − 0.14 = 0.86.

Schematic with arrows labeled 0.7 and 0.3 from P1 pointing to F7 and F4, respectively, from P9 to F4 (0.8) and F2 (0.2), and from F4 to P6. P1, P6, P9, F2, F4, and F7 are inside a circle.

Figure 7.3 Conditional probability table construction.
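The CPT construction for F4 can be reproduced with a few lines of code. The sketch below applies the rule used in the example (the vertex stays untainted only if no tainted parent passes information to it) to every combination of parent values. The function name `build_cpt` and the dictionary layout are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, Tuple


def build_cpt(parent_flow: Dict[str, float]) -> Dict[Tuple[int, ...], float]:
    """Map each parent taint assignment to Pr(child tainted | assignment).

    Keys are tuples of 0/1 values ordered by sorted parent name.
    """
    parents = sorted(parent_flow)
    cpt = {}
    for assignment in product((0, 1), repeat=len(parents)):
        p_no_flow = 1.0  # probability that no tainted parent passes data on
        for name, tainted in zip(parents, assignment):
            if tainted:
                p_no_flow *= 1.0 - parent_flow[name]
        cpt[assignment] = 1.0 - p_no_flow
    return cpt


# Edge probabilities into F4 taken from Figure 7.3; key order is (P1, P9).
cpt_f4 = build_cpt({"P1": 0.3, "P9": 0.8})
```

The entries for (0, 0), (1, 0), and (1, 1) match the values 0, 0.3, and 0.86 derived in the text; the same rule gives 0.8 for the remaining case in which only P9 is tainted.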

Each CT leaf node that represents a CIA criterion of a critical asset is modeled by Seclius as an information flow between the privilege domains controlled by the attacker (according to the current system state) and those that are not yet compromised. Confidentiality of an object is compromised if information flows from the object to any of the compromised domains. Integrity of an object is similarly defined, but the flow is in the reverse direction. Availability is not considered an information flow by itself; however, an object's unavailability causes a flow originating at the object, because once an object becomes unavailable, it no longer receives or sends out data as it would if it was not compromised. For instance, if a process frequently writes to a file, then once the process crashes, the file is not modified by the process, possibly causing inconsistent data integrity; this is modeled as a propagation of tainted data from the process to the file. We consider all the leaf nodes that concern the integrity criterion of critical assets as a single information flow, because they conceptually address the same flow from any of the compromised domains to the assets. However, confidentiality flows cannot be grouped, as they originate individually at separate sources.

If each information flow is represented as a bit, then to completely address “n” concurrent information flows, we define the random variable in each vertex as an “n”‐bit binary vector in which each bit value indicates whether the vertex is already tainted by the bit's corresponding flow. In other words, to consider all the security criteria mentioned in a CT with “n” leaves, every vertex represents an “n”‐bit random variable (assuming integrity bits are not grouped), where each bit addresses a single flow (i.e. a leaf node). The CPTs are generated accordingly; a vertex CPT stores the probability of the vertex's value given the value of its parents, each of which, instead of true or false, can take on any “n”‐bit value.

7.3.4 Cloud Security Evaluation

Given the DG generated during the learning phase, the operator turns off the syscall interception instruments and puts the system in production mode. The learned DG is then used in an online manner to evaluate the security of any system security state. The goal of this section is to explain how this online evaluation works in detail. We first assume that the IDSs report the exact system state with no uncertainty. We discuss later how Seclius deals with IDS inaccuracies.

At each instant, to evaluate the security of the system's current state “s,” DG vertices are first updated according to “s,” which indicates the attacker's privileges and past consequences (CT's leaf nodes). For each consequence in “s,” the corresponding flow's origin bit in DG is tainted. For instance, if file F4 is modified by the attacker (integrity compromise), the corresponding source bit in DG is set to 1 (evidence bit).

The security measure for a given state “s” is defined to be the probability that the CT's root value is still 0 (Pr(!root(CT) | s)), which means organizational security has not yet been compromised. More specifically, if the CT is considered as a Boolean expression, e.g. CT = (C(F10) AND A(P6)) OR I(F2), Seclius calculates the corresponding marginal joint distribution, e.g. Pr[(C(F10) AND A(P6)) OR I(F2)], conditioned on the current system state (tainted evidence vertices).
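To make the definition concrete, the following sketch estimates Pr(!root(CT) | s) as the fraction of sampled leaf assignments for which the root of the example CT evaluates to 0. The samples listed here are hand‐written placeholders, not the output of any real sampler, and the names `ct_root` and `security_measure` are hypothetical.

```python
from typing import Callable, Dict, Iterable

LeafValues = Dict[str, int]


def ct_root(leaves: LeafValues) -> int:
    """Example CT from the text: (C(F10) AND A(P6)) OR I(F2)."""
    return int((leaves["C(F10)"] and leaves["A(P6)"]) or leaves["I(F2)"])


def security_measure(samples: Iterable[LeafValues],
                     root: Callable[[LeafValues], int]) -> float:
    """Fraction of samples in which the CT root is still 0 (secure)."""
    sample_list = list(samples)
    if not sample_list:
        raise ValueError("at least one sample is required")
    secure = sum(1 for leaves in sample_list if root(leaves) == 0)
    return secure / len(sample_list)


# Placeholder samples of the leaf values, conditioned on some state s.
samples = [
    {"C(F10)": 1, "A(P6)": 0, "I(F2)": 0},  # root = 0
    {"C(F10)": 1, "A(P6)": 1, "I(F2)": 0},  # root = 1
    {"C(F10)": 0, "A(P6)": 0, "I(F2)": 1},  # root = 1
    {"C(F10)": 0, "A(P6)": 0, "I(F2)": 0},  # root = 0
]
measure = security_measure(samples, ct_root)
```

Because the leaves are sampled jointly, correlations between them (for example, two assets that depend on the same process) are reflected in the estimate without assuming independence.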

Seclius estimates the security of the state “s” by calling a belief propagation procedure—the Gibbs sampler—on the DG to probabilistically estimate how the tainted data (evidence bits) are propagated through the system while it is in state “s.”

Generally, the Gibbs sampler algorithm is a Monte Carlo simulation technique that generates a sequence of samples from a joint probability distribution of two or more random variables X1, X2, …, Xn. The purpose of such a sequence in Seclius is to approximate the joint distribution numerically using a large number of samples. In particular, to calculate a joint distribution Pr(X1, X2, …, Xn | e1, …, em), where each ei represents a piece of evidence, the Gibbs sampler runs a Markov chain on X = (X1, X2, …, Xn) by (i) initializing X to one of its possible values x = (x_1, x_2, …, x_n); (ii) picking a uniformly random index i (1 ≤ i ≤ n); (iii) sampling x_i from Pr(Xi | x_1, …, x_(i−1), x_(i+1), …, x_n, e), i.e. conditioned on the current values of all other components of x and on the evidence (this conditional is derived from the conditional probability tables in the generated Bayesian network); (iv) updating the “x” vector; and (v) going back to step (ii). It has been proven that the stationary distribution of the Markov chain is the sought‐after joint distribution. Thus, drawing samples from the Markov chain at long enough intervals, after allowing enough time for the chain to approach the stationary distribution, gives approximately independent samples from the distribution Pr(X1, …, Xn | e).
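The sampling loop can be sketched on a toy Bayesian network. The example below is not Seclius's implementation: it assumes a hypothetical three‐vertex chain A → B → C of binary variables with made‐up CPT entries, treats A as the evidence vertex, and estimates Pr(C = 1 | A = 1), whose exact value for these numbers is 0.8 × 0.9 + 0.2 × 0.1 = 0.74. For simplicity it counts every post‐burn‐in step rather than thinning the chain.

```python
import random

# Hypothetical CPTs for a chain A -> B -> C of binary variables.
P_B_GIVEN_A = {0: 0.05, 1: 0.8}  # Pr(B = 1 | A = a)
P_C_GIVEN_B = {0: 0.1, 1: 0.9}   # Pr(C = 1 | B = b)


def bern(p_one: float, value: int) -> float:
    """Probability of `value` under a Bernoulli variable with Pr(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def gibbs_estimate(a: int, n_samples: int = 50000, burn_in: int = 1000,
                   seed: int = 0) -> float:
    """Estimate Pr(C = 1 | A = a) with a random-scan Gibbs sampler."""
    rng = random.Random(seed)
    x = {"B": 0, "C": 0}                       # (i) arbitrary initial assignment
    hits = 0
    for step in range(burn_in + n_samples):
        var = rng.choice(["B", "C"])           # (ii) uniformly random index
        if var == "B":
            # Pr(B | a, c) is proportional to Pr(B | a) * Pr(c | B).
            w1 = bern(P_B_GIVEN_A[a], 1) * bern(P_C_GIVEN_B[1], x["C"])
            w0 = bern(P_B_GIVEN_A[a], 0) * bern(P_C_GIVEN_B[0], x["C"])
            p_one = w1 / (w0 + w1)
        else:
            # C has no children, so Pr(C | a, b) is just Pr(C | b).
            p_one = P_C_GIVEN_B[x["B"]]
        x[var] = 1 if rng.random() < p_one else 0  # (iii) sample, (iv) update x
        if step >= burn_in:                    # discard the burn-in prefix
            hits += x["C"]
    return hits / n_samples                    # (v) is the loop itself


estimate = gibbs_estimate(a=1)
```

The estimate should land close to the exact value of 0.74; the burn‐in length and sample count are arbitrary choices that would need tuning for a real DG.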

We make use of the Gibbs sampler algorithm in Seclius for two main reasons. First, the DG model's joint distribution is not explicitly known initially; and second, analytical calculation of it can be tedious, if it is even possible, especially for large DGs. The Gibbs sampler uses the DG's CPTs to generate a large number of samples from the Pr[CT | s] distribution without directly calculating the density function. The security measure is estimated individually for each system state. Therefore, if the attacker modified any other object and/or got more privileges, the system would switch to a new state, whose security measure would be separately evaluated.

It is worth emphasizing that Seclius does not use the DG model to estimate how the attacker contacts other objects from a compromised object, such as a tainted process, to exploit a vulnerability and/or escalate their privileges. Seclius uses the DG only to estimate how the tainted data would propagate through other non‐compromised system assets, which would behave normally as they did during the learning phase. For every asset already compromised, Seclius assumes a pessimistic behavior model, i.e. the asset deterministically contacts all other assets in its privilege domain.

Seclius estimates the security measure of each system state in the state space and stores the values in a table in an offline manner. Thus, given the system state, it looks up the table and instantly retrieves the corresponding value. If the state space is too large to be preprocessed, Seclius dynamically runs the belief propagation given a system state and estimates the security measure in an online manner.
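One way to picture the offline table with an online fallback is the sketch below. The class name, the encoding of a state as a frozenset of labels, and the decision to cache online results are all assumptions made for illustration; the estimator passed in stands for the belief propagation step.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical state encoding: the set of consequences and privileges in the state.
State = FrozenSet[str]


class SecurityTable:
    """Lookup table of security measures with on-demand estimation."""

    def __init__(self, estimate: Callable[[State], float]) -> None:
        self._estimate = estimate          # stands in for the sampling procedure
        self._table: Dict[State, float] = {}

    def preprocess(self, states: Iterable[State]) -> None:
        """Offline phase: estimate and store the measure of each given state."""
        for state in states:
            self._table[state] = self._estimate(state)

    def measure(self, state: State) -> float:
        """Online phase: table lookup, falling back to estimation on a miss."""
        if state not in self._table:
            # Assumption: results computed online are cached for later lookups.
            self._table[state] = self._estimate(state)
        return self._table[state]
```

When the state space is small, preprocess can be called on every state so that measure never triggers an estimation; otherwise only the states actually visited are evaluated.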

The state notion also encodes the privilege domains controlled by the attacker in each state. When estimating the security of each state, Seclius assumes that OS objects in the privilege domains that have not yet been compromised behave normally, i.e. they respect the system's dependency graph generated during the learning phase without any attack. However, Seclius pessimistically assumes that the objects in the compromised privilege domains contact (i.e. propagate the tainted data to) all the possible objects in that domain.
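The pessimistic assumption can be sketched as an override on top of the learned dependencies. The sketch below simplifies the DG to a dictionary of per‐edge propagation probabilities (the actual model uses CPTs), and the function and parameter names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source object, destination object)


def effective_edges(learned: Dict[Edge, float],
                    domains: Dict[str, Set[str]],
                    compromised: Iterable[str]) -> Dict[Edge, float]:
    """Return edge probabilities with pessimistic overrides for compromised domains.

    Edges outside compromised domains keep the probability learned during the
    attack-free learning phase; inside a compromised domain, every object is
    assumed to propagate tainted data to every other object with probability 1.
    """
    edges = dict(learned)
    for domain in compromised:
        members = domains[domain]
        for src in members:
            for dst in members:
                if src != dst:
                    edges[(src, dst)] = 1.0
    return edges


# Hypothetical example: one learned edge and one compromised domain "www".
learned_edges = {("P1", "F1"): 0.3}
privilege_domains = {"www": {"P1", "F1", "F2"}}
pessimistic = effective_edges(learned_edges, privilege_domains, ["www"])
```

With no compromised domain the function returns the learned edges unchanged, which corresponds to the normal‐behavior assumption for non‐compromised assets.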

The approach described above evaluates the security of an individual system state. However, the exact current security state of the system usually is not completely observable, due to IDS inaccuracies, i.e. false positive and false negative rates. We therefore define the notion of the information state of the system, which formally is a probability distribution over all states in the state space of the system S. Once the information state of the system has been estimated, Seclius computes the expected security measure of the information state, i.e. the security measures of the individual states weighted by their probabilities.
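The expected security measure of an information state is a probability‐weighted sum, as the following sketch shows. The state labels and the numeric values are made up for illustration, and how the information state itself is estimated from IDS alerts is outside the scope of the example.

```python
from typing import Dict


def expected_security(belief: Dict[str, float], measure: Dict[str, float]) -> float:
    """Expected security measure: sum over states s of belief[s] * measure[s]."""
    total = sum(belief.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("the information state must sum to 1")
    return sum(prob * measure[state] for state, prob in belief.items())


# Hypothetical information state over three system states.
belief = {"s0": 0.7, "s1": 0.2, "s2": 0.1}
# Hypothetical per-state security measures, e.g. taken from a lookup table.
measure = {"s0": 0.95, "s1": 0.6, "s2": 0.2}

expected = expected_security(belief, measure)  # 0.665 + 0.12 + 0.02, about 0.805
```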

7.4 Future Directions

In this section, we discuss current limitations of Seclius and potential solutions to address each of them.

First, as in any learning algorithm, it is not guaranteed that the learned DG actually captures every dependency. One trivial solution would be to make sure that the learning phase is long enough to capture all the dependencies. Alternatively, an active learning algorithm could be used. For instance, the configuration files could be parsed to extract potential dependencies, or a mechanism could make sure all the program paths are traversed. Replacing passive learning with an active algorithm would require application‐specific knowledge; however, it would help to accelerate the learning phase.

Second, the evaluated security value will be affected by the accuracy of the underlying intrusion‐detection solutions, e.g. if the intrusion detectors miss some malicious events. Our main contribution in this paper is in showing how to make use of the system dependency graph and the security requirements to evaluate the security of any given state; in other words, we do not claim to have come up with a new intrusion‐detection technique. However, our tool, which makes use of Seclius to evaluate system security, takes into consideration intrusion‐detection inaccuracies, i.e. false positive and false negative rates, if provided. Additionally, security evaluation by Seclius is done based on past consequences, which are easier to detect than exploitations. As a case in point, detecting that a web server is unavailable is usually simpler than determining the exploit that caused the server crash.

Additionally, because Seclius is an information flow‐based metric, when the system has not yet been attacked, Seclius usually evaluates the system security to be close to absolute, but not 100% secure. This is because even during the system's normal operational mode, information is often flowing from external end points, where attackers potentially reside, to critical assets. A possible solution to this problem would be to normalize the evaluated security measure based on the measure of the non‐compromised system.
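One possible reading of this normalization, offered only as an assumption since the text does not fix a formula, is to divide the measure of the current state by the measure of the non‐compromised system, so that the attack‐free state maps to 1. The function name and the clamping to 1 are illustrative choices.

```python
def normalized_security(measure: float, baseline: float) -> float:
    """Scale a state's security measure by the non-compromised baseline.

    `baseline` is the measure evaluated for the system before any attack;
    the result is clamped to 1.0 to absorb sampling noise.
    """
    if not 0.0 < baseline <= 1.0:
        raise ValueError("baseline must lie in (0, 1]")
    return min(1.0, measure / baseline)
```

For example, with a hypothetical baseline of 0.93, a state whose raw measure is also 0.93 would be reported as fully secure, and a state with raw measure 0.465 would be reported as roughly half as secure as the attack‐free system.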

7.5 Conclusions

We discussed the major challenges in provisioning scalable and efficient risk‐assessment and disaster‐recovery techniques within large‐scale cloud infrastructures. Additionally, we proposed Seclius, an online security‐evaluation framework that uses dependencies between OS‐level objects to measure the probability that critical assets have been directly or indirectly compromised. The different components of our framework address three important limitations faced by traditional security‐evaluation techniques. First, a consequence tree captures the subjective security requirements and minimizes administrator input. Second, Seclius processes IDS alerts online to measure actual attack consequences and does not rely on assumptions about attacker behaviors or system vulnerabilities. Third, a dependency graph is combined with a taint‐tracking method to probabilistically evaluate the system‐wide impact of locally detected intrusions as well as attacker privileges and security domains, without making assumptions about attack paths.


