Confidentiality of Data in the Cloud: Conflicts Between Security and Cost

Nathalie Baracaldo1 and Joseph Glider2

1 IBM Almaden Research Center, San Jose, CA, USA

2 SAP Labs, Palo Alto, CA, USA

3.1 Introduction

Data confidentiality has been and remains a large concern for online and especially cloud‐resident data. Information, once naturally protected by limited or no network connectivity outside of the information owner's domain, is now potentially vulnerable to theft or corruption resulting from any of a growing set of possible attacks. This chapter describes the trends of the last 20 years that have increased data‐confidentiality concerns, technologies that have been used to address these concerns, conflicts between those technologies and the cost‐reduction measures that cloud providers put in place, and some possible approaches to reconciling the confidentiality technologies with cost‐reducing features.

Section 3.2 of this chapter presents some background on cloud storage systems and reasons data‐confidentiality concerns have grown over the past 20 years. Then, Section 3.3 discusses concrete confidentiality issues and adversaries of cloud storage systems. Section 3.4 presents some common techniques used to protect confidentiality in current cloud storage systems, and Section 3.5 shows why these protection techniques often conflict with data‐reduction methods, resulting in an increase of costs. Then, Sections 3.6 and 3.7 provide an overview and comparison of potential solutions and develop in more detail one such possible solution. Finally, Section 3.8 looks at future directions for cloud storage confidentiality.

3.2 Background

To understand the new confidentiality issues that arise when outsourcing data to cloud storage providers, we first overview the history of how they came to be. As recently as the year 2000, most access to data was physically restricted. Personal data was often on paper or perhaps on home computers that had limited access to public networks such as the Internet. Cellular phones contained no or very little personal data, perhaps limited to a set of phone numbers of contacts. Enterprise and government data was generally restricted to be accessed within the logical confines of the entity that owned the data, with only carefully controlled exceptions such as backups stored offsite, or information shared via private network connections with business or project partners.

In the early part of the 2000s, storage service providers (SSPs), offering storage capacity subscription services for enterprise, began to emerge. However, they ran into major obstacles balancing cost, performance, and security. For cost reasons, SSPs preferred to have multiple or many customers running on the same storage systems; but customers, for performance and security reasons, preferred to have their data isolated from others' in separate storage systems or different partitions within the same storage system. Ultimately, no satisfactory business model was found, and SSPs didn't find a successful path to profitability.

Some companies such as IBM and HP became managed storage providers (MSPs) for enterprise clients, managing storage located either in customer premises or sometimes in an MSP data center. The business model for this service was based on the MSP providing management and maintenance services less expensively than the client could, and this storage services model has had success in the market. However, the basic premise of this service has been that the set of information technology (IT) equipment used by a client is entirely owned by the client, and therefore as long as networks connecting the MSP IT equipment and the client's IT equipment are kept secure, confidentiality and privacy of the client data is as assured as with an on‐premises private data center.

Starting in the mid‐to‐late 2000s, a number of trends have dramatically changed the landscape of data privacy. The Internet became widely available and then almost ubiquitous. Mobility of data, with smartphones and tablets and more recently with many other types of electronic devices, has caused more and more personal data, some of it sensitive, to be network accessible. Most pronounced, the advent of cloud services—the new generation of application service providers (ASPs) and SSPs—has attracted a large amount of personal and business data such as cloud backups, music, and photo archives coming from mobile devices, peer‐to‐peer file sharing, e‐mail, and social networks. Consumers and enterprise clients alike have found that cloud storage systems are a good choice for archiving and backups as well as primary storage. Cloud storage providers such as Amazon Web Services (AWS), Microsoft Azure, Dropbox, and Google Drive simplify data management for these tenants by offering an online service that abstracts and simplifies storage system configuration. Using these environments is especially beneficial for businesses that are looking to reduce their costs, want to deploy new applications rapidly, or do not want to maintain their own computational infrastructure.

In addition, the attraction of having data online and available to customers or the public has been quickly recognized by government organizations as well as businesses. As a result, a much larger volume of data kept by small and large organizations has increasingly become potentially exposed, and there has been an explosion of data breaches, as documented by yearly updates of the Verizon Data Breach Investigation Report (Verizon 2018), resulting in the exposure of confidential data (PCWorld 2010; GigaOm 2012), temporary or permanent loss of availability (Metz 2009), or data corruption. There have been cases where client data was exposed to and leaked by cloud provider employees who had physical access to the storage medium, and also where cloud tenants gained access to other tenants' data after having been assigned physical storage resources previously assigned to another tenant (e.g. after that other tenant had canceled its cloud storage subscription).

While consumers might have been (and some or many still are) unwary about the issues associated with exposing their private data, enterprises and government organizations have generally been more aware and more cautious about allowing their data to be cloud resident, rightly worrying that they might lose control of the privacy of their data when they trust it to be stored in the Cloud. Personal health‐related data is one such class of data. Healthcare organizations have increasingly made patients' health‐care records available online, and patients find great value in being able to access their own records; but keeping that data private has been a major concern, leading to government regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Regulations in other industries, including the Gramm‐Leach‐Bliley Act (GLBA), the Payment Card Industry Data Security Standard (PCI DSS), and the European Union’s General Data Protection Regulation (GDPR) have had a large impact on how that data is stored.

Given the continuing pattern of security incidents and breaches, some organizations have tended to use cloud computing infrastructure only for data and projects that are considered nonsensitive (Chow et al. 2009, pp. 85–90). Security of data stored in cloud storage systems needs to be improved, to reduce reluctance and thereby increase cloud storage adoption. Therefore, one of the largest concerns of cloud storage providers is finding a way to assure potential clients that their data will, under the care of the cloud provider, be accessible only to entities authorized by the client.

Improving the privacy of data in cloud storage environments often comes at higher cost. For instance, the capacity of cloud providers to achieve economies of scale often comes from sharing infrastructure among multiple tenants. To completely avoid threats related to multitenancy, special security provisions need to be in place. In the extreme case, the infrastructure is not shared (e.g. private cloud), while in other cases cloud data centers are strictly divided by tenant or data is encrypted by the tenant. In all these cases, the need for confidentiality increases the cost for both tenants and cloud providers. Costs may also increase because cloud administrators are required by tenants to perform all types of maintenance tasks, such as backing up data, yet they are not fully trusted and therefore must not be able to access the data in plaintext.

Clearly, there is a tension between providing better privacy and maintaining reasonable cost levels. Achieving low cost is a primary requirement for cloud storage systems: cloud providers' ability to compete and even their survival depends on being able to offer storage services at the lowest cost. There have been several price wars in the years since cloud storage started to gain traction, and cost will continue to be a primary consideration for customers.

Although maintaining privacy while managing costs is not an easy task, we believe that it is possible to find a way to achieve both. We highlight the challenges related to reconciling confidentiality and cost caused by outsourcing data storage to the Cloud, and study in detail the challenges and possible solutions for using data‐reduction techniques such as compression and deduplication (a technique that, when multiple files contain some or all of the same content, stores only one copy of the duplicate content) to reduce the amount of required disk capacity, while maintaining confidentiality of data encrypted by tenants.

3.3 Confidentiality: Threats and Adversaries

Data confidentiality is one of the primary security requirements for cloud storage systems. From the time customers first store their data and even past the time they stop using the cloud provider, customers want assurances that other tenants, cloud provider administrators, and Internet users at large will not have unauthorized access to their data:

  • Confidentiality goals: Tenants require that data stored in the Cloud should be accessible only by authorized entities (the data owner and delegates), both at rest, when data is stored in persistent media, and in flight, when it is transmitted to/from the cloud provider's storage system. Additionally, once a tenant decides to cancel its subscription to a storage provider (called tenant offboarding), its information should remain inaccessible. We call this property secure offboarding. Finally, tenants also require that data deleted upon tenant request should not be accessible by any entity.

  • Adversaries: Providing confidentiality is not an easy task in cloud storage systems. In this environment, multiple adversaries may try to gain unauthorized access to data stored in a cloud provider infrastructure. These include curious cloud administrators, malicious tenants, law enforcement, and external adversaries that can monitor, probe, and try to infiltrate the system:

    • Curious cloud administrators are particularly dangerous because they have legitimate access to software and physical infrastructure of the system to troubleshoot and maintain the infrastructure. However, they are not trusted to read confidential information stored by tenants. Curious cloud administrators may use their privileges to obtain access to confidential information. Additionally, physical access to the storage media may allow them to circumvent all the software‐based access‐control mechanisms and directly access tenant data written to disk.
    • In multitenant cloud storage environments, malicious tenants may be present. These adversaries try to obtain data from other tenants, taking advantage of their legitimate remote access to storage media. They may try to poke the system to retrieve data from their currently assigned disk space, hoping the physical space was previously assigned to other tenants and still contains private information.
    • For some tenants, one risk of storing information is the possibility of law enforcement examining their stored data. This adversary is personified by law‐enforcement agencies that use court orders to gain access to the content of all storage servers, media, key repositories, and other components of the cloud storage system. Some cloud tenants may consider them adversaries that may try to retrieve confidential data and metadata stored in the cloud environment, which may reveal confidential information and private usage patterns.
    • External adversaries have no legitimate access to the system and may try to remotely compromise it. These adversaries try to escalate their privileges to gain access to confidential information or compromise the integrity or availability of the storage system.

All these adversaries may try to compromise the confidentiality of data belonging to tenants. In the following section, we present existing mechanisms to prevent leakage of confidential information.

3.4 Achieving Data Confidentiality in Cloud Storage Systems

Tenant offboarding and disk‐space reassignment open windows of opportunity to compromise the confidentiality of tenants by curious cloud administrators, malicious tenants, and law enforcement, especially if the cloud provider does not follow appropriate procedures. Securely erasing data from a storage system minimizes this opportunity but relies on the capabilities and cooperation of the cloud provider. It is not enough to remove the pointers to the occupied space and mark it as free for usage, as is the general practice for most file systems; if this is the only procedure followed, the next user given access to the space, with sufficient technical skills, may be able to see the data written by the previous user of the physical space.
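The exposure described above can be illustrated with a minimal sketch. The `ToyPool` class, its 16‐byte block size, and the tenant names are hypothetical and chosen only for illustration; real storage systems track allocation very differently. The sketch assumes that freed space is handed out again first, and it shows that releasing a block by updating bookkeeping alone leaves the previous tenant's bytes readable by the next owner, whereas an explicit scrub does not.

```python
class ToyPool:
    """Toy block pool in which release() only updates bookkeeping by default."""

    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(n_blocks)]
        self.free = list(range(n_blocks))
        self.owner = {}  # block index -> tenant name

    def _check(self, tenant, idx):
        if self.owner.get(idx) != tenant:
            raise PermissionError(f"{tenant} does not own block {idx}")

    def allocate(self, tenant):
        idx = self.free.pop(0)
        self.owner[idx] = tenant
        return idx

    def write(self, tenant, idx, data):
        self._check(tenant, idx)
        self.blocks[idx] = data[: self.block_size].ljust(self.block_size, b"\0")

    def read(self, tenant, idx):
        self._check(tenant, idx)
        return self.blocks[idx]

    def release(self, idx, scrub=False):
        if scrub:
            # Physical overwrite: the costly step discussed in the text.
            self.blocks[idx] = bytes(self.block_size)
        del self.owner[idx]
        self.free.insert(0, idx)  # assumption: freed space is reused first


def read_after_reassignment(scrub):
    """Tenant A writes and offboards; tenant B then reads the reassigned block."""
    pool = ToyPool(n_blocks=4)
    idx = pool.allocate("tenant_a")
    pool.write("tenant_a", idx, b"acct=12345")
    pool.release(idx, scrub=scrub)
    new_idx = pool.allocate("tenant_b")
    return pool.read("tenant_b", new_idx)
```

Calling `read_after_reassignment(False)` returns tenant A's bytes to tenant B even though every access check passed, which is why access control alone does not address this threat.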

Instead, cloud providers today can use an expensive (in time required and use of system resources) process that physically overwrites, one or multiple times, the sectors holding the data (Joukov et al. 2006, pp. 61–66; Wei et al. 2011). However, not only is this type of procedure expensive, but it also may be difficult or impossible for the cloud provider to physically erase the data. For example, certain types of storage devices such as solid‐state drives (SSDs) cause writes to the same logical space to actually write the new blocks to new physical space and free for reuse the physical space where the previous version of the data was stored, defeating the efforts of the cloud provider to ensure that data is truly erased. Furthermore, placing special controls in SSDs to ensure that previously stored data is immediately physically erased would be prohibitive in terms of system lifetime: the SSD would wear out much faster, because Flash devices can only endure a certain number of write cycles. Any log‐structured storage system, where data written to the same logical location is never written directly over the old data (e.g. ZFS [zfs 2018], BTRFS [btr 2018]), will have similar concerns.
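The remapping behavior can be sketched with a toy model of a flash translation layer. The `ToyFTL` class below is hypothetical and heavily simplified (no garbage collection, no wear leveling, no real page geometry); it only illustrates why an overwrite issued against a logical address does not necessarily destroy the old physical copy.

```python
class ToyFTL:
    """Toy flash translation layer: every write lands on a fresh physical page."""

    def __init__(self, n_pages):
        self.pages = [None] * n_pages  # physical pages
        self.mapping = {}              # logical block address -> physical page
        self.next_free = 0

    def write(self, lba, data):
        if self.next_free >= len(self.pages):
            raise RuntimeError("no free pages (garbage collection is not modeled)")
        self.pages[self.next_free] = data
        self.mapping[lba] = self.next_free  # the old page is only unmapped
        self.next_free += 1

    def read(self, lba):
        return self.pages[self.mapping[lba]]


def overwrite_demo():
    """Write a secret, 'erase' it by overwriting, and inspect both views."""
    ssd = ToyFTL(n_pages=8)
    ssd.write(0, b"secret")
    ssd.write(0, b"\0" * 6)  # the host overwrites logical block 0 with zeroes
    return ssd.read(0), ssd.pages
```

In this model the logical view returns zeroes after the overwrite, while the list of physical pages still contains `b"secret"` until the device itself reclaims that page.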

Because secure erasure by a cloud provider is costly in time and resources and in some cases practically impossible, cryptographic approaches are preferred.

3.4.1 Cryptographic Solutions

Encrypting data is becoming a popular approach for tenants that want to use cloud storage infrastructure without compromising the confidentiality or privacy of their data and to avoid possible jurisdiction problems (Microsoft 2017). There are two approaches: stage encryption and end‐to‐end encryption.

Stage Encryption

Multiple systems opt to encrypt data before sending it to the Cloud using transport encryption (e.g. Transport Layer Security [TLS]), to achieve in‐flight confidentiality, and to delegate the responsibility of data‐at‐rest encryption (a.k.a. server‐side encryption) to the cloud provider. These schemes are easy to implement because popular protocols like Secure Sockets Layer (SSL) and TLS can be used to protect data during its transmission. Once the encrypted data arrives at the cloud provider, it is decrypted; it is again encrypted before storing it to disk. This latter encryption may be performed by self‐encrypting drives or by performing encryption at higher virtual data object levels such as by volume, directory, or file. In this way, if an adversary gains access to the physical storage medium, they cannot obtain any information. Examples of cloud storage systems that use this type of methodology are IBM Cloud Object Storage (Cleversafe 2018) and HGST Active Archive System (Amplidata 2018), which can receive transport‐encrypted data and in turn store objects encrypted.

This scheme is useful when a component in the cloud provider infrastructure needs to access plaintext data sent by a tenant (e.g. for antivirus checking). In addition, stage encryption thwarts malicious tenants that have been reassigned disk space previously used by a targeted tenant. However, this method is not capable of stopping attackers such as curious cloud administrators and law‐enforcement adversaries that can gain access to the data while it is in plaintext or possibly gain access to the keys with which the data at rest was encrypted. Additionally, because cloud providers may have access to the encryption keys used to store the data, the process of offboarding does not guarantee that a tenant's data is no longer available. As long as the provider has the keys, data may still be accessible to both curious cloud administrators and law‐enforcement adversaries. Finally, tenants need to accept the encryption mechanisms of the cloud storage systems, which may not be as secure as needed. For example, in 2011, Dropbox was accused of using a single encryption key to encrypt all files in the system. This effectively meant that all registered users were sharing the same encryption key to encrypt their information, resulting in a large risk related to an attacker gaining access to the single key.

End‐to‐End Encryption

This scheme is often used today by tenants storing data that is required to be handled so as to meet privacy and confidentiality compliance regulations, to ensure not only that data is kept confidential, but also that keys are properly controlled and auditable. To achieve this objective, the tenants themselves must be able to control and safeguard encryption keys for the data at all times.

For this purpose, data is encrypted before it is sent to the cloud storage provider, achieving both in‐flight and at‐rest confidentiality. The advantage of this scheme is that no cooperation from the cloud provider is required. Provided that data is encrypted with a semantically secure cryptosystem and that each tenant is the only entity with access to its encryption keys, neither malicious tenants, curious administrators, nor law enforcement adversaries (unless the law‐enforcement adversary gains access to the client's keys) can compromise the confidentiality of the information stored in the Cloud. This is true because all data stored in the Cloud is encrypted, and the only way to obtain its plaintext is by gaining access to the encryption keys that are in possession of the tenant. Furthermore, this mechanism ensures a smooth offboarding process, because only data encrypted with tenant keys is left with the cloud provider.

Additionally, end‐to‐end encryption allows key management to be easier and more effective for tenants. To remove all traces of information from storage media even in the presence of law‐enforcement court orders to obtain cryptographic key material, tenants that use end‐to‐end encryption can securely erase the relevant keys whenever they want, which renders the data encrypted under those keys unrecoverable. In contrast, when stage encryption is used, tenants often cannot control when encryption keys are deleted. Interested readers may refer to (Boneh and Lipton 1996, pp. 91–96; Di Crescenzo et al. 1999, pp. 500–509; Perlman 2005; Mitra and Winslett 2006, pp. 67–72; Tang et al. 2010; Cachin et al. 2013), where multiple cryptographic key management schemes to achieve secure data deletion are presented.
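A minimal sketch of deletion by key destruction follows. Everything here is an illustrative assumption: `TenantKeyStore`, `upload`, and `download` are hypothetical names, the provider is modeled as a plain dictionary, and the cipher is a toy XOR with a SHAKE‐256 keystream chosen only to stay within the Python standard library. It is not a vetted cipher and provides no integrity protection; a real deployment would use an authenticated cipher from a reviewed library.

```python
import hashlib
import secrets


def toy_crypt(key, nonce, data):
    """Toy stream cipher: XOR with a SHAKE-256 keystream (illustration only)."""
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class TenantKeyStore:
    """Hypothetical key repository kept on the tenant's premises."""

    def __init__(self):
        self._keys = {}

    def create(self, key_id):
        self._keys[key_id] = secrets.token_bytes(32)

    def get(self, key_id):
        return self._keys[key_id]  # raises KeyError once the key is destroyed

    def destroy(self, key_id):
        del self._keys[key_id]


def upload(cloud, keys, key_id, name, plaintext):
    """Encrypt on the tenant side; the provider only ever sees ciphertext."""
    nonce = secrets.token_bytes(16)
    cloud[name] = (nonce, toy_crypt(keys.get(key_id), nonce, plaintext))


def download(cloud, keys, key_id, name):
    nonce, ciphertext = cloud[name]
    return toy_crypt(keys.get(key_id), nonce, ciphertext)
```

After `keys.destroy(key_id)`, the ciphertext may still sit on the provider's media, but `download` can no longer recover the plaintext. This is the property that makes offboarding independent of the provider's erasure procedures.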

Tenants' business requirements may also result in the use of multiple encryption keys, and end‐to‐end encryption facilitates this type of management. Tenants often have multiple users that do not necessarily trust each other. For instance, a CEO may not want to share information with all employees of her company. This requires use of multiple keys to encrypt data according to tenants' business requirements, ensuring that only authorized users can obtain data in plaintext. Stage encryption limits tenants' ability to control isolation of their data. For example, it can be limiting for Amazon server‐side encryption tenants to assign an encryption key per container to achieve isolation between different users. In comparison, end‐to‐end encryption allows users to assign encryption keys flexibly to fit whatever access control policy is required, like the ones presented in (Sahai and Waters 2005; Goyal et al. 2006, pp. 89–98; Bethencourt et al. 2007).

Unfortunately, end‐to‐end encryption introduces several drawbacks and costs. Tenants need to maintain servers and software infrastructure to assign, protect, and maintain encryption keys. This means tenants need to be in charge of maintaining a key repository, making sure it runs in a nonvulnerable server, and ensuring that only authorized users can access it. Additionally, tenants need to ensure that their key repository can scale and does not limit the throughput of the system. To achieve these objectives, a skilled security administrator needs to be hired, and servers need to be provisioned, which clearly increases tenants' costs. End‐to‐end encryption also inhibits the cloud provider from performing any useful computation over the data. For example, searching for a particular record in a dataset cannot be performed by the cloud provider, nor can analytics be run, nor can algorithms such as compression be employed to decrease subscription costs by reducing the amount of space required to store the data. To alleviate these problems, a few research efforts are being conducted: to perform searchable encryption (Boneh et al. 2004, pp. 506–522) and to improve the efficiency of homomorphic encryption, which is a methodology that allows computation of certain operations over encrypted data without revealing the encryption keys to untrusted parties (Gentry 2009). Unfortunately, these methods are currently too slow to be used in real‐world scenarios.

End‐to‐end encryption is a popular choice to protect data confidentiality because it does not require the cooperation of cloud providers. In the following section, we overview the impact of using this technique on the amount of storage capacity required and later present how cloud storage systems can recapture the cost savings.

3.5 Reducing Cloud Storage System Costs through Data‐Reduction Techniques

Cloud storage providers offer a variety of storage services to their clients to allow for different levels of performance (e.g. solid‐state drive [SSD] vs. hard disk drive [HDD]); provide reliability (e.g. across availability zones or regions, different levels of Redundant Array of Independent Disks [RAID] or erasure code); and offer different ways of accessing the data, such as object protocol (e.g. Amazon S3), file protocol (e.g. Google Drive), and block protocol (e.g. Amazon Elastic Block Store [EBS]). Cloud storage tenants can choose among these options to match the level and type of service required for their applications.

Regardless of the differences between the many types of storage and storage services deployed in the Cloud, their cloud presence implies a common set of requirements and concerns.

To lower costs, storage systems are increasingly including data‐reduction capabilities in order to decrease as much as possible the amount of storage capacity needed to store tenants' data (EMC 2018; IBM 2018a). Using these techniques often leads to a reduction of 20–70% of disk‐capacity utilization (Constantinescu et al. 2011, pp. 393–402) and lowers costs due not only to lower capital costs but also to lower operating costs related to data‐center real estate, power consumption, and cooling (Russell 2010).

Tenants are also increasingly using encryption as a primary tool to build confidentiality assurance. Data is more and more being encrypted within customer premises (and keys are likewise being stored in customer premises) before being uploaded to cloud storage. In another use case, customers with compute and storage at a cloud provider may encrypt the data in the virtual machine (VM) or application server at the cloud provider before storing the data (e.g. Cryptfs [Zadok et al. 1998], IBM Spectrum Scale [IBM 2018b]). In either case, the tenant is assured that the data is only seen in plaintext in the tenant's server or virtual machine, and further that the tenant controls the keys to the data, such that the data will be unreadable if the physical medium is stolen or allocated to another tenant during or after the time the tenant is an active user of the cloud provider.

Unfortunately, data‐reduction techniques lose some or all of their effectiveness when operating on encrypted data. Impacted techniques include thin provisioning, zero‐elimination, compression, and deduplication.

Thin provisioning refers to the ability of a storage system to store only the actual data that is written. As an example, this technique is used by file systems to implement sparse files: an extremely large file is allocated but only small sections of it are written, in which case the storage system allocates only a small amount of space. In this case, a problem emerges when an application attempts to read from an area of a sparse file that wasn't written. The expected storage system response is to return zeroes; however, when data is encrypted, the storage system cannot return the encrypted version of zeroes. It doesn't know the data is encrypted, and even if it did, it would have no knowledge of encryption keys and other information needed to encrypt and subsequently return encrypted zeroes. This results in difficulties leveraging thin provisioning.
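The sparse‐read problem can be sketched as follows. The `ThinVolume` class, the 32‐byte block size, and the toy cipher (XOR with a SHAKE‐256 keystream bound to the block index) are all illustrative assumptions and not a real storage or encryption API. The sketch shows a thin volume answering a read of an unwritten region with zeroes, and the encrypting client then "decrypting" those zeroes into meaningless bytes instead of the zeroes the application expects.

```python
import hashlib

BLOCK = 32  # toy block size in bytes


class ThinVolume:
    """Toy thin-provisioned volume: only blocks that were written use space."""

    def __init__(self):
        self.blocks = {}

    def write(self, idx, data):
        self.blocks[idx] = data

    def read(self, idx):
        # Unwritten regions (holes) are returned as zeroes.
        return self.blocks.get(idx, bytes(BLOCK))

    def allocated_bytes(self):
        return len(self.blocks) * BLOCK


def toy_crypt(key, idx, data):
    """Toy cipher: XOR with a keystream derived from the key and block index."""
    stream = hashlib.shake_256(key + idx.to_bytes(8, "big")).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def read_hole_through_encryption(key, hole_idx=1_000_000):
    """Write one encrypted block, then read a far-away hole via the client."""
    vol = ThinVolume()
    vol.write(0, toy_crypt(key, 0, b"A" * BLOCK))
    raw = vol.read(hole_idx)                   # what the storage system returns
    seen_by_app = toy_crypt(key, hole_idx, raw)  # what the client decrypts
    return vol.allocated_bytes(), raw, seen_by_app
```

The volume allocates a single block, as thin provisioning intends, yet the application sees keystream bytes rather than zeroes for the hole. Fixing this requires the encryption layer to track which regions were never written.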

Likewise, many storage systems have the ability to detect large runs of written zeroes or other fill characters and choose to store them in a way that dramatically reduces the capacity needed. However, encrypted fill patterns are not detectable as a fill pattern, and therefore the storage system cannot optimize the capacity needed to store the data blocks containing fill pattern.

Compression is another technology that is impacted. Compression is effective on patterns discovered in streams of data, but encrypted data has very high entropy and therefore compression is not able to effectively reduce capacity required to store the data.
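A quick experiment illustrates both the fill‐pattern and the compression effects. The sketch below uses `zlib` from the Python standard library and, as an assumption, uses `os.urandom` output as a stand‐in for the ciphertext of a semantically secure cipher, since such ciphertext is designed to be indistinguishable from random bytes. The sample inputs are made up for illustration.

```python
import os
import zlib


def compressed_ratio(data):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


SIZE = 1 << 16  # 64 KiB per sample

zero_fill = bytes(SIZE)                                       # run of written zeroes
repetitive_text = (b"confidentiality versus cost\n" * 4000)[:SIZE]
ciphertext_like = os.urandom(SIZE)                            # stand-in for encrypted data

ratios = {
    "zero fill": compressed_ratio(zero_fill),
    "repetitive text": compressed_ratio(repetitive_text),
    "ciphertext-like": compressed_ratio(ciphertext_like),
}
```

The first two ratios come out as a small fraction of the original size, while the ciphertext‐like sample does not shrink at all; the same zero fill, once encrypted, would behave like the third sample.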

Finally, deduplication is not effective when applied over encrypted data. The term deduplication refers to the capability of detecting the same or similar content (often blocks of 4 KB or more) and storing only one copy and perhaps differences relative to that copy. However, semantically secure encryption ensures that encrypting the same data multiple times results in different ciphertexts (initialization vectors are used to ensure that the same plaintext results in different ciphertexts to avoid inference attacks). Therefore, storage systems cannot determine that the same information is being stored.
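The effect on deduplication can be sketched with a fingerprint‐indexed block store. `DedupStore` and `randomized_encrypt` are hypothetical names; the store handles exact duplicates only (no similarity detection), and the cipher is again a toy XOR with a SHAKE‐256 keystream under a fresh random nonce, used here only to model the "same plaintext, different ciphertext" property.

```python
import hashlib
import secrets


class DedupStore:
    """Toy exact-match deduplicating store indexed by SHA-256 fingerprint."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> block contents

    def put(self, block):
        fingerprint = hashlib.sha256(block).digest()
        self.chunks.setdefault(fingerprint, block)  # keep only the first copy
        return fingerprint

    def unique_blocks(self):
        return len(self.chunks)


def randomized_encrypt(key, block):
    """Toy randomized encryption: a fresh nonce makes every ciphertext differ."""
    nonce = secrets.token_bytes(16)
    stream = hashlib.shake_256(key + nonce).digest(len(block))
    return nonce + bytes(a ^ b for a, b in zip(block, stream))


def count_unique(blocks, key=None):
    """Store blocks as plaintext (key=None) or encrypted; return unique count."""
    store = DedupStore()
    for block in blocks:
        store.put(block if key is None else randomized_encrypt(key, block))
    return store.unique_blocks()
```

For a workload of ten identical 4 KB blocks plus one different block, the plaintext path stores two blocks, while the encrypted path stores all eleven.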

3.6 Reconciling Data Reduction and Confidentiality

We now present several existing methods to allow end‐to‐end encryption to be employed while preserving the benefits achieved from data reduction. All of them have inherent trade‐offs. We classify them in two main categories depending on where data‐reduction techniques are performed.

There are two ways to arrange a data storage stack such that data can be encrypted close to the source while also being stored encrypted with tenant‐controlled keys, in a compressed and deduplicated form. The first method (called client data reduction [CDR]) is to compress and deduplicate the data before the data is encrypted at the tenant side in the application server. The second approach is to encrypt data in the application server and leverage trusted execution technology downstream in the cloud storage to apply compression and deduplication without leaking confidential data. We call this second approach Trusted Decrypter (TD).
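The ordering that CDR relies on can be sketched as follows. The function names are hypothetical, only compression is shown (deduplication is omitted), and the cipher is a toy XOR with a SHAKE‐256 keystream that stands in for real tenant‐side encryption. The sketch contrasts reducing data before encryption with the reverse order, in which a downstream component tries to compress ciphertext without any trusted decryption step.

```python
import hashlib
import secrets
import zlib


def toy_crypt(key, nonce, data):
    """Toy stream cipher: XOR with a SHAKE-256 keystream (illustration only)."""
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def cdr_upload(key, plaintext):
    """Client data reduction: compress first, then encrypt with the tenant key."""
    nonce = secrets.token_bytes(16)
    return nonce + toy_crypt(key, nonce, zlib.compress(plaintext))


def cdr_download(key, stored):
    """Reverse the pipeline: decrypt with the tenant key, then decompress."""
    nonce, body = stored[:16], stored[16:]
    return zlib.decompress(toy_crypt(key, nonce, body))


def encrypt_then_compress(key, plaintext):
    """Reverse order for comparison: storage-side compression of ciphertext."""
    nonce = secrets.token_bytes(16)
    return zlib.compress(nonce + toy_crypt(key, nonce, plaintext))
```

With a repetitive input such as `b"log line 42\n" * 5000`, `cdr_upload` produces an object far smaller than the input, whereas `encrypt_then_compress` yields no reduction. A TD‐based design aims to recover the first outcome on the storage side instead of on the application server.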

Both of these solutions are potentially applicable, but they have trade‐offs at three levels: hardware resource and performance, application storage software, and security.

At the hardware resource level, the CPU cycles and memory to perform deduplication and compression using CDR may be costly, especially as many servers would need to be provisioned to accommodate the extra resource requirement. On the other hand, a shared resource such as a TD‐enabled storage system can be a more efficient place to provision the hardware resources needed to perform compression and deduplication, as long as latency and throughput concerns are addressed and the resulting storage system is still cost‐effective.

At the application software level, supporting CDR compression and deduplication in many environments requires extra software being installed, configured, and maintained across many servers to enable storage‐efficiency functions to be applied before data is encrypted. In addition, many tenants would find it difficult to understand how to make purchasing and provisioning decisions about CDR‐enabled application servers to accommodate the heavy resource requirements of data‐reduction functions. Having instead a shared embedded compression/deduplication function available in a TD‐enabled storage system can provide benefit by taking that administration burden away from tenants.

Concerning security, providing upstream data‐reduction functions in CDR‐enabled application servers ensures that the security provided by encryption in the application server is intact. In contrast, the TD downstream decryption has the potential to introduce security exposures that limit the use cases for such a method. We note, however, that there are side‐channel attacks (described later) that can be exploited in CDR‐enabled systems but not in TD‐enabled systems.

Because of these trade‐offs, neither method is the single preferred option, and a choice of solution will be made based on specifics about operations within individual cloud providers and the service levels they want to provide. In the following section, we overview existing techniques in each area and contrast their security.

3.6.1 Existing Techniques

Compression and deduplication are two of the most popular techniques used to reduce the amount of data stored in a system (Constantinescu et al. 2011, pp. 393–402). Each of these operations has a different level of security challenges. Compression algorithms typically do not require data sharing, and it is possible to achieve good compression ratios by using file‐based compression. In contrast, deduplication may use information already stored in the system, which results in an implicit sharing of data among entities storing their information in the same deduplication‐enabled system. For this reason, deduplication results in more security concerns than compression.

From the security perspective, performing deduplication at the client side creates problems. In particular, adversaries authorized to send write requests to the storage system may compromise the confidentiality of other tenants by using the deduplication system as an oracle to determine whether a particular piece of data is already stored in the system and later gain illegitimate access to previously stored data (Harnik et al. 2010). To perform this attack, the adversary sends a claim to the storage system stating that they want to store a file or block with a given hash value. In client‐side deduplication, when the hash matches an already‐stored block or file, the data is never sent to the storage server. The deduplication system receives a hash claim and proceeds to add the adversary as an entity accessing the deduplicated data. Later, the adversary can request the deduplicated data. Therefore, by claiming ownership of the file, an adversary can obtain access to previously stored data that matches a claimed hash value.
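The attack can be sketched against a deliberately naive server. `NaiveDedupServer` and its methods are hypothetical and do not model any real service's protocol; the sketch assumes whole‐file deduplication keyed by a SHA‐256 hash and assumes that the adversary has learned or guessed the hash of the victim's file.

```python
import hashlib


class NaiveDedupServer:
    """Toy client-side deduplication server that trusts hash claims."""

    def __init__(self):
        self.files = {}   # hex hash -> file contents
        self.owners = {}  # hex hash -> set of user names

    def upload(self, user, content):
        """Full upload, used when the hash is not yet known to the server."""
        file_hash = hashlib.sha256(content).hexdigest()
        self.files[file_hash] = content
        self.owners.setdefault(file_hash, set()).add(user)
        return file_hash

    def claim(self, user, file_hash):
        """Client says 'I have this file'; no data is sent if the hash is known."""
        if file_hash in self.files:
            self.owners[file_hash].add(user)  # the flaw: hash alone grants access
            return True
        return False

    def download(self, user, file_hash):
        if user not in self.owners.get(file_hash, set()):
            raise PermissionError("not an owner of this file")
        return self.files[file_hash]
```

The return value of `claim` is already an oracle for "is this content stored?", and a successful claim followed by `download` hands the adversary the victim's file without the adversary ever having possessed it.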

Using the deduplication system as an oracle may have negative and tangible consequences in real cloud storage systems. Dropbox uses a file‐based deduplication system that stores a file only if it has not been previously stored by any other user in the system. In this scenario, by claiming a hash value, an adversary may be able to obtain access to previously stored information, which clearly violates the confidentiality of other users storing files in the system.

To avoid this pitfall, Halevi et al. proposed a methodology called proofs of ownership (PoWs) (Halevi et al. 2011, pp. 491–500) to efficiently determine whether a user has a particular piece of data without having the user send a complete copy of the data to the server. In this approach, the user claims to have a file and sends its hash. Then, the deduplication system challenges the user to verify that it has the file the user claims to have. Different approaches to solve this challenge have been proposed in the literature (Halevi et al. 2011, pp. 491–500; Di Pietro and Sorniotti 2013). In general, the client needs to compute multiple hash values of subcontents of the claimed data. An adversary that does not have the claimed data cannot answer the challenge correctly, and hence the attack is thwarted.
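The challenge and response exchange can be sketched as follows. This is a deliberately simplified illustration and not the construction of Halevi et al.: the server is assumed to keep a readable copy of the file, the block size and number of challenged blocks are arbitrary, and all function names are hypothetical.

```python
import hashlib
import secrets

BLOCK = 64  # bytes per challenged block (illustrative value)


def blocks(data):
    """Split data into fixed-size blocks."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)] or [b""]


def make_challenge(stored, count=4):
    """Server side: pick a fresh nonce and random block positions."""
    nonce = secrets.token_bytes(16)
    indices = [secrets.randbelow(len(blocks(stored))) for _ in range(count)]
    return nonce, indices


def respond(data, nonce, indices):
    """Client side: hash the nonce together with the requested blocks."""
    parts = blocks(data)
    h = hashlib.sha256(nonce)
    for i in indices:
        h.update(parts[i] if i < len(parts) else b"")
    return h.hexdigest()


def verify(stored, nonce, indices, response):
    """Server side: recompute the expected answer from the stored copy."""
    return secrets.compare_digest(respond(stored, nonce, indices), response)


stored = bytes(range(256)) * 4                        # the server's copy of the file
nonce, indices = make_challenge(stored)
honest = respond(stored, nonce, indices)              # client that holds the file
forged = respond(bytes(len(stored)), nonce, indices)  # adversary guessing the content
```

Because the nonce and block positions change with every challenge, knowing only the file hash does not help the adversary compute the response.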

By using proofs of ownership, tenants can have some assurance that the deduplication table cannot be used as an oracle to leak their stored information. However, this scheme still leaks some information about the data stored. It has been shown that monitoring the amount of bandwidth required to upload data to a cloud storage server can be used to detect whether the data being uploaded was previously stored. This side‐channel attack arises because when deduplication is possible, the data is not transmitted, whereas when the write request is the first to store a piece of data, all data needs to be transmitted, thus increasing both the observed traffic and the time it takes to complete the operation.

Other solutions use convergent encryption to perform deduplication (Bellare et al. 2012). Convergent encryption is currently used by multiple cloud storage services such as Freenet (Project 2018). Convergent encryption is a cryptographic primitive in which any entity encrypting the same data will use the output of a deterministic function of the plaintext as the key. In this way, identical plaintext values will encrypt to identical ciphertexts, regardless of who encrypts them. Convergent encryption offers a weaker notion of security for encrypted data than the one provided by conventional symmetric cryptosystems (Bellare et al. 2012). In particular, it is susceptible to offline brute‐force dictionary attacks, and it is not secure unless encrypted data is unpredictable in nature. Because common data has predictable patterns, like headers, convergent encryption is not ideal for protecting against leakage of confidential data. To mitigate such leakage problems, multiple variations to pure convergent‐based deduplication have been proposed.
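Both the deduplication benefit and the dictionary weakness can be seen in a few lines. The sketch below assumes SHA-256 of the plaintext as the key derivation function and uses a toy XOR keystream cipher in place of a real symmetric cipher; the function names and the sample message are hypothetical.

```python
import hashlib


def toy_encrypt(key, data):
    """XOR with a SHA-256 counter keystream. Toy cipher: applying it twice decrypts."""
    stream = b"".join(hashlib.sha256(key + i.to_bytes(8, "big")).digest()
                      for i in range(len(data) // 32 + 1))
    return bytes(a ^ b for a, b in zip(data, stream))


def convergent_encrypt(plaintext):
    """Derive the key from the plaintext itself, then encrypt deterministically."""
    key = hashlib.sha256(plaintext).digest()
    return key, toy_encrypt(key, plaintext)


def dictionary_attack(ciphertext, candidates):
    """Offline attack: encrypt each guess and compare with the target ciphertext."""
    for guess in candidates:
        if convergent_encrypt(guess)[1] == ciphertext:
            return guess
    return None


salary_letter = b"Your salary is 85000"
key_a, ct_a = convergent_encrypt(salary_letter)   # tenant A
key_b, ct_b = convergent_encrypt(salary_letter)   # tenant B, independently

# An attacker who knows the message template can enumerate plausible values.
candidates = [b"Your salary is %d" % s for s in range(80000, 90001, 1000)]
recovered = dictionary_attack(ct_a, candidates)
```

The two ciphertexts are identical, which is what makes deduplication possible, and the same determinism lets the attacker confirm a guess without ever holding a key.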

A methodology that uses convergent encryption and a threshold cryptosystem was presented in (Stanek et al. 2013). Popular data is assumed not to be confidential, and uncommon data confidential. When data is popular, it is only encrypted using convergent encryption, whereas unpopular data is additionally encrypted with threshold encryption. Once unpopular data becomes popular, the outer encryption layer is removed. This approach is not appropriate when popular data is confidential. A scheme to perform deduplication at the client side was presented in (Rashid et al. 2012, pp. 81–87), where deduplication units are encrypted before being sent to the cloud provider for storage. This work assumes that the deduplication information is stored in the client side and that all data in an organization is encrypted with the same key. This is not appropriate, because it would result in all the individuals in an organization having access to all information in the tenant side, violating the least‐privilege principle; compromise of the encryption key would result in the leakage of all stored information.

DupLESS (Bellare et al. 2013) is another scheme related to convergent encryption. Rather than using the hashes of the deduplicated data as keys, it makes use of an external key server (KS) to help generate keys. Each tenant is assigned a secret. Similarly, the KS holds its own secret. When a tenant needs to store data, it contacts the KS to generate the key for uploading the data. An oblivious pseudo‐random function (PRF) protocol (Naor and Reingold 2004, pp. 231–262) is used between the KS and tenants. This protocol ensures that the KS can cryptographically mix both secrets to compute the deduplication key without learning anything about the data tenants want to upload or the generated keys, while tenants learn nothing about the KS's secret. As long as the KS is not compromised, DupLESS provides more security than standard convergent‐based deduplication systems. When the KS is compromised, DupLESS is equivalent to standard convergent‐based deduplication systems. One drawback of this scheme is that it requires a trusted third party to maintain the keys, which increases the cost. Additionally, the KS's secret is identical for all tenants, which leaves the scheme open to collusion between the KS and other tenants. To alleviate this problem, a similar approach that uses multiple trusted parties to compute the key was presented in (Duan 2013). This system relies on peer‐to‐peer (P2P) networks to maintain a shared secret. Thus, to obtain the shared secret, at least a threshold number of collaborators must be compromised. One problem with this approach is that it is not clear what the incentives are for P2P nodes to provide this service, for them to act honestly, and for tenants to trust them.
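One well‐known way to obtain an oblivious PRF is through blind RSA signatures, and the sketch below uses that idea to illustrate server‐aided key generation. It is an illustration under strong assumptions rather than the DupLESS implementation: the RSA modulus is built from two small Mersenne primes and is insecure, only the KS secret is modeled (per‐tenant secrets are omitted), and all function names are hypothetical.

```python
import hashlib
import math
import secrets

# Toy RSA parameters from two Mersenne primes: far too small and structured for real use.
P, Q = 2**127 - 1, 2**89 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))   # the key server's secret exponent


def hash_to_int(data):
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def blind(data):
    """Tenant: hide H(data) behind a random factor before contacting the KS."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            break
    return (hash_to_int(data) * pow(r, E, N)) % N, r


def ks_evaluate(blinded):
    """Key server: apply its secret without learning H(data) or the final key."""
    return pow(blinded, D, N)


def unblind(data, signed, r):
    """Tenant: remove the blinding factor, check the result, derive the key."""
    s = (signed * pow(r, -1, N)) % N
    if pow(s, E, N) != hash_to_int(data):
        raise ValueError("key server returned an invalid response")
    return hashlib.sha256(s.to_bytes((N.bit_length() + 7) // 8, "big")).digest()


data = b"same file uploaded by two tenants"
blinded_a, r_a = blind(data)
key_a = unblind(data, ks_evaluate(blinded_a), r_a)
blinded_b, r_b = blind(data)
key_b = unblind(data, ks_evaluate(blinded_b), r_b)
```

Both tenants end up with the same key, so their ciphertexts can be deduplicated, while the KS only ever sees independently blinded values. An offline dictionary attack now requires the KS secret.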

One of the few efforts concerning data compression uses a methodology based on distributed source‐coding theory to perform compression over encrypted data (Johnson et al. 2004, pp. 2992–3006). Unfortunately, this method is only efficient for encrypted data in the case of an ideal Gaussian source. The compression rate is reduced for more general and common data distributions such as those encountered in cloud computing scenarios. Additionally, this approach does not support deduplication and has not been widely studied to determine its security properties. Similarly, at the time of writing, an algorithm to combine compression and encryption was presented in (Kelley and Tamassia 2014). However, this methodology is in its infancy and still needs to be studied by the security community before it can be widely trusted and adopted.

The Trusted Decrypter framework provides data‐reduction capabilities downstream of where data is encrypted. The framework consists of a small trusted module and several secure data‐reduction algorithms that leverage trusted execution technology to provide confidentiality guarantees to tenants. The framework provides for data confidentiality by encrypting data close to the source with a key uniquely owned and controlled by the tenant, ensuring that none of the tenant's data is accessible to the cloud provider and that secure offboarding from the cloud provider is easy and effective. Additionally, the framework generally requires no changes in the tenant's applications and minimum changes in the component encrypting tenant data, making it easier to integrate into current systems. For these reasons, in the following sections, a Trusted Decrypter architecture is presented and evaluated.

3.7 Trusted Decrypter

This section explores a Trusted Decrypter architecture. We present an overview of the architecture, a detailed description with an emphasis on security, and the results of some experiments showing the overheads and performance.

3.7.1 Overview

Driven by the requirement of enabling data reduction on encrypted data, maintaining an acceptable level of confidentiality through the lifetime of the data, and reducing as much as possible management overhead for tenants, in this section we present the Trusted Decrypter (TD) architecture originally published in (Baracaldo et al. 2014, pp. 21–32). This architecture was designed for tenants that require their data to be encrypted prior to its upload to the cloud storage provider. It provides confidentiality while the data is transmitted from the client to the cloud provider, when the data is stored in the cloud provider, and after the data is erased from the cloud provider. Additionally, the architecture allows secure offboarding, ensuring that reallocation of storage space from one user (tenant) to the other does not reveal any confidential information.

An overview of the TD framework is presented in Figure 3.1. The tenant side portrays the standard components of an IT system that uses an outsourced storage system, while the cloud storage provider offers a shared storage service. All messages exchanged between the system entities are exchanged over secure channels that provide perfect forward secrecy.

Figure 3.1 Overview of the Trusted Decrypter framework. On the tenant's side, as is typically the case, a business application such as a word processor generates data that is encrypted by a storage application. Additionally, each tenant manages its own key repository. The auxiliary repository can be hosted by the tenant or by the cloud provider. On the cloud provider side, all data‐reduction operations are performed, and the resulting data is stored to disk.

The tenant side consists of the following components: a business application (BA) that generates and processes data; a storage application (SA), whose services the BA uses to commit data to persistent storage; and a key repository (KeyRep) that maintains the master keys used during encryption. The SA in turn uses the storage services offered by the storage system of the cloud provider to store data. To comply with the tenant's security requirements, the SA encrypts data prior to outsourcing it. Tenants may encrypt their data using one or more keys, a requirement that is especially relevant for tenants that consist of one or more users (e.g. the employees of a company) and require each user to be assigned their own master encryption key. The TD framework also includes an auxiliary repository (AuxRep) used to maintain encryption metadata for each block of uploaded data; this metadata is referred to as auxiliary information, and the AuxRep can be hosted by the tenant or by the cloud provider. Thus, every time the SA encrypts data, it updates the AuxRep with the corresponding auxiliary information. The encryption support is provided either natively by the hypervisor or by software running inside the VM or client machine.

The cloud storage provider consists of the Trusted Decrypter (TD) and the storage media (e.g. disks, tapes) used to store data. Storage requests sent by clients are received and processed by the TD. Depending on the type of request, the TD performs the appropriate data‐reduction operations: for write requests, the TD decrypts the received data using the tenant key, compresses it, deduplicates it, re‐encrypts it with the tenant key, and sends it to persistent storage. For read requests, the TD retrieves the compressed and/or deduplicated data from disk, decrypts it, decompresses it, inverts the deduplication process, re‐encrypts it using the original context and keys, and sends the result back to the requesting user. The user is not aware of this process, because it receives the data exactly as it was originally written. Throughout its operation, the TD contacts both the AuxRep and KeyRep to fetch the appropriate information to decrypt and re‐encrypt data according to the received requests.

As shown in Figure 3.2, the TD consists of a critical module (CritMod) and a coordinator module (CoMod). The first module is in charge of all security‐sensitive tasks, whereas the second performs all those that are not. For this reason, the critical module is secured with the aid of the root of trust and isolation platform. The root of trust and isolation platform is used to allow the TD to attest to remote third parties the integrity of the platform and to restrict untrusted processes from accessing confidential information while it is in memory. Special security provisions and implementations are discussed in Sections 3.7.3 and 3.7.4, respectively. As long as these provisions are in place, the TD can be trusted by tenants.

Figure 3.2 Detailed design of the Trusted Decrypter.

The CritMod has several subcomponents: the key‐retrieval module, the cryptographic module, and the data‐efficiency module. The key‐retrieval module, through an attestation protocol, retrieves the key specified in the auxiliary information from the KeyRep. Upon the successful outcome of the attestation process, the key is retrieved and is used by the cryptographic module. The KeyRep is trusted to uniquely provide cryptographic keys to a trusted TD. The CritMod is always responsible for decrypting data coming into the TD and re‐encrypting data leaving the TD, and the data‐efficiency module is responsible for efficiency‐related data‐stream transformations according to whether data is being read or written. The AuxRep can be hosted by either the cloud provider or the tenant, because the information it maintains (initialization vectors and type of encryption algorithm used) is by definition public.

3.7.2 Secure Data‐Reduction Operations

Although having a trusted component is necessary, and its design is challenging in itself, it does not solve all the security problems. In fact, when multiple entities in the system use different keys, as is the case with corporate environments, performing deduplication and storing data in a way that maintains its confidentiality is a challenging task.

First, we present an overview of the cryptographic schemes used by tenants and data‐reduction methodologies used by the cloud storage provider. Then, we present the details of the secure data‐reduction operations performed by the TD.

Preliminaries

We begin by looking at data encryption performed in the tenant. The SA uses a symmetric cryptosystem. For simplicity of description, in the following, the explicit mention of the initialization vector (IV) is omitted, and the encryption of message m with key K is simply denoted by {m}K. In addition, the terms wrap and unwrap are used to refer to encryption and decryption of keys (as opposed to data blocks). The SA encrypts individual blocks of fixed size.

The SA encrypts data using hierarchically organized keys. The hierarchy can be modeled with two‐level trees: at the root of the hierarchy of keys, users have their own master encryption keys (MK): these keys belong to and are used by a particular user of a tenant, and each key is stored and managed by the KeyRep of the owning user. Master keys are only used to wrap (a set of) leaf keys. The keys that are directly used to encrypt user data represent the set of leaves of the tree. Leaf keys are stored (in wrapped form) together with the metadata about the data they encrypt, on persistent storage. The number of such leaf keys and the amount of data they encrypt can vary from system to system, ranging from one leaf key per sector through one per volume to a single leaf key for all volumes. There is a one‐to‐many mapping of leaf keys to sectors, and the concept of file is used to identify sets of sectors encrypted with the same leaf key. A file‐encryption key (FK) is the leaf key used to encrypt the sectors of the file. The FK is wrapped with the MK of the owner of the file ({FK}MK). This wrapped FK is stored as part of the metadata of the encrypted file.

This encryption approach, called indirect encryption, has multiple benefits in key management. The lifetime of MKs increases because they are used only to encrypt FKs as opposed to encrypting multiple potentially long files in their totality (NIST 2012). Additionally, when a MK is replaced (rekeying process), it is only necessary to rewrap FKs rather than re‐encrypt all data. Finally, if a FK is compromised, the attacker will only have access to that file, whereas not using indirect encryption would result in the compromise of all files encrypted with the same MK.
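The rekeying benefit can be illustrated with a short sketch. It assumes a toy XOR keystream cipher standing in for the symmetric cryptosystem (with the IV omitted, as in the notation above); the class EncryptedFile and its methods are hypothetical names.

```python
import hashlib
import secrets


def toy_encrypt(key, data):
    """XOR with a SHA-256 counter keystream. Toy cipher: applying it twice decrypts."""
    stream = b"".join(hashlib.sha256(key + i.to_bytes(8, "big")).digest()
                      for i in range(len(data) // 32 + 1))
    return bytes(a ^ b for a, b in zip(data, stream))


class EncryptedFile:
    """A file encrypted with a leaf key FK that is itself wrapped with a master key MK."""

    def __init__(self, mk, plaintext):
        fk = secrets.token_bytes(32)                  # file-encryption key (FK)
        self.ciphertext = toy_encrypt(fk, plaintext)  # {m}FK
        self.wrapped_fk = toy_encrypt(mk, fk)         # {FK}MK, kept in the metadata

    def read(self, mk):
        fk = toy_encrypt(mk, self.wrapped_fk)         # unwrap FK
        return toy_encrypt(fk, self.ciphertext)

    def rekey(self, mk_old, mk_new):
        """Replace the master key: only the 32-byte FK is rewrapped."""
        fk = toy_encrypt(mk_old, self.wrapped_fk)
        self.wrapped_fk = toy_encrypt(mk_new, fk)


mk_old, mk_new = secrets.token_bytes(32), secrets.token_bytes(32)
document = EncryptedFile(mk_old, b"contract draft " * 1000)
before = document.ciphertext
document.rekey(mk_old, mk_new)
```

After rekey, the 15,000 bytes of file ciphertext are untouched; only the wrapped key in the metadata has changed.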

For each encrypted block, the SA stores an auxiliary information entry I = <LBA, IDMK, {FK}MK, algInfo> in the AuxRep, where LBA is the lookup key for each entry and represents the address of the encrypted page, IDMK is the identifier of the master encryption key MK of the tenant‐user that issues the request, FK is the ephemeral key used to encrypt all sectors in the page, and algInfo contains information related to the encryption algorithm and mode of operation.

Now we consider data‐reduction techniques used by the cloud storage provider. The following discussion is centered on fixed‐size deduplication in the cloud storage provider side because of its suitability to primary storage systems; however, the concepts presented here can be easily extended to variable‐size or file‐based deduplication. In fixed‐size deduplication, a fixed amount of input data, called a chunk, is processed in such a way as to ensure that no duplicate chunks are stored (Quinlan and Dorward 2002). For each chunk, a digest is computed and stored in a deduplication table (DedupTB) together with the physical block address (PBA) where data is stored. In Section 3.7.4, we discuss the trade‐offs of sharing deduplication tables among different tenants (cross‐tenant deduplication) as opposed to having a separate deduplication table for each tenant. Both compression and deduplication require an indirection table (IndTB), which maps logical block addresses (LBAs) into PBAs. This table is typically used by storage systems to keep track of the place where a given data identifier is physically stored.

Detailed Secure Data‐Reduction Operations

We now show how the TD handles read, write, deletion, offboarding, and rekeying requests. Designing these operations is challenging, due to the conflicting requirements: on the one hand, encryption protects user data from being accessed by any other entity; on the other, deduplication intrinsically requires sharing the content of common blocks. Additionally, every deletion operation must ensure that its requester is no longer able to retrieve the content, whereas the same content must still be accessible by other users that share it as a result of deduplication.

The descriptions that follow are simplified to focus on the most fundamental aspects of how data is transformed on its way to the disk, ignoring details such as placement on disk, I/O to multiple volumes, and length of I/O requests. A real implementation will need to consider many such details, which are presented in (Baracaldo et al. 2014, pp. 21–32). In the following, we present the general concepts.

Write Requests

Write requests are initiated by the storage application sending a (write) request identifier together with an input data buffer and LBA to the coordinator module. The latter retrieves the auxiliary information indexed by the LBA and is then able to sort the request based on the IDMK specified in the auxiliary information entry. As shown in Figure 3.3a, grouping by IDMK allows the system to speed up data‐reduction operations since the MK of each user needs to be fetched only once, even for large amounts of data. The request buffer is then sliced into chunks.

Figure 3.3 Processing a write request. In Figure 3.3a, the data stream is received and split according to its MK. Then, in Figure 3.3b, the data is decrypted. When chunks cannot be deduplicated, they are compressed and then each chunk is independently encrypted with its own randomly generated Kaux. Finally, one or more encryption units are stored together in adjacent persistent storage. (a) Split input stream inflows, according to MK. (b) Compression and encryption process.

Once a full input chunk—whose FK is wrapped by the same MK—is gathered, the following steps are followed, as shown in Figure 3.3b. First, the MK is retrieved, chunks are decrypted, and the plaintext is hashed. The digest is used as a lookup key to verify whether the chunk is unique or a duplicate. When the chunk cannot be deduplicated, the compression algorithm is invoked, generating the compressed plaintext. The compressed chunk is encrypted with a freshly generated auxiliary key Kaux, resulting in an encryption unit. Before acknowledging the write operation, the TD updates both indirection and deduplication tables. For every encryption unit, an entry in the indirection table is created, indexed by the original LBA. The new entry contains: (i) the PBA of the chunk in persistent storage; (ii) the auxiliary key used to encrypt the unit in its wrapped form ({Kaux}MK); and (iii) the identifier of the master key IDMK used in the wrapping process. Finally, the deduplication table is updated.

When a chunk can be deduplicated, a lookup in the deduplication table returns LBA*, which is the location where the original chunk was written to. The TD then consults the indirection table entry indexed by LBA*. This lookup returns the address (PBA*) for that unit, the wrapped auxiliary key ({Kaux}MK*), and the identifier of the master key used in the wrapping (IDMK*). The CritMod contacts the KeyRep to obtain MK*, and uses it to obtain the plaintext version of the auxiliary key Kaux. The latter is then rewrapped with the master key specified in the auxiliary information entry associated with the current write request (IDMK) to produce a new wrapped version of the auxiliary key, {Kaux}MK. Finally, a new entry in the indirection table is created, indexed by the LBA of the current request, containing the PBA of the original copy of the chunk (PBA*), the new wrapped version of the auxiliary key ({Kaux}MK), and the identifier of the MK (IDMK).
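The two write paths (unique chunk and duplicate chunk) can be traced with a compact sketch. It is an illustration under simplifying assumptions, not the published implementation: a toy XOR keystream cipher replaces the real cryptosystem, zlib stands in for the compression algorithm, the KeyRep and AuxRep are plain dictionaries, attestation is omitted, and overwriting an LBA that the deduplication table still references is not handled. All class and variable names are hypothetical.

```python
import hashlib
import secrets
import zlib


def toy_encrypt(key, data):
    """XOR with a SHA-256 counter keystream. Toy cipher: applying it twice decrypts."""
    stream = b"".join(hashlib.sha256(key + i.to_bytes(8, "big")).digest()
                      for i in range(len(data) // 32 + 1))
    return bytes(a ^ b for a, b in zip(data, stream))


class TrustedDecrypter:
    def __init__(self, key_rep, aux_rep):
        self.key_rep = key_rep   # IDMK -> MK (stands in for the tenant KeyRep)
        self.aux_rep = aux_rep   # LBA -> (IDMK, {FK}MK)
        self.disk = {}           # PBA -> encryption unit
        self.ind_tb = {}         # LBA -> (PBA, {Kaux}MK, IDMK)
        self.dedup_tb = {}       # digest -> LBA* of the first copy

    def write(self, lba, encrypted_chunk):
        id_mk, wrapped_fk = self.aux_rep[lba]
        mk = self.key_rep[id_mk]
        fk = toy_encrypt(mk, wrapped_fk)                # unwrap FK
        plaintext = toy_encrypt(fk, encrypted_chunk)    # decrypt the incoming chunk
        digest = hashlib.sha256(plaintext).digest()
        lba_star = self.dedup_tb.get(digest)
        if lba_star is None:
            # Unique chunk: compress, encrypt under a fresh Kaux, and store.
            k_aux = secrets.token_bytes(32)
            pba = len(self.disk)
            self.disk[pba] = toy_encrypt(k_aux, zlib.compress(plaintext))
            self.dedup_tb[digest] = lba
        else:
            # Duplicate: recover Kaux through MK* so it can be rewrapped below.
            pba, wrapped_star, id_mk_star = self.ind_tb[lba_star]
            k_aux = toy_encrypt(self.key_rep[id_mk_star], wrapped_star)
        self.ind_tb[lba] = (pba, toy_encrypt(mk, k_aux), id_mk)


chunk = b"quarterly report " * 64
key_rep = {"alice": secrets.token_bytes(32), "bob": secrets.token_bytes(32)}
aux_rep, td_input = {}, {}
for lba, user in enumerate(["alice", "bob"]):
    fk = secrets.token_bytes(32)                          # each user has its own FK
    aux_rep[lba] = (user, toy_encrypt(key_rep[user], fk))
    td_input[lba] = toy_encrypt(fk, chunk)                # what the SA sends

td = TrustedDecrypter(key_rep, aux_rep)
for lba, data in td_input.items():
    td.write(lba, data)
```

Although the two users submit different ciphertexts, a single encryption unit ends up on disk, and each indirection table entry carries the same Kaux wrapped under a different master key.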

As described, whenever a duplicate write comes in, the master key used to protect the original write needs to be retrieved in order to perform the subsequent unwrapping and rewrapping operations. A possible alternative to this step entails wrapping the auxiliary key with (one or more) master keys generated and maintained by the TD. This way, future write requests have no dependency on master key(s) owned and managed by the clients.

Read Requests

A read request is triggered when a tenant sends a (read) request that contains an LBA. First, the TD retrieves the auxiliary information entry associated with the LBA, to retrieve the wrapped version {FK}MK of the ephemeral key FK used by the client when sending the LBA to the storage system and the identifier of the master key used in the wrapping process, IDMK. The TD then retrieves the associated entry in the indirection table and uses its content to retrieve the appropriate master key MK; with the MK, the wrapped auxiliary key is unwrapped. The PBA is used to read the encrypted unit from persistent storage; the encrypted unit is decrypted and decompressed to obtain the uncompressed plaintext page. The ephemeral key FK originally used to encrypt the page, which is stored in wrapped form in the associated auxiliary information entry, is then unwrapped. Finally, the uncompressed chunk is re‐encrypted using the FK, and the result is returned to the client.

Rekeying Requests

One of the important use cases for the system is allowing clients to change their master keys. A rekeying request is initiated by the client, which sends the identifiers of two master keys, IDMK_old and IDMK_new, requesting that all auxiliary keys currently wrapped with IDMK_old be unwrapped and rewrapped with IDMK_new. The TD can honor this request by scanning all entries in the indirection table and performing the rewrap operation when necessary. Notice that no bulk re‐encryption needs to take place.

File Deletion

The user of the storage application may decide at any time to erase a file. Typically, file‐deletion operations only entail the removal of file‐system metadata (e.g. removal of a directory entry and marking the inode as free). Nevertheless, the storage applications have two ways of notifying the TD that a file has been deleted and, as a consequence, that the LBAs used to store its content can be reclaimed: implicitly, by issuing a new write request for the same LBA; or explicitly, by issuing a command requesting the TD to reclaim the LBAs holding the deleted file. The TD handles these requests by erasing the indirection table entries associated with the unmapped or overwritten LBAs. In addition, for the case of deduplicated chunks, (i) the TD needs to handle a reference counter for the original chunk, to be able to decide whether the PBA of the original chunk can be reused; and (ii) if the original LBA is unmapped but the reference counter of the chunk is greater than zero, the TD needs to ensure that the DedupTB entry pointing to the deleted chunk is removed, and if required by the deduplication design, a DedupTB entry is added for another chunk with the same content.

Offboarding Requests

Offboarding requests are a special case of file‐deletion requests, wherein the set of unmapped or overwritten LBAs spans entire volumes or large portions thereof. These requests are assumed to take place when a tenant no longer requires the services of the storage provider. To honor these requests, the TD proceeds to erase all metadata that belongs to the tenant, and to remove from cache any keys possibly in use. After this step completes, data is no longer accessible by the cloud provider or the TD, since neither has access to the master keys required to decrypt it. Deduplicated chunks can be handled as described in section

Secure Data Deletion

Secure data deletion, as described in (Cachin et al. 2013), can be achieved by tenants with a combination of: (i) file deletion (or offboarding) requests that remove the wrapped version of auxiliary keys used to encrypt the chunks of the deleted file; (ii) generation of a new master key (IDMK_new); and (iii) rekeying requests to rewrap all auxiliary keys previously wrapped with IDMK_old. Here, IDMK_old identifies the master key used to wrap the auxiliary keys that encrypt the set of deleted chunks. Finally, the tenant can request its KeyRep to destroy the master key identified by IDMK_old. After this step is completed, the previously unmapped chunks can no longer be decrypted.
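The sequence of deletion, rekeying, and key destruction can be sketched as follows. The sketch makes several simplifying assumptions: a toy XOR keystream cipher is used, the indirection table and disk are dictionaries, deduplicated chunks are ignored, and the names KeyRep, rekey, and secure_delete are hypothetical.

```python
import hashlib
import secrets


def toy_encrypt(key, data):
    """XOR with a SHA-256 counter keystream. Toy cipher: applying it twice decrypts."""
    stream = b"".join(hashlib.sha256(key + i.to_bytes(8, "big")).digest()
                      for i in range(len(data) // 32 + 1))
    return bytes(a ^ b for a, b in zip(data, stream))


class KeyRep:
    """Stand-in for the tenant key repository: IDMK -> MK."""

    def __init__(self):
        self.keys = {}

    def create(self, id_mk):
        self.keys[id_mk] = secrets.token_bytes(32)

    def destroy(self, id_mk):
        del self.keys[id_mk]


def rekey(ind_tb, key_rep, id_old, id_new):
    """Scan the indirection table and rewrap auxiliary keys; no data is re-encrypted."""
    mk_old, mk_new = key_rep.keys[id_old], key_rep.keys[id_new]
    for lba, (pba, wrapped, id_mk) in list(ind_tb.items()):
        if id_mk == id_old:
            k_aux = toy_encrypt(mk_old, wrapped)
            ind_tb[lba] = (pba, toy_encrypt(mk_new, k_aux), id_new)


def secure_delete(ind_tb, key_rep, deleted_lbas, id_old, id_new):
    for lba in deleted_lbas:                  # (i) drop the wrapped Kaux of deleted chunks
        del ind_tb[lba]
    key_rep.create(id_new)                    # (ii) generate a new master key
    rekey(ind_tb, key_rep, id_old, id_new)    # (iii) rewrap the surviving Kaux
    key_rep.destroy(id_old)                   # finally, destroy the old master key


key_rep = KeyRep()
key_rep.create("mk-old")
disk, ind_tb = {}, {}
for lba, text in enumerate([b"to be deleted", b"to be kept"]):
    k_aux = secrets.token_bytes(32)
    disk[lba] = toy_encrypt(k_aux, text)
    ind_tb[lba] = (lba, toy_encrypt(key_rep.keys["mk-old"], k_aux), "mk-old")

secure_delete(ind_tb, key_rep, [0], "mk-old", "mk-new")
pba, wrapped, id_mk = ind_tb[1]
kept = toy_encrypt(toy_encrypt(key_rep.keys[id_mk], wrapped), disk[pba])
```

The ciphertext of the deleted chunk may still sit on disk, but its auxiliary key existed only wrapped under a master key that no longer exists, so even a leftover copy of the old table entry cannot be unwrapped.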

The special case of deleting deduplicated chunks needs to be discussed: deduplication inevitably creates a trade‐off between the abilities to erase data and to use storage space efficiently. A deduplicated page is clearly still accessible even after the aforementioned secure deletion steps are taken, because multiple cryptographic paths exist to decrypt the relevant auxiliary keys. While a complete analysis of this trade‐off is out of the scope of this section, we present a few relevant arguments in Section 3.7.4.

3.7.3 Securing the Critical Module

Since the CritMod has momentary access to users' data in plaintext and to the master keys, the security of the overall scheme depends on whether the storage‐access adversary is able to compromise it. Different implementations of the CritMod are possible: their ability to meet the security requirements of the scheme depends on the assumptions about the surrounding context. For example, a plain software implementation that does not use hardware security modules (see Section 3.7.4 for details) can be considered secure only if the appropriate physical, procedural, and regulatory controls strengthen the limited technical controls available to restrict access to the TD process.

Implementations that leverage trusted execution environments (TEEs) can count on stronger technical controls, enabling the (trusted) TD process to run on an untrusted platform. The properties of this solution can be summarized as follows: (i) no other process in the hosting machine, not even a privileged one, can access the memory of the CritMod; (ii) the TD metadata does not leave the CritMod in plaintext form; (iii) data provided by the user is always stored in encrypted form; and (iv) once a deletion request from a tenant to erase its keys from the system is received, the CritMod should erase all the corresponding keys in use and all its corresponding metadata in IndTB and DedupTB. Unfortunately, TEEs cannot guarantee that the TD process cannot be exploited by supplying a sequence of inputs that subvert its control flow. To ensure against this type of threat, the code of the CritMod needs to be verified before deployment to ensure that it does not contain any backdoors or vulnerabilities. This can be achieved by automatic tools such as (CBMC 2018), inspecting the code manually, following secure software development procedures, and making the code of the CritMod public.

Another property of TEE‐based solutions is their ability to execute remote attestation protocols, by means of which the CritMod code can authenticate to the KeyRep, thus establishing the root of trust necessary to allow the exchange of master keys. This allows an attestation protocol to take place when the key‐retrieval module fetches master keys from a KeyRep. This occurs when the function GetKey is invoked. During the attestation protocol, the key‐retrieval module contacts the KeyRep, which replies with a challenge to verify whether the CritMod can be trusted. Only when the attestation protocol is successful does the KeyRep send the requested key to the CritMod. If the process is successful, the CritMod can perform the storage operations. For efficiency reasons, master keys may temporarily be cached so as to only occasionally incur the overhead of attestation. The attestation process requires previous certification of the code, which means the KeyRep needs to be configured with the expected measurements of the CritMod, so that its integrity can be verified. In order for the attestation protocol to work properly, it is necessary to provision the root of trust and isolation platform with a private and a public key, so that the platform can attest the current state of the system to the KeyRep. The key‐provisioning process needs to be performed in a secure way to ensure that the KeyRep and the TD administrators do not collude during the setup process; otherwise, a cuckoo attack (Parno 2008) could be performed, causing the KeyRep to inadvertently leak keys. In a cuckoo attack, if an attacker manages to set a bogus certificate of the TD in the KeyRep, the KeyRep will inadvertently leak the keys.
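The shape of the key‐retrieval exchange can be sketched as a challenge and response protocol. This is only a schematic under loud assumptions: a shared HMAC key stands in for the platform's provisioned private and public key pair, the "measurement" is simply a SHA-256 hash of a byte string representing the CritMod code, and the classes Platform and KeyRep and the function fetch_master_key are hypothetical (get_key loosely mirrors the GetKey function mentioned above).

```python
import hashlib
import hmac
import secrets


class Platform:
    """Stand-in for the root of trust: measures the CritMod code and signs quotes."""

    def __init__(self, attestation_key, critmod_code):
        self._key = attestation_key
        self._code = critmod_code

    def quote(self, nonce):
        measurement = hashlib.sha256(self._code).digest()
        tag = hmac.new(self._key, nonce + measurement, hashlib.sha256).digest()
        return measurement, tag


class KeyRep:
    """Releases master keys only to a platform that reports the expected measurement."""

    def __init__(self, attestation_key, expected_measurement, master_keys):
        self._key = attestation_key
        self._expected = expected_measurement
        self._master_keys = master_keys
        self._nonce = None

    def challenge(self):
        self._nonce = secrets.token_bytes(16)
        return self._nonce

    def get_key(self, id_mk, measurement, tag):
        nonce, self._nonce = self._nonce, None    # each challenge is single use
        if nonce is None:
            raise PermissionError("no outstanding challenge")
        expected_tag = hmac.new(self._key, nonce + measurement, hashlib.sha256).digest()
        if not (hmac.compare_digest(expected_tag, tag)
                and hmac.compare_digest(measurement, self._expected)):
            raise PermissionError("attestation failed")
        return self._master_keys[id_mk]


def fetch_master_key(platform, key_rep, id_mk):
    """Key-retrieval module: answer the KeyRep challenge, then receive the key."""
    nonce = key_rep.challenge()
    measurement, tag = platform.quote(nonce)
    return key_rep.get_key(id_mk, measurement, tag)


platform_key = secrets.token_bytes(32)
good_code = b"critmod build 1.0"
mk = secrets.token_bytes(32)
key_rep = KeyRep(platform_key, hashlib.sha256(good_code).digest(), {"mk-1": mk})
```

A platform running the certified code obtains the key, while one running modified code reports a different measurement and is refused. The sketch also shows why provisioning matters: whoever can plant an attestation key or expected measurement in the KeyRep decides which platform is trusted.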

Finally, the TD needs to preserve the confidentiality and integrity of the deduplication and indirection tables. These objectives can be enforced technically (e.g. with secure hardware) and organizationally (e.g. with appropriate processes and workflows) while the entries reside in the main memory of the TD, and cryptographically when the TD commits the entries to persistent storage (e.g. by encrypting them with cryptographic material that never leaves the CritMod). This protection will be effective against a storage‐access adversary, but not effective against a law‐enforcement adversary.

3.7.4 Security Analysis

Data Confidentiality

First, we discuss how the confidentiality objective is addressed. An adversary cannot access data while in flight: this follows from our assumption of mutually authenticated secure channels providing perfect forward secrecy. We assume that mutual authentication provides sufficient means to thwart impersonation attacks on both the client side and the cloud side.

Once data reaches the TD, it is re‐encrypted using an auxiliary key and then stored encrypted on the storage media. The auxiliary key exists in wrapped form in the indirection table, and occasionally in the main memory of the CritMod. Consider an adversary who is able to access the persistent storage physically or remotely and can subvert the operating system environment where the TD is hosted (e.g. obtain root privileges); henceforth, we refer to this adversary as a storage‐access adversary (this adversary covers the curious administrator and malicious tenant presented in Section 3.3). This adversary can gain access to the chunks on disk, but these are encrypted with auxiliary keys. In light of the security provisions for the CritMod, we assume that this adversary cannot compromise the CritMod, either by subverting the integrity of its control flow or by accessing security‐sensitive data and metadata in plaintext, which may temporarily be present in the main memory of the CritMod. Admittedly, this is a strong assumption, and it is by no means straightforward to implement. The ways of achieving this in practice are outside of the scope of this chapter and represent a security engineering challenge more than they do a research one (see Section 3.7.3 for a more detailed discussion of the subject). Auxiliary keys are present in plaintext in the memory of the TD and wrapped with a master key in the indirection table entries. We assume that—by means of a combination of technical, procedural, and regulatory controls—this adversary has no access to plaintext in the main memory of the TD. Indirection table entries are similarly protected. 
Additionally, even when in possession of plaintext indirection table entries, this adversary wouldn't be able to unwrap the auxiliary keys; they are wrapped by tenant master keys, and the latter are stored in the key repository, which is controlled by the tenant and is trusted to provide keys uniquely to key owners or a properly attested and authenticated CritMod.
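The key hierarchy described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the prototype's implementation: the class names `KeyRepository` and `TrustedDecrypter`, the `seal`/`unseal` helpers, and the key identifiers are hypothetical, the TD here receives plaintext chunks directly (the tenant‐side decryption step is omitted), and a toy SHA‐256 keystream with an HMAC tag stands in for AES so that the example needs only the Python standard library. It shows that an aux key is stored only in wrapped form, and that destroying a master key in the key repository leaves the corresponding chunks unrecoverable.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy SHA-256 counter-mode keystream (a stand-in for AES, NOT for production)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt-then-MAC; returns nonce || ciphertext || tag."""
    nonce = os.urandom(16)
    ct = bytes(a ^ b for a, b in zip(plaintext, _keystream(key, nonce, len(plaintext))))
    tag = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag


def unseal(key: bytes, blob: bytes) -> bytes:
    """Verify the tag and decrypt; raises ValueError for a wrong key."""
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or corrupted blob")
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, nonce, len(ct))))


class KeyRepository:
    """Hypothetical tenant-controlled KeyRep holding master keys."""

    def __init__(self):
        self._keys = {}

    def create(self, key_id: str) -> None:
        self._keys[key_id] = os.urandom(32)

    def get(self, key_id: str) -> bytes:
        return self._keys[key_id]  # raises KeyError once the key is destroyed

    def destroy(self, key_id: str) -> None:
        del self._keys[key_id]


class TrustedDecrypter:
    """Hypothetical TD: chunks on disk use aux keys; aux keys are wrapped by master keys."""

    def __init__(self, keyrep: KeyRepository):
        self.keyrep = keyrep
        self.indirection = {}  # LBA -> (master key id, wrapped aux key)
        self.disk = {}         # LBA -> chunk encrypted under its aux key

    def write(self, lba: int, chunk: bytes, mk_id: str) -> None:
        aux = os.urandom(32)
        self.disk[lba] = seal(aux, chunk)
        self.indirection[lba] = (mk_id, seal(self.keyrep.get(mk_id), aux))

    def read(self, lba: int) -> bytes:
        mk_id, wrapped = self.indirection[lba]
        aux = unseal(self.keyrep.get(mk_id), wrapped)
        return unseal(aux, self.disk[lba])

    def rekey(self, old_id: str, new_id: str, keep) -> None:
        """Rewrap the aux keys of retained LBAs, then destroy the old master key."""
        for lba in keep:
            mk_id, wrapped = self.indirection[lba]
            if mk_id == old_id:
                aux = unseal(self.keyrep.get(old_id), wrapped)
                self.indirection[lba] = (new_id, seal(self.keyrep.get(new_id), aux))
        self.keyrep.destroy(old_id)


keyrep = KeyRepository()
keyrep.create("MK_old")
td = TrustedDecrypter(keyrep)
td.write(0, b"keep me", "MK_old")
td.write(1, b"delete me", "MK_old")

keyrep.create("MK_new")
td.rekey("MK_old", "MK_new", keep=[0])

print(td.read(0))
try:
    td.read(1)
except KeyError:
    print("chunk 1 is unrecoverable: its master key was destroyed")
```

Note that rekeying touches only the small wrapped keys in the indirection table, not the chunks themselves, which is why the indirection through aux keys keeps secure deletion cheap in this sketch.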

Consider a more powerful adversary, which we call a law‐enforcement adversary, who can access the content of the storage media, can compel tenants to reveal the cryptographic material stored in the KeyRep, and can request the storage provider to produce the content of the internal metadata of the TD (e.g. the indirection table). The only assumption we make is that the law‐enforcement adversary cannot access master keys that have been destroyed after an explicit user request. Given these capabilities, this adversary can clearly compromise the confidentiality of user data. However, note that the same is also true in the absence of the TD. This type of adversary is, however, unable to access data that has been securely deleted by means of a sequence of rekeying operations and destruction of master keys in the KeyRep. This is true because this adversary can recover only chunks encrypted by a set of auxiliary keys that are themselves wrapped with a master key MKold; the latter is assumed to be unrecoverable after the tenant has requested its destruction from its KeyRep. This adversary is also unable to recover data stored on the same storage medium as the targeted client but encrypted by another tenant whose keys have not been revealed to the adversary.

Data Confidentiality in the Presence of Deduplication

Deduplication represents a threat to confidentiality because it offers an oracle that can be consulted to discover whether the same content has already been uploaded, as described in (Halevi et al. 2011, pp. 491–500). An adversary can consult this oracle in an online fashion by colluding with a malicious tenant, whereas the law‐enforcement adversary has direct access to the deduplication table and therefore has offline access to this oracle.

This vulnerability exists in every system that supports deduplication, and it is neither thwarted nor worsened by the existence of the TD. However, cross‐tenant deduplication can increase the scope of the vulnerability since the oracle may be consulted by malicious tenants that want to compromise the confidentiality of a victim tenant's data. Avoiding this breach of confidentiality can be achieved by restricting deduplication to be intra‐tenant only.
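The difference between the two deduplication scopes can be illustrated with a small sketch. Everything here is hypothetical and chosen only for illustration: the `DedupStore` class, the tenant names, and the guessed chunk contents. The store also returns the "was deduplicated" signal directly, whereas a real adversary would have to infer it from a side effect such as the bandwidth used.

```python
import hashlib


class DedupStore:
    """Hypothetical chunk store; `cross_tenant` selects the deduplication scope."""

    def __init__(self, cross_tenant: bool):
        self.cross_tenant = cross_tenant
        self.index = {}  # deduplication key -> stored chunk

    def _key(self, tenant: str, chunk: bytes):
        digest = hashlib.sha256(chunk).hexdigest()
        # Intra-tenant scope namespaces every signature by its owner.
        return digest if self.cross_tenant else (tenant, digest)

    def write(self, tenant: str, chunk: bytes) -> bool:
        """Store a chunk; return True if it was deduplicated (the oracle signal)."""
        key = self._key(tenant, chunk)
        if key in self.index:
            return True
        self.index[key] = chunk
        return False


def probe(store: DedupStore, guesses):
    """A malicious tenant writes candidate chunks and keeps those that deduplicate."""
    return [g for g in guesses if store.write("mallory", g)]


# A low-entropy secret: the adversary knows the format and only guesses the value.
secret = b"salary=120000"
guesses = [b"salary=%d" % s for s in range(100000, 150000, 10000)]

for cross in (True, False):
    store = DedupStore(cross_tenant=cross)
    store.write("victim", secret)
    print(cross, probe(store, guesses))
# prints: True [b'salary=120000']
# prints: False []
```

With intra‐tenant scope the malicious tenant's writes never match the victim's signatures, so the oracle gives no answer, at the price of losing cross‐tenant capacity savings.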

Focusing on the law‐enforcement adversary, while this adversary cannot revert secure deletion, traces of deleted chunks may still be part of the system in the form of duplicates of the deleted chunks. However, the mere existence of a deduplicated page, disconnected from its original context (the ordered sequence of chunks that formed the deleted file), doesn't constitute significant leakage to a law‐enforcement adversary: LBAs can be reused, and the tenant can always claim that all remaining chunks have no relation to the deleted page. Also, if cross‐tenant deduplication is not allowed, tenant offboarding in conjunction with the destruction of master keys is effective, as no traces of existing data can be gathered. However, it is clear that if cross‐tenant deduplication were allowed, a law‐enforcement adversary could collude with another tenant and use deduplication as an effective way of establishing a connection between existing chunks before and after deletion, using the colluding tenant as a means of circumventing deletion for a set of sensitive chunks. The law‐enforcement adversary would indeed be able to use the colluding tenant to issue write requests for a set of sensitive chunks. If these writes are deduplicated before the deletion but no duplicates are found after secure deletion is committed, the law‐enforcement adversary could deduce that such chunks did exist and were the subject of secure deletion. This type of attack is extremely effective, but only for chunks with low conditional entropy. Avoiding this breach of confidentiality is another reason to disable cross‐tenant deduplication in a TD system.

Security Impact of Different Technologies

There are different implementation possibilities for the CritMod, and their impact on the security of our framework varies. The alternatives vary according to the amount of trust that is initially placed in different components of the system and in the hardware required to ensure secure execution. In general, the smaller the size of the code that needs to be trusted, also known as the trusted code base (TCB), the more secure the system is. We classify the isolation techniques as hardware‐based isolation (HBI) and virtualization‐based isolation (VBI). HBI solutions such as SecureBlue++ (Boivie 2012), (Lie et al. 2000; Suh et al. 2003, pp. 160–171; Williams and Boivie 2011; Intel 2018b), and Software Guard Extensions (SGX) (Intel 2018a) use special hardware instructions to isolate critical applications from other processes that run in the same host, whereas VBI solutions such as TrustVisor (McCune et al. 2010, pp. 143–158), Overshadow (Chen et al. 2008, pp. 2–13), and (Vasudevan et al. 2012) depend on virtualization to provide isolation between trusted and untrusted code. After filtering out approaches that do not allow multithreading and simultaneous instances, which would not deliver the required performance, the TD could be implemented using HBI or VBI approaches with TPM.

HBI solutions are more secure than VBI solutions because they rely only on the critical application and the hardware itself. Furthermore, SecureBlue++ allows remote attestation and provides a higher level of assurance than other approaches: it permits continuous software‐integrity protection, not only at boot‐time. Among VBI solutions, TrustVisor has the smallest hypervisor, but it emulates the TPM operations in software to improve their performance. Overshadow has a larger TCB and requires the use of a TPM, making it less secure. On the other hand, the implementation cost is larger for HBI solutions because they are not widely available, while VBI solutions do not require specialized hardware.

3.7.5 TD Overhead and Performance Implications

A prototype of the TD (see Figure 3.4) was implemented to understand its feasibility and measure the overhead. The tenant side was represented by the IBM Spectrum Scale with encryption support enabled (IBM 2018b), generating encrypted files and transmitting them to the cloud storage provider. Spectrum Scale was modified to store the encryption auxiliary information in AuxRep via User Datagram Protocol (UDP) messages. The key repository KeyRep was hosted in an IBM Security Key Lifecycle Manager (https://www.ibm.com/support/knowledgecenter/en/SSWPVP). The cloud storage provider was built as a network block device (NBD) (nbd 2018). The AuxRep was built as part of nbd‐server. Advanced Encryption Standard (AES128) was used for encryption, and the prototype was implemented using C++.

Diagram of TD prototype system with application server connected to IBM security key lifecycle manager (key server) and network block device. The latter is connected to TD and data.

Figure 3.4 TD prototype system.

The experiments were performed on a machine with an Intel Xeon CPU, 18 GB of RAM, an IBM 42D0747 7200 RPM SATA HDD, an OCZ Vertex 2 SATA SSD, and a 64‐bit operating system. For cryptographic operations, results were averaged across 50 independent repetitions of the measurements. To obtain the measurements of the disk accesses, 10 000 random accesses were performed to a file of 16 GB. The data presented is the average of all the repetitions. In our experiments, it was assumed that the master key used to encrypt the data was cached in the system. All result figures use a logarithmic scale.

More details about the system and measurements made can be found in (Baracaldo et al. 2014, pp. 21–32). However, we summarize the most important results here. The measurements focused on the main concern of TD data‐path performance and the cryptographic operations relative to the other data‐path latencies, such as access to the storage medium. It was assumed that overheads and latencies for fetching keys and auxiliary information can be optimized by caching algorithms and therefore will not be generally seen as a significant contributor to TD overhead.

Figure 3.5 shows the time consumed by the various cryptographic operations relative to a read from the HDD storage medium, for a range of typical primary storage system chunk sizes. It confirms that encrypting and decrypting data incur the most latency among the cryptographic operations, by about a factor of 10, yet they remain about a factor of 30–100 faster than the actual HDD storage medium accesses.

Clustered bar graph of overhead of data‐path operations for read requests, depicting create key, encrypt, decrypt, wrap key, unwrap key, and read HDD for 4KB, 12KB, and 16KB chunk sizes.

Figure 3.5 Overhead of data‐path operations for read requests.

Figure 3.6 suggests that the aggregate overhead related to cryptographic operations is very small relative to HDD access times, as also indicated in Figure 3.5. For SSD, the overhead, with the low level of instruction‐set acceleration present in the prototype, is on the same order of magnitude as the access to the storage medium.

Clustered bar graph of crypto operations, read HDD, and read SSD for 4KB, 12KB, and 16KB chunk sizes.

Figure 3.6 Comparing overheads for reads to HDD and SSD.

Not shown here are overheads related to a base data‐reduction system: compressing the data and the hash calculations normally associated with generating deduplication signatures. Those calculations are also substantial: compression and decompression calculations generally take on the order of 1.5–3 times (per byte) the overhead of AES encryption, and Secure Hash Algorithm (SHA‐1) calculations (often used for deduplication signatures) often take on the order of 2–3 times more overhead than AES calculations for encryption/decryption (although the signature is only calculated on write operations). This suggests that the overhead of TD‐related crypto operations will likely be even less substantial than these charts suggest, even for SSD.
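Readers who want to see the data‐reduction costs on their own hardware can measure them with a short script. The following is a rough sketch, not the methodology of the prototype: the function names `time_per_call` and `measure` are hypothetical, the half‐random, half‐zero payload is an arbitrary choice, zlib stands in for whatever compressor a storage system actually uses, and AES is not measured because the Python standard library does not provide it. The 50 repetitions mirror the averaging used for the cryptographic measurements above.

```python
import hashlib
import os
import time
import zlib


def time_per_call(fn, data: bytes, repetitions: int = 50) -> float:
    """Average wall-clock seconds per call of fn(data)."""
    start = time.perf_counter()
    for _ in range(repetitions):
        fn(data)
    return (time.perf_counter() - start) / repetitions


def measure(chunk_sizes=(4096, 16384)):
    """Time SHA-1, compression, and decompression for each chunk size (bytes)."""
    results = {}
    for size in chunk_sizes:
        # Arbitrary test payload: half random bytes, half zeros (partly compressible).
        chunk = os.urandom(size // 2) + bytes(size // 2)
        compressed = zlib.compress(chunk)
        results[size] = {
            "sha1": time_per_call(lambda d: hashlib.sha1(d).digest(), chunk),
            "compress": time_per_call(zlib.compress, chunk),
            "decompress": time_per_call(zlib.decompress, compressed),
        }
    return results


if __name__ == "__main__":
    for size, timings in measure().items():
        row = ", ".join(f"{op}={t * 1e6:.1f} us" for op, t in timings.items())
        print(f"{size} B: {row}")
```

Absolute numbers from an interpreted harness like this one are only indicative; the ratios between operations, and their trend across chunk sizes, are the more meaningful output.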

In addition, increasingly capable hardware acceleration is being provided for compression/decompression calculations, for encryption/decryption calculations, and for secure hash calculations such as SHA‐1 and SHA‐2. For example, hardware accelerators such as Intel QuickAssist Technology (Intel 2018c) have compression and crypto engines, and instruction support such as Intel SSE4 (Intel 2017) is continuing to accelerate crypto, compression, and hashing algorithms.

Therefore, the data‐path overhead for a TD operating in an HDD storage system today looks very reasonable. For SSD, the extra overhead imposed by TD is significant, but within the same order of magnitude as the compression and deduplication functions. However, since SSD is quite expensive relative to HDD, the extra cost to provide more compute resource in the TD‐enabled storage system per byte of SSD storage is likely to be practical, still providing an affordable TD‐enabled SSD storage platform. It appears that the TD data‐path overheads are practical for HDD and SSD storage systems and will only become more practical over the next several generations as hardware continues to enhance compute resources for crypto operations.

Another area of overhead could be that associated with the TEE for the TD. The overhead incurred by these mechanisms is the price paid for achieving complete process isolation as well as providing tenants with verifiable trust assurance of the state of the TD provided by their storage provider. Hardware methods such as SGX and SecureBlue++ will incur very little overhead and will be the most secure; they hold great promise for TD‐enabled storage systems with the highest security guarantees.

3.8 Future Directions for Cloud Storage Confidentiality with Low Cost

Multiple challenges need to be met to achieve full confidentiality at an affordable price in cloud storage systems.

3.8.1 Hardware Trends

Encryption and data‐reduction computation costs are decreasing but still significant in servers in terms of CPU cycles and memory‐bandwidth usage. In addition, the latency introduced by encryption can be very significant for some use cases. Processor instructions designed to speed up cryptographic or compression calculations, other types of hardware acceleration, and decreasing cost of cores in CPUs will increasingly mitigate this problem.

3.8.2 New Cryptographic Techniques

In the short term, cost‐saving solutions such as those detailed in this chapter will help reduce cost by reducing the capacity needed. In the long term, cryptographic techniques to perform computations over encrypted data, without revealing plaintext to untrusted third parties such as cloud storage providers, will be feasible and will allow data reduction on encrypted data. Homomorphic encryption (Gentry 2009) is one promising encryption method, but it still needs to be improved to achieve faster computation times. Designing such systems is one of the research tracks that can significantly improve the security of cloud systems.
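To make "computation over encrypted data" concrete, the sketch below uses the Paillier cryptosystem, which is only additively homomorphic and is therefore much weaker than the fully homomorphic scheme of (Gentry 2009); it is swapped in here because it fits in a few lines. The function names and the tiny primes are assumptions made for illustration, and the parameters are far too small to be secure. The point is that an untrusted party can combine two ciphertexts into an encryption of the sum without ever seeing a plaintext or a key.

```python
import math
import secrets


def keygen(p: int = 1789, q: int = 1861):
    """Toy key generation from two small primes (insecure; real keys are far larger)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt 0 <= m < n under public key n with fresh randomness r."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Run by the untrusted party: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)


n, priv = keygen()
c_sum = add_encrypted(n, encrypt(n, 1200), encrypt(n, 345))
print(decrypt(n, priv, c_sum))  # prints 1545
```

Encryption here is randomized, so equal plaintexts produce different ciphertexts; this is exactly the property that defeats conventional deduplication and that future schemes would need to work around.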

3.8.3 Usability of Cryptographic Key Management

Key management today is difficult and expensive to set up and maintain and, once set up, is not easy enough for users. Key managers are often targeted by attackers who intend to steal data, as was the case in a recent security incident that targeted Amazon Web Services (CRN 2017). Operations such as key distribution, key storage, and rekeying are still difficult to perform.

Usability of key management as well as other items related to seamless and integrated use of encryption by users will need to be improved to drive adoption rates. Efforts such as (Cachin et al. 2013) try to automate this process while achieving secure deletion. Federated key management spanning on‐premises and multiple clouds will also be an enabler for the Trusted Decrypter and other similar systems, as long as those systems are designed carefully with strong security.

3.8.4 Trusted Execution Environments

Increasing the capabilities and availability of trustworthy execution environments and decreasing the costs associated with them will remove inhibitors to wider adoption of architectures such as Trusted Decrypters. Trusted execution technologies and process‐isolation techniques are mechanisms that can help protect and provide guarantees to tenants, as is the case in the TD architecture presented in Section 3.7. Current hardware components for trusted execution environments are not widely available in CPUs and servers. Although TPMs are generally available in many CPUs, their key‐storage space is limited, and secure computing hardware such as Intel SGX (Intel 2018a) is not yet commonly used. Additionally, some challenges still need to be addressed for trusted execution environment technologies to be fully practical. Processes like key provisioning in cloud environments should be performed carefully to avoid man‐in‐the‐middle attacks (Parno 2008). New mechanisms to automate this process and avoid such pitfalls are needed.

Application software needs to be updated and patched often, causing the signatures of the system to change. Systems to distribute correct and up‐to‐date signature values in a secure way are necessary to avoid false positives.

In Section 3.7, we presented some methodologies to achieve data confidentiality through process isolation. Hypervisor‐based process‐isolation technologies seem to be a fast and more economical way to go forward. However, these solutions still require trusting a large code base. New ways to minimize the risk of trusting compromised hypervisors are needed.

Containers, rather than hypervisors, are now widely used to isolate tenants; systems like Cloud Foundry (www.cloudfoundry.org), Amazon Web Services (https://aws.amazon.com), and Bluemix (https://www.ibm.com/cloud) effectively use them for this purpose. A container is an isolated process belonging to a tenant. Each tenant is assigned a container, and multiple tenants share the same operating system. One of the main advantages of this approach is that provisioning a new container is much faster than starting a VM image, allowing clients to bring up new applications faster. However, this type of isolation clearly increases the risk of exposure of tenants. Potential adversaries may share the same operating system and may try to compromise the system to obtain access to sensitive data of other tenants or to launch a denial‐of‐service attack (Zhang et al. 2014, pp. 990–1003; Catuogno and Galdi 2016, pp. 69–76). Research efforts to mitigate the risk of information leakage imposed by containers are currently underway (Gao et al. 2017, pp. 237–248; Henriksson and Falk 2017).

3.8.5 Privacy and Side‐Channel Attacks

Existing systems do not offer confidentiality and privacy of metadata, such as filenames and folder paths, stored in the cloud. In the systems presented in this chapter, it is assumed that such metadata is not confidential and can be freely shared with cloud storage providers. Sometimes, just learning of the existence of a particular filename may allow adversaries to infer information about the victim. For example, if a file is named companyA‐acquisition‐budget.xls, an adversary may conclude that the tenant is at least analyzing the possibility of acquiring companyA. Protecting the confidentiality of storage systems' metadata is a complex problem that still needs to be studied by the research community. To the best of our knowledge, the problem of applying data‐reduction techniques while maintaining confidentiality of data has not been studied in the presence of confidential metadata.

During our analysis of data‐reduction‐enabled cloud storage systems, we presented multiple side‐channel attacks: e.g. an adversary observing the bandwidth used when a particular piece of data is uploaded can infer whether that piece of data was previously stored in the storage system. There are many more side‐channel threats that are relevant to cloud storage systems. For example, side‐channel threats occur when adversaries can observe patterns of data usage to infer tenants' intentions, even if data is encrypted (Van Dijk and Juels 2010, pp. 1–8). Private information retrieval (Chor et al. 1998) and oblivious storage (Stefanov et al. 2011; Goodrich et al. 2012, pp. 157–167; Chan et al. 2017, pp. 660–690) are techniques that obfuscate usage patterns to thwart these types of attacks; however, they can cause performance and cost issues (e.g. lower performance due to lower cache‐hit rates in storage systems and lower “real” throughput achieved per storage system and network). Given these costs, novel and more efficient techniques to deal with these privacy side‐channel attacks are necessary.
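The trade‐off between hiding access patterns and cost can be seen in a minimal sketch of two‐server private information retrieval, in the spirit of the simplest XOR‐based construction. The `Server` class, the `retrieve` function, and the sample blocks are hypothetical, and the scheme assumes two replicas of the database that do not collude. Each server sees only a uniformly random selection vector, so neither learns which block the client wants, yet each must touch about half of the database for every query.

```python
import secrets


class Server:
    """One of two non-colluding replicas of the same database of equal-length blocks."""

    def __init__(self, blocks):
        self.blocks = blocks

    def answer(self, selection):
        """XOR together the blocks picked out by the selection bit vector."""
        result = bytes(len(self.blocks[0]))
        for bit, block in zip(selection, self.blocks):
            if bit:
                result = bytes(a ^ b for a, b in zip(result, block))
        return result


def retrieve(index: int, server_a: Server, server_b: Server, n_blocks: int) -> bytes:
    """Fetch block `index` without revealing the index to either server."""
    # Server A gets a random subset; server B gets the same subset with `index` flipped.
    query_a = [secrets.randbelow(2) for _ in range(n_blocks)]
    query_b = list(query_a)
    query_b[index] ^= 1
    answer_a = server_a.answer(query_a)
    answer_b = server_b.answer(query_b)
    # Blocks selected in both queries cancel out; only block `index` remains.
    return bytes(a ^ b for a, b in zip(answer_a, answer_b))


blocks = [b"chunk-00", b"chunk-01", b"chunk-02", b"chunk-03"]
replica_a, replica_b = Server(blocks), Server(blocks)
print(retrieve(2, replica_a, replica_b, len(blocks)))  # prints b'chunk-02'
```

The linear work per query in this sketch is one concrete form of the performance and cost issues mentioned above, and it also defeats caching, since every query looks like a random scan.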

More research to identify and deter new and existing side‐channel attacks is needed. In particular, additional cost savings from multitenant deduplication will be enabled once there is no concern about side‐channel attacks.

3.9 Conclusions

Cloud storage systems have become increasingly popular, but their success depends on their ability to protect the confidentiality of the information stored by tenants while offering low‐cost services. In this chapter, we overviewed some of the techniques commonly used to maintain data confidentiality and showed how these techniques can reduce the efficacy of traditionally used data‐reduction techniques such as compression and deduplication, increasing costs for both tenants and cloud storage providers. To restore cost savings, we presented multiple techniques that alleviate this problem without compromising cloud data confidentiality. We detailed the Trusted Decrypter architecture as a feasible way to reconcile data‐reduction techniques and confidentiality in the near future. In the long term, we expect that cloud storage systems will continue to evolve by leveraging ever faster and less expensive hardware to facilitate cryptographic and compression operations to protect data confidentiality without large overheads. We also anticipate that new cryptographic techniques to perform data reduction over encrypted data will become feasible. These new cryptographic techniques still have a long way to go before they are considered secure by the community and provide adequate performance at low cost, and only then can they be adopted in real systems. However, their development will play a positive role in cloud storage providers' ability to offer low‐cost, secure data services.


