CHAPTER 8

Encryption in Hadoop

Recently, I was talking with a friend about possibly using Hadoop to speed up reporting on his company’s “massive” data warehouse of 4TB. (He heads the IT department of one of the biggest real estate companies in the Chicago area.) Although he grudgingly agreed to a possible performance benefit, he asked very confidently, “But what about encrypting our HR [human resources] data? For our MS SQL Server–based HR data, we use symmetric key encryption and certificates supplemented by C# code. How can you implement that with Hadoop?”

As Hadoop is increasingly used within corporate environments, a lot more people are going to ask the same question. The answer isn’t straightforward. Most Hadoop distributions now ship with Kerberos support and include easy options to implement authorization as well as encryption in transit, but your options for at-rest encryption in Hadoop are limited, especially with file-level granularity.

Why do you need to encrypt data while it’s at rest and stored on a disk? Encryption is the last line of defense when a hacker gets complete access to your data. It is a comforting feeling to know that your data is still going to be safe, since it can’t be decrypted and used without the key that scrambled it. Remember, however, that encryption is used for countering unauthorized access and hence can’t be replaced by authentication or authorization (both of which control authorized access).

In this chapter, I will discuss encryption at rest and how you can implement it within Hadoop. First, I will provide a brief overview of symmetric (secret key) encryption as used by the DES and AES algorithms, asymmetric (public key) encryption as used by the RSA algorithm, key exchange protocols and certificates, digital signatures, and cryptographic hash functions. Then, I will explain what needs to be encrypted within Hadoop and how, and discuss the Intel Hadoop distribution, parts of which are now planned to be offered with Cloudera’s distribution and are also available as open source via Project Rhino. Last, I will discuss how to use Amazon Web Services’ Elastic MapReduce (or VMs preinstalled with Hadoop) to implement encryption at rest.

Introduction to Data Encryption

Cryptography can be used very effectively to counter many kinds of security threats. Whether you call the data scrambled, disguised, or encrypted, it cannot be read, modified, or manipulated easily. Luckily, even though cryptography has its origin in higher mathematics, you do not need to understand its mathematical basis in order to use it. Simply understand that a common approach is to base the encryption on a key (a unique character pattern used as the basis for encryption and decryption) and an algorithm (logic used to scramble or descramble data, using the key as needed). See the “Basic Principles of Encryption” sidebar for more on the building blocks of encryption.

BASIC PRINCIPLES OF ENCRYPTION

As children, my friends and I developed our own special code language to communicate in school. Any messages that needed to be passed around during class contained number sequences like “4 21 8 0 28 18 24 0 6 18 16 12 17 10” to perplex our teachers if we were caught.

Our code is an example of a simple substitution cipher in which numbers (signifying position within the alphabet) were substituted for letters and then 3 was added to each number; 0 was used as a word separator. So, the above sequence simply asked the other guy, “are you coming?” While our code was very simple, data encryption in real-world applications uses complex ciphers that rely on complex logic for substituting the characters. In some cases, a key, such as a word or mathematical expression, is used to rearrange the cipher alphabet. So, for example, using “myword” as a key, ABCDEFGHIJKLMNOPQRSTUVWXYZ could map to mywordabcefghijklnpqstuvxz, meaning the cipher text for the phrase “Hello world” would be “Brggj ujngo”. To add complexity, you can substitute the position of a letter in the alphabet for x in the expression (2x + 5) mod 26 to map ABCDEFGHIJKLMNOPQRSTUVWXYZ to gikmoqsuwyabcdefhjlnprtvxz. Complex substitution ciphers can provide robust security, but a big issue is the time required to encrypt and decrypt them.

The other method of encryption is transposition (also called reordering, rearranging, or permutation). A transposition is an encryption where letters of the original text are rearranged to generate the encrypted text. By spreading the information across the message, transposition makes the message difficult to comprehend. A very simple example of this type of encryption is columnar transposition, which involves transposing rows of text to columns. For example, to transpose the phrase “CAN YOU READ THIS NOW” as a six-column transposition, I could write the characters in rows of six and arrange one row after another:

C A N Y O U
R E A D T H
I S N O W

The resulting cipher text would then be read down the columns as: “cri aes nan ydo otw uh”. Because of the storage space needed and the delay involved in decrypting the cipher text, this algorithm is not especially appropriate for long messages when time is of the essence.
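To make the mechanics concrete, here is a minimal Java sketch of the six-column transposition just described; the class and method names are mine, purely for illustration:

public class ColumnarTransposition {
    // Writes the message into rows of the given width and reads it back out column by column
    static String encrypt(String message, int columns) {
        String text = message.replace(" ", "").toLowerCase();
        StringBuilder cipher = new StringBuilder();
        for (int col = 0; col < columns; col++) {
            for (int row = col; row < text.length(); row += columns) {
                cipher.append(text.charAt(row));
            }
            cipher.append(' ');   // separator between columns, as in the example above
        }
        return cipher.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(encrypt("CAN YOU READ THIS NOW", 6));
    }
}

Running it prints “cri aes nan ydo otw uh”, matching the cipher text above.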

Although substitution and transposition ciphers are not used alone for real-world data encryption, their combination forms a basis for some widely used commercial-grade encryption algorithms.

Popular Encryption Algorithms

There are two fundamental types of key-based encryption: symmetric and asymmetric. Commonly called secret key algorithms, symmetric algorithms use the same key for encryption and decryption. Two users share a secret key that they both use to encrypt information sent to the other as well as to decrypt information received from the other, much as my childhood friends and I used the same number-substitution key to encode the notes we passed in class. Because a separate key is needed for each pair of users who plan to use it, key distribution is a major problem in using symmetric encryption. Mathematically, n users who need to communicate in pairs require n × (n – 1)/2 keys, so the number of keys grows quadratically with the number of users. Two popular algorithms that use symmetric keys are DES and AES (more on these shortly).

Asymmetric or public key systems don’t have the issues of key distribution and exponential number of keys. A public key can be distributed via an e-mail message or be copied to a shared directory. A message encrypted using it can be decrypted using the corresponding private key, which only the authorized user possesses. Since a user (within a system) can use any other user’s public key to encrypt a message meant for him (that user has a corresponding private key to decrypt it), the number of keys remains small—two times the number of users. The popular encryption algorithm RSA uses public key. Public key encryption, however, is typically 10,000 times slower than symmetric key encryption because the modular exponentiation that public key encryption uses involves multiplication and division, which is slower than the bit operations (addition, exclusive OR, substitution, shifting) that symmetric algorithms use. For this reason, symmetric encryption is used more commonly, while public key encryption is reserved for specialized applications where speed is not a constraint. One place public key encryption becomes very useful is symmetric key exchange: it allows for a protected exchange of a symmetric key, which can then be used to secure further communications.

Symmetric and asymmetric encryptions, and DES, AES, and RSA in particular, are used as building blocks to perform such computing tasks as signing documents, detecting a change, and exchanging sensitive data, as you’ll learn in the “Applications of Encryption” section. For now, take a closer look at each of these popular algorithms.

Data Encryption Standard (DES)

Developed by IBM from its Lucifer algorithm, the data encryption standard (DES) was officially adopted as a US federal standard in November 1976 for use on all public- and private-sector unclassified communication. The DES algorithm is a complex combination of two fundamental principles of encryption: substitution and transposition. The robustness of this algorithm is due to repeated application of these two techniques for a total of 16 cycles. The DES algorithm is a block algorithm, meaning it works with a 64-bit data block instead of a stream of characters. It splits an input data block in half, performs substitution on each half separately, fuses the key with one of the halves, and finally swaps the two halves. This process is performed 16 times and is detailed in the “DES Algorithm” sidebar.

DES ALGORITHM

For the DES algorithm, the first cycle of encryption begins when the first 64 data bits are transposed by initial permutation. First, the 64 transposed data bits are divided into left and right halves of 32 bits each. A 64-bit key (56 bits are used as the key; the rest are parity bits) is used to transform the data bits. Next, the key gets a left shift by a predetermined number of bits and is transposed. The resultant key is combined with the right half (substitution) and the result is combined with the left half after a round of permutation. This becomes the new right half. The old right half (one before combining with key and left half) becomes the new left half. This cycle (Figure 8-1) is performed 16 times. After the last cycle is completed, a final transposition (which is the inverse of the initial permutation) is performed to complete the algorithm.

Figure 8-1. Cycle of the DES algorithm

Because DES limits its arithmetic and logical operations to 64-bit numbers, it can be implemented efficiently in software on most current 64-bit operating systems.

The real weakness of this algorithm is against an attack called differential cryptanalysis, in which a key can be determined from chosen cipher texts in 2^58 searches. The cryptanalytic attack has not exposed any significant, exploitable vulnerability in DES, but the risks of using the 56-bit key are increasing with the easy availability of computing power. Although the computing power or time needed to break DES is still significant, a determined hacker can certainly decrypt text encrypted with DES. If a triple-DES approach (invoking DES three times for encryption, using the sequence: encryption via Key1, decryption using Key2, encryption using Key3) is used, the effective key length becomes 112 bits (if only two of the three keys are unique) or 168 bits (if Key1, Key2, and Key3 are all unique), increasing the difficulty of attack exponentially. DES can be used in the short term, but it is certainly at end-of-life and needs to be replaced by a more robust algorithm.
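Triple DES is exposed in the standard Java crypto API as the “DESede” transformation. The following is a minimal sketch of encrypting and decrypting a short placeholder string with it; it is illustration only (ECB mode, hard-coded sample text), not production code:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class TripleDesExample {
    public static void main(String[] args) throws Exception {
        // "DESede" is the JCE name for triple DES: encrypt with Key1, decrypt with Key2, encrypt with Key3
        KeyGenerator kg = KeyGenerator.getInstance("DESede");
        kg.init(168);                        // three independent 56-bit keys
        SecretKey key = kg.generateKey();

        Cipher cipher = Cipher.getInstance("DESede/ECB/PKCS5Padding");  // ECB used only to keep the example short
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] cipherText = cipher.doFinal("payroll record".getBytes("UTF-8"));  // placeholder plain text

        cipher.init(Cipher.DECRYPT_MODE, key);
        System.out.println(new String(cipher.doFinal(cipherText), "UTF-8"));     // prints "payroll record"
    }
}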

Advanced Encryption Standard (AES)

In 1997, the US National Institute of Standards and Technology called for a new encryption algorithm; subsequently, the Advanced Encryption Standard (AES) became the new standard in 2001. Originally called Rijndael, AES is also a block cipher and uses multiple cycles, or rounds, to encrypt data using an input data block size of 128 bits. Encryption keys of 128, 192, and 256 bits require 10, 12, or 14 cycles of encryption, respectively. The cycle of AES is simple, involving a substitution, two permuting functions, and a keying function (see the sidebar “AES Algorithm” for more detail). No practical attacks against AES are known, and it is in wide commercial use.

AES ALGORITHM

To help you visualize the operations of AES, let me first assume input data to be 9 bytes long and represent the AES matrix as a 3 × 3 array with the data bytes b0 through b8.

Depicted in Figure 8-2, each round of the AES algorithm consists of the following four steps:

  1. Substitute: To diffuse the data, each byte of a 128-bit data block is substituted using a substitution table.
  2. Shift row: The rows of data are permuted by a left circular shift; the first (leftmost, high order) n elements of row n are shifted around to the end (rightmost, low order). Therefore, a row n is shifted left circular (n – 1) bytes.
  3. Mix columns: To transform the columns, the three elements of each column are multiplied by a polynomial. For each element the bits are shifted left and exclusive-ORed with themselves to diffuse each element of the column over all three elements of that column.
  4. Add round key: Last, a portion of the key unique to this cycle (subkey) is exclusive-ORed or added to each data column. A subkey is derived from the key using a series of permutation, substitution, and ex-OR operations on the key.

Figure 8-2. Cycle of the AES algorithm
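In practice you call a library rather than implementing these rounds yourself. The following minimal Java sketch uses the standard javax.crypto API with AES in GCM mode; it assumes Java 7 or later (and, on older JDKs, the unlimited-strength policy files for 256-bit keys), and the plain text is only a placeholder:

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class AesAtRestExample {
    public static void main(String[] args) throws Exception {
        // Generate a 256-bit AES key
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey key = kg.generateKey();

        // A fresh random IV (nonce) must be used for every message
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] cipherText = cipher.doFinal("sensitive HR record".getBytes("UTF-8"));  // placeholder data

        // Decryption uses the same key and IV; tampered cipher text fails GCM authentication
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        System.out.println(new String(cipher.doFinal(cipherText), "UTF-8"));
    }
}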

Rivest-Shamir-Adleman Encryption

With DES, AES, and other symmetric key algorithms, each pair of users needs a separate key. Each time a user (n + 1) is added, n more keys are required, making it hard to track keys for each additional user with whom you need to communicate. Determining as well as distributing these keys can be a problem, as can maintaining security for the distributed keys, because they can’t all be memorized. Asymmetric or public key encryption, however, helps you avoid this and many other issues encountered with symmetric key encryption. The most famous algorithm that uses public key encryption is the Rivest-Shamir-Adleman (RSA) algorithm. Introduced in 1978 and named after its three inventors (Rivest, Shamir, and Adleman), RSA remains secure to date, with no serious flaws yet found. To understand how RSA works, see the “Rivest-Shamir-Adleman (RSA) Encryption” sidebar.

RIVEST-SHAMIR-ADLEMAN (RSA) ENCRYPTION

The RSA encryption algorithm combines results from number theory with the degree of difficulty in determining the prime factors of a given number. The RSA algorithm operates with arithmetic mod n; mod n for a number P is the remainder when you divide P by n.

The two keys used in RSA for decryption and encryption are interchangeable; either can be chosen as the public key, and the other is used as the private key. Any plain text block P is encrypted as P^e mod n. Because the exponentiation is performed mod n, and e as well as n are very large numbers (e is typically 100 digits and n typically 200), factoring P^e to decrypt the encrypted plain text is almost impossible. The decrypting key d is chosen so that (P^e)^d mod n = P. Therefore, the legitimate receiver who knows d can simply compute (P^e)^d mod n = P and thus recover P without needing to factor P^e. The encryption algorithm is based on the underlying problem of factoring large numbers, which has no easy or fast solution.

How are keys determined for encryption? If your plain text is P and you are computing P^e mod n, then the encryption keys will be the numbers e and n, and the decryption keys will be d and n. A product of the two prime numbers p and q, the value of n should be very large: typically at least 512 bits (about 155 decimal digits), and often around 200 decimal digits. If needed, n can be 768 bits or even 1024 bits. The larger the value of n, the greater the difficulty of factoring n to determine p and q.

As a next step, a number e is chosen such that e has no factors in common with (p - 1) × (q - 1). One way of ensuring this is to choose e as a prime number larger than (p - 1) as well as (q - 1).

Last, select such a number d that mathematically:

e × d = 1 mod ((p - 1) × (q - 1))

As you can see, even though n is known to be the product of two primes, if they are large, it is not feasible to determine the primes p and q or the private key d from e. Therefore, this scheme provides adequate security for d. That is also the reason RSA is secure and used commercially. It is important to note, though, that due to improved algorithms and increased computing power, RSA keys of 768 bits have been factored (though not trivially by any means), and 1024-bit keys are no longer considered safe. Therefore, the key size considered secure enough for most applications is 2048 bits; for more sensitive data, you should use 4096 bits.
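The arithmetic is easy to follow with toy numbers. The sketch below uses java.math.BigInteger with the commonly used textbook values p = 61, q = 53, and e = 17; real keys, of course, use primes hundreds of digits long:

import java.math.BigInteger;

public class ToyRsa {
    public static void main(String[] args) {
        // Toy primes -- real RSA uses primes hundreds of digits long
        BigInteger p = BigInteger.valueOf(61);
        BigInteger q = BigInteger.valueOf(53);
        BigInteger n = p.multiply(q);                              // n = 3233
        BigInteger phi = p.subtract(BigInteger.ONE)
                          .multiply(q.subtract(BigInteger.ONE));   // (p-1) x (q-1) = 3120

        BigInteger e = BigInteger.valueOf(17);                     // no factors in common with phi
        BigInteger d = e.modInverse(phi);                          // e x d = 1 mod ((p-1) x (q-1)), so d = 2753

        BigInteger plain = BigInteger.valueOf(65);                 // the plain text block P
        BigInteger cipher = plain.modPow(e, n);                    // P^e mod n
        BigInteger recovered = cipher.modPow(d, n);                // (P^e)^d mod n = P

        System.out.println(cipher + " decrypts back to " + recovered);   // prints "2790 decrypts back to 65"
    }
}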

Digital Signature Algorithm Encryption (DSA)

Another popular algorithm using public key encryption is DSA (Digital Signature Algorithm). As its name suggests, DSA is designed for digital signing rather than for general-purpose data encryption. Its security rests on the discrete logarithm problem, which is assumed to have no quick or efficient solution. Table 8-1 compares DSA with RSA.

Table 8-1. DSA vs. RSA

Attribute                          DSA           RSA
Key generation                     Faster
Encryption                                       Faster
Decryption                         Faster**
Digital signature generation       Faster
Digital signature verification                   Faster
Slower client                                    Preferable
Slower server                      Preferable

**Please note that “Faster” also implies less usage of computational resources.

To summarize, DSA and RSA have almost the same cryptographic strengths, although each has its own performance advantages. In case of performance issues, it might be a good idea to evaluate where the problem lies (at the client or server) and base your choice of key algorithm on that.
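If you need to find out where the time goes in your own environment, a rough micro-benchmark along the following lines can help. It assumes a Java 8 or later JDK (earlier releases do not ship SHA256withDSA or 2048-bit DSA keys), and the absolute numbers it prints will vary with the provider and hardware:

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class SignatureTiming {
    // Times 500 sign + verify rounds for the given key and signature algorithm
    static long timeSignVerify(String keyAlg, String sigAlg, int keySize) throws Exception {
        KeyPairGenerator kpg = KeyPairGenerator.getInstance(keyAlg);
        kpg.initialize(keySize);
        KeyPair kp = kpg.generateKeyPair();
        byte[] msg = new byte[1024];

        long start = System.nanoTime();
        for (int i = 0; i < 500; i++) {
            Signature s = Signature.getInstance(sigAlg);
            s.initSign(kp.getPrivate());
            s.update(msg);
            byte[] sig = s.sign();

            Signature v = Signature.getInstance(sigAlg);
            v.initVerify(kp.getPublic());
            v.update(msg);
            v.verify(sig);
        }
        return (System.nanoTime() - start) / 1_000_000;   // elapsed milliseconds
    }

    public static void main(String[] args) throws Exception {
        System.out.println("RSA: " + timeSignVerify("RSA", "SHA256withRSA", 2048) + " ms");
        System.out.println("DSA: " + timeSignVerify("DSA", "SHA256withDSA", 2048) + " ms");
    }
}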

Applications of Encryption

In many cases, one type of encryption is better suited to your needs than another, or you may need a combination of encryption methods. Four common applications of encryption algorithms that you’ll encounter are cryptographic hash functions, key exchange, digital signatures, and certificates. For HDFS, the client data transfer protocol (which runs over TCP/IP) uses SASL along with data encryption keys. The Hadoop web consoles and the MapReduce shuffle use HTTPS, which relies on public key certificates. Intel’s Hadoop distribution (now Project Rhino) uses symmetric keys for encryption at rest and certificates for encrypted data processing through MapReduce jobs. To better appreciate how Hadoop and others use these applications, you need to understand how each works.

Hash Functions

In some situations, integrity is a bigger concern than secrecy. For example, in a document management system that stores legal documents or manages loans, knowing that a document has not been altered is important. So, encryption can be used to provide integrity as well.

In most files, components of the content are not bound together in any way. In other words, each character is independent in a file, and even though changing one value affects the integrity of the file, it can easily go undetected. Encryption can be used to “seal” a file so that any change can be easily detected. One way of providing this seal is to compute a cryptographic function, called a hash or checksum, or a message digest, of the file. Because the hash function depends on all bits of the file being sealed, altering one bit will alter the checksum result. Each time the file is accessed or used, the hash function recomputes the checksum, and as long as the computed checksum matches the stored value, you know the file has not been changed.

DES and AES work well for sealing values, because a key is needed to modify the stored value (to match modified data). Block ciphers also use a technique called chaining: a block is linked to the previous block’s value (and hence to all previous blocks in a file, like a chain) by using an exclusive OR to combine the encrypted previous block with the encryption of the current one. A file’s cryptographic checksum can therefore be the last block of the chained encryption of the file, because that block depends on all the other blocks. Popular hash functions are MD4, MD5 (MD meaning Message Digest), and SHA/SHS (Secure Hash Algorithm or Standard). In fact, Hadoop uses the SASL DIGEST-MD5 mechanism for authentication when a Hadoop client with Hadoop token credentials connects to a Hadoop daemon (e.g., a MapReduce task reading/writing to HDFS).
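As a minimal illustration, the following Java sketch computes a SHA-256 checksum of a file passed as a command-line argument; recomputing the digest later and comparing it to the stored value reveals any change. In a real system you would also protect the stored digest itself, for example with a keyed hash:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class FileChecksum {
    public static void main(String[] args) throws Exception {
        // Compute a SHA-256 digest of the file; any single-bit change produces a different digest
        byte[] content = Files.readAllBytes(Paths.get(args[0]));
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);

        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);   // store this value; a later mismatch means the file was altered
    }
}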

Key Exchange

Suppose you need to exchange information with an unknown person (who does not know you either), while making sure that no one else has access to the information. The solution is public key cryptography. Because asymmetric keys come in pairs, one half of the pair can be exposed without compromising the other half. A private key can be used to encrypt, and the recipient just needs to have access to the public key to decrypt it. To understand the significance of this, consider an example key exchange.

Suppose Sam and Roy need to exchange a shared symmetric key, and both have public keys for a common encryption algorithm (call these KPUB-S and KPUB-R) as well as private keys (call these KPRIV-S and KPRIV-R). The simplest solution is for Sam to choose any symmetric key K, and encrypt it using his private key (KPRIV-S) and send to Roy, who can use Sam’s public key to remove the encryption and obtain K. Unfortunately, anyone with access to Sam’s public key can also obtain the symmetric key K that is only meant for Roy. So, a more secure solution is for Sam to first encrypt the symmetric key K using his own private key and then encrypt it again using Roy’s public key. Then, Roy can use his private key to decrypt the first level of encryption (outer encryption)—something only he can do—and then use Sam’s public key to decrypt the “inner encryption” (proving that communication came from Sam). So, in conclusion, the symmetric key can be exchanged without compromising security.
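Modern crypto libraries express this idea as key wrapping. The sketch below shows only the confidentiality half of the Sam-and-Roy exchange using the standard javax.crypto API: the AES session key is wrapped under Roy’s public key so that only Roy’s private key can unwrap it. To prove the key really came from Sam, as in the double encryption described above, Sam would additionally sign the wrapped key with his private key:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Arrays;

public class KeyExchangeSketch {
    public static void main(String[] args) throws Exception {
        // Roy's public/private key pair (the recipient)
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair roy = kpg.generateKeyPair();

        // Sam chooses a symmetric session key K
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey k = kg.generateKey();

        // Sam wraps (encrypts) K under Roy's public key; only Roy's private key can unwrap it
        Cipher wrap = Cipher.getInstance("RSA");
        wrap.init(Cipher.WRAP_MODE, roy.getPublic());
        byte[] wrappedKey = wrap.wrap(k);

        // Roy unwraps K with his private key
        Cipher unwrap = Cipher.getInstance("RSA");
        unwrap.init(Cipher.UNWRAP_MODE, roy.getPrivate());
        SecretKey recovered = (SecretKey) unwrap.unwrap(wrappedKey, "AES", Cipher.SECRET_KEY);

        System.out.println("Keys match: " + Arrays.equals(k.getEncoded(), recovered.getEncoded()));
    }
}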

Digital Signatures and Certificates

Today, most of our daily transactions are conducted in the digital world, so the concept of a signature for approval has evolved to a model that relies on mutual authentication of digital signatures. A digital signature is a protocol that works like a real signature: it can provide a unique mark for a sender, and enable others to identify a sender from that mark and thereby confirm an agreement. Digital signatures need the following properties:

  • Unreproducible
  • Uniquely traceable to its source (it can have come only from the expected sender)
  • Inseparable from the message
  • Immutable after being transmitted
  • Recent and usable only once (duplicate use must be detectable)

Public key encryption systems are ideally suited to digital signatures. For example, a publishing company can first encrypt a contract using their own private key and then encrypt it again using the author’s public key. The author can use his private key to decrypt the first level of encryption, and then use publisher’s public key to decrypt the inner encryption to get to the contract. After that, the author can “sign” it by creating a hash value of the contract and then encrypting the contract and the hash with his own private key. Finally, he can add one more layer of encryption by encrypting again using the publisher’s public key and then e-mail the encrypted contract back to the publisher. Because only the author and publisher have access to their private keys, the exchange clearly is unforgeable and uniquely authentic. The hash function and checksum confirm immutability (assuming an initial checksum of the contract was computed and saved for comparison), while the frequency and timestamps of the e-mails ensure one-time recent usage. Figure 8-3 summarizes the process.

Figure 8-3. Using Digital signatures for encrypted communication

In Figure 8-3, E(C,KPRIV-P) means contract C was encrypted using KPRIV-P. Similarly, D(E(E(C,KPRIV-P), KPUB-A), KPRIV-A) means the first level of the doubly encrypted contract sent to the author was decrypted using KPRIV-A.
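In code, the core sign-and-verify step uses the standard java.security.Signature API. The sketch below is only an illustration: the contract text is a placeholder, the key pair is generated on the fly, and the SHA-256 hashing is performed internally by the SHA256withRSA algorithm:

import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class ContractSignature {
    public static void main(String[] args) throws Exception {
        byte[] contract = "Author agrees to deliver the manuscript by June 1."   // placeholder contract text
                .getBytes(StandardCharsets.UTF_8);

        // The author's key pair; the private key never leaves the author's machine
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair author = kpg.generateKeyPair();

        // Sign: hashes the contract (SHA-256) and encrypts the digest with the private key
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(author.getPrivate());
        signer.update(contract);
        byte[] signature = signer.sign();

        // Anyone holding the author's public key can verify the signature;
        // altering even one byte of the contract makes verification fail
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(author.getPublic());
        verifier.update(contract);
        System.out.println("Signature verifies: " + verifier.verify(signature));
    }
}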

A digital certificate, which is founded on trust established through a commonly respected third party, serves a role among multiple parties similar to the one a digital signature serves for two individuals. A certificate associates a public key with a user’s identity and is then “signed” by a certificate authority, which certifies the accuracy of the association and authenticates the identity.

For example, a publishing company might set up a certificate scheme to authenticate authors, their agents, and company editors in the following way. First, the publisher selects a public key pair, posts the public key where everyone in the company has access to it, and retains the private key. Then, each editor creates a public key pair, puts the public key in a message together with his or her identity, and passes the message securely to the publisher. The publisher signs it by creating a hash value of the message and then encrypting the message and the hash with his or her private key. By signing the message, the publisher affirms that the public key (the editor’s) and the identity (also the editor’s) in the message are for the same person. This message is called the editor’s certificate. The author can create a message with his public key, and the author’s agent can sign, hash, and return it. That will be the author’s certificate. So, the author and editor’s certificates can thus be set up and used for verifying their identities. Anyone can verify the editor’s certificate by starting with the publisher’s public key and decrypting the editor’s certificate to retrieve his or her public key and identity. The author’s certificate can be verified by starting with the public key the agent obtained from the publisher and using that to decrypt the certificate to retrieve the author’s public key and identity.

Because Hadoop uses different types of encryption for its various components, I will briefly discuss where each of these encryptions is used in the next section.

Hadoop Encryption Options Overview

When considering encryption of sensitive data in Hadoop, you need to consider data “at rest” stored on disks within your cluster nodes, as well as data in transit, which moves during communication among the various nodes and between nodes and clients. Chapter 4 explained the details of securing data in transit between nodes and applications; you can configure individual Hadoop ecosystem components for encryption (using the component’s configuration file) just as you would configure Hadoop’s RPC communication for encryption. For example, to configure SSL encryption for Hive, you would set the property hive.server2.use.SSL to true in hive-site.xml and specify the KeyStore using the properties hive.server2.keystore.path and hive.server2.keystore.password. This chapter, therefore, focuses on encrypting Hadoop data at rest.

Note  Encryption is a CPU-intensive activity that can tax your hardware and slow its processing. Weigh the decision to use encryption carefully. If you determine encryption is necessary, implement it for all the data stored within your cluster as well as for processing related to that data.

For a Hadoop cluster, data at rest is the data distributed across all the DataNodes. The need for encryption may arise because the data is sensitive and must be protected, or because encryption is required for compliance with regulations such as HIPAA in the healthcare industry or SOX in the financial industry.

Although no Hadoop distribution currently provides encryption at rest, such major vendors as Cloudera and Hortonworks offer third-party solutions. For example, Cloudera works with zNcrypt from Gazzang to provide encryption at rest for data blocks as well as files. For additional protection, zNcrypt uses process-based ACLs and keys. In addition, Amazon Web Services (AWS) offers encryption at rest with its Elastic MapReduce web service and S3 storage (you’ll learn more about this shortly), and Intel’s distribution of Hadoop also offers encryption at rest. But all these solutions are either proprietary or limit you to a particular distribution of Hadoop.

For an open source solution to encrypt Hadoop data at rest, you can use Project Rhino. In 2013, Intel started an open source project to improve the security capabilities of Hadoop and the Hadoop ecosystem by contributing code to Apache. This code is not yet implemented in Apache Foundation’s Hadoop distribution, but it contains enhancements that include distributed key management and the capability to do encryption at rest. The overall goals for this open source project are as follows:

  • Support for encryption and key management
  • A common authorization framework (beyond ACLs)
  • A common token-based authentication framework
  • Security improvements to HBase
  • Improved security auditing

You can check the progress of Project Rhino at https://github.com/intel-hadoop/project-rhino, and learn more about it in the next section.

Encryption Using Intel’s Hadoop Distro

In 2013, Intel announced its own Hadoop distribution—a strange decision for a hardware manufacturing company, entering the Big Data arena belatedly with a Hadoop distribution. Intel, however, assured the Hadoop world that its intentions were only to contribute to the Hadoop ecosystem (Apache Foundation) and help out with Hadoop security concerns. Intel claimed its Hadoop distribution worked in perfect harmony with specific Intel chips (used as the CPU) to perform encryption and decryption about 10 to 15 times faster than current alternatives.

Around the same time, I had a chance to work with an Intel team on a pilot project for a client who needed data stored within HDFS to be encrypted, and I got to know how Intel’s encryption worked. The client used Hive for queries and reports, and Intel offered encryption that covered HDFS as well as Hive. Although the distribution I used (which forms the basis of the information presented in this section) is not available commercially, most of the functionality it offered will be available through Project Rhino and Cloudera’s Hadoop distribution (now that Intel has invested in it).

Specifically, the Intel distribution used codecs to implement encryption (more on these in a moment) and offered file-level encryption that could be used with Hive or HBase. It used symmetric as well as asymmetric keys in conjunction with Java KeyStores (see the sidebar “KeyStores and TrustStores” for more information). The details of the implementation I used will give you some insight into the potential of Project Rhino.

KEYSTORES AND TRUSTSTORES

A KeyStore is a database or repository of keys or trusted certificates that are used for a variety of purposes, including authentication, encryption, and data integrity. A key entry contains the owner’s identity and private key, whereas a trusted certificate entry contains only a public key in addition to the entity’s identity. For better management and security, you can use two KeyStores: one containing your key entries and the other containing your trusted certificate entries (including Certificate Authorities’ certificates). Access can be restricted to the KeyStore with your private keys, while trusted certificates reside in a more publicly accessible TrustStore.

Used when making decisions about what to trust, a TrustStore contains certificates from someone you expect to communicate with or from Certificate Authorities that you trust to identify other parties. Add an entry to a TrustStore only if you trust the entity from which the potential entry originated.

Various types of KeyStores are available, such as PKCS12 or JKS. JKS is most commonly used in the Java world. PKCS12 isn’t Java specific but is convenient to use with certificates that have private keys backed up from a browser or the ones coming from OpenSSL-based tools. PKCS12 is mainly useful as a KeyStore but less so for a TrustStore, because it needs to have a private key associated with certificates. JKS doesn’t require each entry to be a private key entry, so you can use it as a TrustStore for certificates you trust but for which you don’t need private keys.

Step-by-Step Implementation

The client’s requirement was encryption at rest for sensitive financial data stored within HDFS and accessed using Hive. So, I had to make sure that the data file, which was pushed from SQL Server as a text file, was encrypted while it was stored within HDFS and also was accessible normally (with decryption applied) through Hive, to authorized users only. Figure 8-4 provides an overview of the encryption process.

Figure 8-4. Data encryption at Intel Hadoop distribution

The first step to achieve my goal was to create a secret (symmetric) key and a KeyStore with the following command (I created a directory /keys under my home directory and created all encryption-related files there):

> keytool -genseckey -alias BCLKey -keypass bcl2601 -storepass bcl2601 -keyalg AES -keysize 256 -keystore BCLKeyStore.keystore -storetype JCEKS

This keytool command generates the secret key BCLKey and stores it in a newly created KeyStore called BCLKeyStore. The keyalg parameter specifies the algorithm AES to be used to generate the secret key, and keysize 256 specifies the size of the key to be generated. Last, keypass is the password used to protect the secret key, and storepass does the same for the KeyStore. You can adjust permissions for the KeyStore with:

> chmod 600 BCLKeyStore.keystore
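If you want to verify the key programmatically, an application can load it back from the JCEKS KeyStore with a few lines of Java. The path below is a placeholder for wherever you created the KeyStore, and the passwords match the keytool command above:

import java.io.FileInputStream;
import java.security.KeyStore;
import javax.crypto.SecretKey;

public class ReadSecretKey {
    public static void main(String[] args) throws Exception {
        // Load the JCEKS KeyStore created by keytool and retrieve the AES key by its alias
        KeyStore ks = KeyStore.getInstance("JCEKS");
        try (FileInputStream in = new FileInputStream("BCLKeyStore.keystore")) {   // adjust the path to your location
            ks.load(in, "bcl2601".toCharArray());                                  // storepass
        }
        SecretKey key = (SecretKey) ks.getKey("BCLKey", "bcl2601".toCharArray());  // keypass
        System.out.println(key.getAlgorithm() + " key, " + key.getEncoded().length * 8 + " bits");
    }
}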

Next, I created a key pair (private/public key) and KeyStore with the command:

> keytool -genkey -alias KEYCLUSTERPRIVATEASYM -keyalg RSA -keystore clusterprivate.keystore -storepass 123456 -keypass 123456 -dname "CN= JohnDoe, OU=Development, O=Intel, L=Chicago, S=IL, C=US" -storetype JKS -keysize 1024

This generates a key pair (a public key and associated private key) and single-element certificate chain stored as entry KEYCLUSTERPRIVATEASYM in the KeyStore clusterprivate.keystore. Notice the use of algorithm RSA for public key encryption and the key length of 1024. The parameter dname specifies the name to be associated with alias, and is used as the issuer and subject in the self-signed certificate.

I distributed the created KeyStore clusterprivate.keystore across the cluster using Intel Manager: Admin ➤ Configuration ➤ Security ➤ Key Management.

To create a TrustStore, I next took the following steps:

  1. Extract the certificate from the newly created KeyStore with the command:
    > keytool -export -alias KEYCLUSTERPRIVATEASYM -keystore clusterprivate.keystore -rfc -file hivepublic.cert -storepass 123456

    From the KeyStore clusterprivate.keystore, the command reads the certificate associated with alias KEYCLUSTERPRIVATEASYM and stores it in the file hivepublic.cert. The certificate will be output in the printable encoding format (as the -rfc option indicates).

  2. Create a TrustStore containing the public certificate:
    > keytool -import -alias HIVEKEYCLUSTERPUBLICASYM -file hivepublic.cert -keystore clusterpublic.TrustStore -storepass 123456

    This command reads the certificate (or certificate chain) from the file hivepublic.cert and stores it in the KeyStore (used as a TrustStore) entry identified by HIVEKEYCLUSTERPUBLICASYM. The TrustStore clusterpublic.TrustStore is created and the imported certificate is added to the list of trusted certificates.

  3. Change clusterpublic.TrustStore ownership to root, group to hadoop, and permissions "644" (read/write for root and read for members of all groups) with the commands:
    > chmod 644 clusterpublic.TrustStore
    > chown root:hadoop clusterpublic.TrustStore
  4. Create a file TrustStore.passwords, set its permission to “644”, and add the following contents to the file: keystore.password=123456.
  5. Copy the /keys directory and all of its files to all the other nodes in the cluster. On each node, the KeyStore directory must be in /usr/lib/hadoop/.

With the TrustStore ready, I subsequently created a text file (bcl.txt) to use for testing encryption and copied it to HDFS:

> hadoop fs -mkdir /tmp/bcl
> hadoop fs -put bcl.txt /tmp/bcl

I started Pig (> pig) and was taken to the grunt> prompt. I executed the following commands within Pig to set all the required environment variables:

set KEY_PROVIDER_PARAMETERS 'keyStoreUrl=file:////root/bcl/BCLKeyStore.keystore&keyStoreType=JCEKS&password=bcl2601';
set AGENT_SECRETS_PROTECTOR 'com.intel.hadoop.mapreduce.crypto.KeyStoreKeyProvider';

set AGENT_PUBLIC_KEY_PROVIDER 'org.apache.hadoop.io.crypto.KeyStoreKeyProvider';

set AGENT_PUBLIC_KEY_PROVIDER_PARAMETERS 'keyStoreUrl=file:////keys/clusterpublic.TrustStore&keyStoreType=JKS&password=123456';
set AGENT_PUBLIC_KEY_NAME 'HIVEKEYCLUSTERPUBLICASYM';
set pig.encrypt.keyProviderParameters 'keyStoreUrl=file:////root/bcl/BCLKeyStore.keystore&keyStoreType=JCEKS&password=bcl2601';

Next, to read the bcl.txt file from HDFS, encrypt it, and store it into the same location in a directory named bcl_encrypted, I issued the commands:

raw = LOAD '/tmp/bcl/bcl.txt' AS (name:chararray,age:int,country:chararray);
STORE raw INTO '/tmp/bcl/bcl_encrypted' USING PigStorage(' ','-keyName BCLKey');

After exiting Pig, I checked contents of the encrypted file by issuing the command:

> hadoop fs -cat /tmp/bcl/bcl_encrypted/part-m-00000.aes

The garbled control characters in the output confirmed that the file was encrypted. I then created a Hive external table and pointed it to the encrypted file using the following steps:

  1. Start Hive.
  2. Set the environment variables:
    set hive.encrypt.master.keyName=BCLKey;
    set hive.encrypt.master.keyProviderParameters=keyStoreUrl=file:////root/bcl/BCLKeyStore.keystore&keyStoreType=JCEKS&password=bcl2601;
    set hive.encrypt.keyProviderParameters=keyStoreUrl=file:////root/bcl/BCLKeyStore.keystore&keyStoreType=JCEKS&password=bcl2601;
    set mapred.crypto.secrets.protector.class=com.intel.hadoop.mapreduce.cryptocontext.provider.AgentSecretsProtector;
    set mapred.agent.encryption.key.provider=org.apache.hadoop.io.crypto.KeyStoreKeyProvider;
    set mapred.agent.encryption.key.provider.parameters=keyStoreUrl=file:////keys/clusterpublic.TrustStore&keyStoreType=JKS&password=123456;
    set mapred.agent.encryption.keyname=HIVEKEYCLUSTERPUBLICASYM;
  3. Create an encrypted external table pointing to the encrypted data file created by Pig:
    create external table bcl_encrypted_pig_data(name STRING, age INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/tmp/bcl/bcl_encrypted/' TBLPROPERTIES("hive.encrypt.enable"="true", "hive.encrypt.keyName"="BCLKey");
  4. Once the table is created, decrypted data can be viewed by any authorized client (having appropriate key and certificate files within /usr/lib/hadoop/keys directory) using the select query (in Hive syntax) at the Hive prompt:
    > select * from bcl_encrypted_pig_data;

To summarize, to implement the Intel distribution for use with Hive, I set up the keys, KeyStores, and certificates that were used for encryption. Then I extracted the certificate from the KeyStore and imported it into a TrustStore. Note that although I created the key pair and certificate for a user JohnDoe in the example, for a multiuser environment you will need to create a key pair and certificates for all authorized users.

A symmetric key was used to encrypt data within HDFS (and with Hive). MapReduce used a public key and certificate, because client communication within Hive goes through MapReduce jobs. That is also why a key pair and certificate are necessary for every user authorized to access the encrypted data through Hive.

Special Classes Used by Intel Distro

The desired functionality of encryption at rest needs special codecs, classes, and supporting logic. Although many classes and codecs were available, they didn’t work together, backed by a common logic, to provide encryption functionality. Intel added that underlying logic in its distribution.

For example, org.apache.hadoop.io.crypto.KeyStoreKeyProvider is an implementation of the class org.apache.hadoop.io.crypto.KeyProvider. The corresponding Apache class for HBase is org.apache.hadoop.hbase.io.crypto.KeyStoreKeyProvider, which is an implementation of org.apache.hadoop.hbase.io.crypto.KeyProvider. This class is used to resolve keys from a protected KeyStore file on the local file system. Intel has used this class to manage keys stored in KeyStore (and TrustStore) files. The other HBase classes used are:

org.apache.hadoop.hbase.io.crypto.Cipher
org.apache.hadoop.hbase.io.crypto.Decryptor
org.apache.hadoop.hbase.io.crypto.Encryptor

How are these classes used? For example, in Java terms, the method Encryption.decryptWithSubjectKey for class org.apache.hadoop.hbase.io.crypto.Cipher decrypts a block of encrypted data using the symmetric key provided; whereas the method Encryption.encryptWithSubjectKey encrypts a block of data using the provided symmetric key. So, to summarize, this class provides encryption/decryption using the symmetric key.

The Intel custom class com.intel.hadoop.mapreduce.crypto.KeyStoreKeyProvider was designed for encrypted MapReduce processing and works similarly to the Apache Hadoop crypto class mapred.crypto.KeyStoreKeyProvider. It is adapted for use with MapReduce jobs and is capable of processing certificates as well as keys.

Most of these classes are developed and used by the Apache Foundation. The only difference is that the Apache Foundation’s Hadoop distribution doesn’t use these classes to provide cumulative functionality of encryption at rest, nor do any of the other distributions available commercially. Project Rhino is trying to remedy that situation, and since even the Intel custom classes and codecs are available for their use, you can expect the encryption-at-rest functionality to be available through Project Rhino very soon.

Using Amazon Web Services to Encrypt Your Data

As you have seen, installing and using encryption can be a tough task, but Amazon has consciously endeavored to make it simple. AWS offers easy options that eliminate most of the work and time needed to install, configure, manage, and use encryption with Hadoop. With AWS, you have the option of doing none, some, or all of the work, depending on the configured service you rent. For example, if you need to focus on other parts of your project (such as designing the ETL to bulk load data from an RDBMS, a relational database management system, into HDFS, or building analytics), you can have AWS take care of fully implementing encryption at rest for your data.

Deciding on a Model for Data Encryption and Storage

AWS provides several configurations or models for encryption usage. The first model, model A, lets you control the encryption method as well as the KMI (key management infrastructure). It offers you the utmost flexibility and control, but you do all the work. Model B lets you control the encryption method while AWS stores the keys; you still get to manage your keys. The most rigid choice, model C, gives you no control over the KMI or the encryption method, although it is the easiest to implement because AWS does it all for you. To implement model C, you need to use an AWS service that supports server-side encryption, such as Amazon S3, Amazon EMR, Amazon Redshift, or Amazon Glacier.

To demonstrate, I will implement encryption at rest using Amazon’s model C. Why C? The basic steps are easy to understand, and you can use the understanding you gain to implement model A, for which you need to implement all the tasks (I have provided steps for implementing model A as a download on the Apress web site). I will use Amazon EMR (or Elastic MapReduce, which provides an easy-to-use Hadoop implementation running on Amazon Elastic Compute Cloud, or EC2) along with Amazon S3 for storage. Please note: One caveat of renting the EMR service is that AWS charges by the “normalized” hour, not actual hours, because the plan uses multiple AWS “appliances” and at least two EC2 instances.

If you are unfamiliar with AWS’s offerings, EC2 is the focal point of AWS. EC2 allows you to rent a virtual server (or virtual machine), which is a preconfigured Amazon Machine Image with the desired operating system and your choice of virtual hardware resources (CPU, RAM, disk storage, etc.). You can boot (or start) this virtual machine, or instance, and run your own applications as desired. The term elastic refers to the flexible, pay-by-hour model for any instances that you create and use. Figure 8-5 displays the AWS management console, which is where you need to start when “renting” various AWS components (assuming you have created an AWS account first): http://aws.amazon.com.

Figure 8-5. AWS Management console

Getting back to the implementation using model C, if you specify server-side encryption while procuring the EMR cluster (choose the Elastic MapReduce option in the AWS console as shown in Figure 8-5), the EMR model provides server-side encryption of your data and manages the encryption method as well as keys transparently for you. Figure 8-6 depicts the “Envelope encryption” method AWS uses for server-side encryption. The basic steps are as follows:

  1. The AWS service generates a data key when you request that your data be encrypted.
  2. AWS uses the data key to encrypt your data.
  3. AWS uses a key-encrypting key (unique to S3 in this case) to encrypt the data key.
  4. The encrypted data key and the encrypted data are stored in S3; the key-encrypting key is stored separately from both.

Figure 8-6. Envelope encryption by AWS

For data retrieval and decryption, this process is reversed. First, the encrypted data key is decrypted using the key-encrypting key, and then it is used to decrypt your data.

As you can see from Figure 8-6, the S3 storage service supports server-side encryption. Amazon S3 server-side encryption uses 256-bit AES symmetric keys for data keys as well as master (key-encrypting) keys.
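To see what envelope encryption amounts to when you have to do it yourself (as in model A), here is a minimal Java sketch of the pattern: a per-object data key encrypts the data, and a master (key-encrypting) key encrypts the data key. This is only an illustration of the concept, with placeholder data, not how S3 implements it internally:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;

public class EnvelopeEncryptionSketch {
    public static void main(String[] args) throws Exception {
        SecureRandom rng = new SecureRandom();
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);

        SecretKey masterKey = kg.generateKey();   // key-encrypting key, held by the service
        SecretKey dataKey = kg.generateKey();     // per-object data key

        // 1. Encrypt the data with the data key
        byte[] dataIv = new byte[12];
        rng.nextBytes(dataIv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, dataKey, new GCMParameterSpec(128, dataIv));
        byte[] encryptedData = c.doFinal("customer records".getBytes("UTF-8"));   // placeholder data

        // 2. Encrypt (wrap) the data key with the master key
        byte[] keyIv = new byte[12];
        rng.nextBytes(keyIv);
        c.init(Cipher.ENCRYPT_MODE, masterKey, new GCMParameterSpec(128, keyIv));
        byte[] encryptedDataKey = c.doFinal(dataKey.getEncoded());

        // encryptedData and encryptedDataKey are stored together; masterKey is stored elsewhere.
        // To read the data back, first recover the data key ...
        c.init(Cipher.DECRYPT_MODE, masterKey, new GCMParameterSpec(128, keyIv));
        SecretKey recoveredKey = new SecretKeySpec(c.doFinal(encryptedDataKey), "AES");

        // ... then decrypt the data with it
        c.init(Cipher.DECRYPT_MODE, recoveredKey, new GCMParameterSpec(128, dataIv));
        System.out.println(new String(c.doFinal(encryptedData), "UTF-8"));
    }
}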

Encrypting a Data File Using Selected Model

In this section, I will discuss step-by-step implementation for the EMR-based model C, in which AWS manages your encryption method and keys transparently. As mentioned earlier, you can find the steps to implement model A on the Apress web site.

Create S3 Storage Through AWS

You need to create the storage first, because you will need it for your EMR cluster. Simply log in to the AWS management console, select service S3, and create a bucket named htestbucket and a folder test within (Figure 8-7).

Figure 8-7. Create an S3 bucket and folder

Specify server-side encryption for folder test that you created (Figure 8-8).

Figure 8-8. Activate server-side encryption for a folder

Adjust the permissions for the bucket htestbucket created earlier, as necessary (Figure 8-9).

Figure 8-9. Adjust permissions for an S3 bucket

Create a Key Pair (bclkey) to Be Used for Authentication

Save the .pem file to your client. Use PuTTYgen to create a .ppk (private key file) that can be used for authentication with PuTTY to connect to the EMR cluster (Master node). For details on using PuTTY and PuTTYgen, please see Chapter 4 and Appendix B. Figure 8-10 shows the AWS screen for key pair creation. To reach it, choose service EC2 on the AWS management console, and then the option Key Pairs.

Figure 8-10. Creating a key pair within AWS

Create an Access Key ID and a Secret Access Key

These keys are used as credentials for encryption and are associated with a user. If you don’t have any users created and are using the root account for AWS, then you need to create these keys for root. From the Identity and Access Management (IAM) Management console (Figure 8-11), select Dashboard, and then click the first option, Delete your root access keys. (If you don’t have these keys created for root, you won’t see this warning.) To reach the IAM console, choose the “Identity & Access Management” service on the AWS management console.

Figure 8-11. IAM console for AWS

Click the Manage Security Credentials button; when the warning appears, choose “Continue to Security Credentials” (Figure 8-12).

Figure 8-12. Creation of security credentials

Your AWS root account is like a UNIX root account, and AWS doesn’t recommend using that. Instead, create user accounts with roles, permissions, and access keys as needed. If you do so, you can more easily customize permissions without compromising security. Another thing to remember about using the root account is that you can’t retrieve the access key ID or secret access key if you lose it! So, I created a user Bhushan for use with my EMR cluster (Figure 8-13). I used the “Users” option and “Create New Users” button from the Identity and Access Management (IAM) Management console (Figure 8-11) to create this new user.

Figure 8-13. Creation of a user for use with EMR cluster

To set your keys for a user, again begin on the IAM Management Console, and select the Users option, then a specific user (or create a user). Next, open the Security Credentials area and create an access key ID and a secret access key for the selected user (Figure 8-13).

Note  When you create the access key ID and secret access key, you can download them and save them somewhere safe as a backup. Taking this precaution is certainly easier than creating a fresh set of keys if you lose them.

Create the AWS EMR Cluster Specifying Server-Side Encryption

With the preparatory steps finished, you’re ready to create an EMR cluster. Log on to the AWS management console, select the Elastic MapReduce service, and click the Create Cluster button. Select the “Server-side encryption” and “Consistent view” configuration options and leave the others at their defaults (Figure 8-14).

Figure 8-14. Creation of EMR cluster

In the Hardware Configuration section (Figure 8-15), request one Master EC2 instance to run JobTracker and NameNode and one Core EC2 instance to run TaskTrackers and DataNodes. (This is just for testing; in the real world, you would need to procure multiple Master or Core instances depending on the processing power you require.) In the Security Access section, specify one of the key pairs created earlier (bclkey), while in the IAM Roles section, set EMR_DefaultRole and EMR_EC2_DefaultRole for the EMR roles. Make sure that these roles have permissions to access the S3 storage (bucket and folders) and any other resources you need to use.

Figure 8-15. Hardware configuration for EMR cluster

After you check all the requested configuration options, click on the “Create Cluster” button at the bottom of the screen to create an EMR cluster as per your requirements.

In a couple of minutes, you will receive a confirmation of cluster creation similar to Figure 8-16.

Figure 8-16. EMR cluster created

Test Encryption

As a final step, test if the “at rest” encryption between EMR and S3 is functional. As per the AWS and EMR documentation, any MapReduce jobs transferring data from HDFS to S3 storage (or S3 to HDFS) should encrypt the data written to persistent storage.

You can verify this using the Amazon utility S3DistCp, which is designed to move large amounts of data between Amazon S3 and HDFS (from the EMR cluster). S3DistCp supports the ability to request Amazon S3 to use server-side encryption when it writes EMR data to an Amazon S3 bucket you manage. Before you use it, however, you need to add the following configuration to your core-site.xml (I have blanked out my access keys):

<property>
 <name>fs.s3.awsSecretAccessKey</name>
 <value>xxxxxxxxxxxxxxxxxxxx</value>
</property>
<property>
 <name>fs.s3.awsAccessKeyId</name>
 <value>yyyyyyyyyyyyyyyyyyyy</value>
</property>
<property>
 <name>fs.s3n.awsSecretAccessKey</name>
 <value>xxxxxxxxxxxxxxxxxxxx</value>
</property>
<property>
 <name>fs.s3n.awsAccessKeyId</name>
 <value>yyyyyyyyyyyyyyyyyyyy</value>
</property>

Remember to substitute values for your own access key ID and secret access key. There is no need to restart any Hadoop daemons.

Next, make sure that the following jars exist in your /home/hadoop/lib (/lib under my Hadoop install directory). If not, find and copy them there:

/home/hadoop/lib/emr-s3distcp-1.0.jar
/home/hadoop/lib/gson-2.1.jar
/home/hadoop/lib/emr-s3distcp-1.0.jar
/home/hadoop/lib/EmrMetrics-1.0.jar
/home/hadoop/lib/httpcore-4.1.jar
/home/hadoop/lib/httpclient-4.1.1.jar

Now, you’re ready to run the S3DistCp utility and copy a file test1 from HDFS to folder test for S3 bucket htestbucket:

> hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -libjars /home/hadoop/lib/gson-2.1.jar,/home/hadoop/lib/emr-s3distcp-1.0.jar,/home/hadoop/lib/EmrMetrics-1.0.jar,/home/hadoop/lib/httpcore-4.1.jar,/home/hadoop/lib/httpclient-4.1.1.jar --src /tmp/test1 --dest s3://htestbucket/test/ --disableMultipartUpload --s3ServerSideEncryption

My example produced the following response in a few seconds:

14/10/10 03:27:47 INFO s3distcp.S3DistCp: Running with args: -libjars /home/hadoop/lib/gson-2.1.jar,/home/hadoop/lib/emr-s3distcp-1.0.jar,/home/hadoop/lib/EmrMetrics-1.0.jar,/home/hadoop/lib/httpcore-4.1.jar,/home/hadoop/lib/httpclient-4.1.1.jar --src /tmp/test1 --dest s3://htestbucket/test/ --disableMultipartUpload --s3ServerSideEncryption
....
....
14/10/10 03:27:51 INFO client.RMProxy: Connecting to ResourceManager at
14/10/10 03:27:54 INFO mapreduce.Job: The url to track the job: http://10.232.45.82:9046/proxy/application_1412889867251_0001/
14/10/10 03:27:54 INFO mapreduce.Job: Running job: job_1412889867251_0001
14/10/10 03:28:12 INFO mapreduce.Job:  map 0% reduce 0%
....
....
14/10/10 03:30:17 INFO mapreduce.Job:  map 100% reduce 100%
14/10/10 03:30:18 INFO mapreduce.Job: Job job_1412889867251_0001 completed successfully

Clearly, the MapReduce job copied the file successfully to S3 storage. Now, you need to verify if the file is stored encrypted within S3. To do so, use the S3 management console and check properties of file test1 within folder test in bucket htestbucket (Figure 8-17).

Figure 8-17. Verifying server-side encryption for MapReduce job

As you can see, the property Server Side Encryption is set to AES-256, meaning the MapReduce job from the EMR cluster successfully copied data to S3 storage with server-side encryption!

You can try other ways of invoking MapReduce jobs (e.g., Hive queries or Pig scripts) and write to S3 storage to verify that the stored data is indeed encrypted. You can also use S3DistCp to transfer data from your own local Hadoop cluster to Amazon S3 storage. Just make sure that you copy the AWS credentials in core-site.xml on all nodes within your local cluster and that the previously listed six .jar files are in the /lib subdirectory of your Hadoop install directory.

If you’d like to compare this implementation of encryption using AWS EMR with implementation of the more hands-on model A (in which you manage encryption and keys, plus you need to install specific software on EC2 instances for implementing encryption), remember you can download and review those steps from the Apress web site.

You’ve now seen both alternatives for providing encryption at rest with Hadoop (using Intel’s Hadoop distribution and using AWS). If you review carefully, you will realize that they do have commonalities in implementing encryption. Figure 8-18 summarizes the generic steps.

Figure 8-18. DataNode uses symmetric key (from client) to decrypt the data block and if successful, retrieves the data block. Respective DataNodes retrieve and pass subsequent data blocks back

Summary

Encryption at rest with Hadoop is still a work in progress, especially for the open source world. Perhaps when Hadoop is used more extensively in the corporate world, our options will improve. For now, you must turn to paid third-party solutions. The downside to these third-party solutions is that even though they claim to work with specific distributions, their claims are difficult to verify. Also, it is not clear how much custom code they add to your Hadoop install and what kind of performance you actually get for encryption/decryption. Last, these solutions are not developed or tested by trained cryptographers or cryptanalysts. So, there is no reliability or guarantee that they are (and will be) “unbreakable.”

Intel entered the Hadoop and encryption-at-rest arena with a lot of publicity and hype, but quickly backed off and invested in Cloudera instead. Now the future of Project Rhino and the possible integration of its code with Cloudera’s distribution doesn’t seem very clear. There are open source applications in bits and pieces, but a robust, integrated solution that can satisfy the practical encryption needs of a serious Hadoop practitioner doesn’t exist yet.

For now, let’s hope that this Hadoop area generates enough interest among users to drive more options in the future for implementing encryption using open source solutions.

Whatever the future holds, for the present, this is the last chapter. I sincerely hope this book has facilitated your understanding of Hadoop security options and helps you make your environment secure!
