CHAPTER 4


Open Source Authentication in Hadoop

In previous chapters, you learned what a secure system is and what Hadoop security lacks in comparison to a system the industry considers secure: Microsoft SQL Server (a relational database system). This chapter will focus on implementing some of the features of a secure system to secure your Hadoop cluster from all the Big Bad Wolves out there. Fine-tuning security is more art than science. There are no rules as to what is "just right" for an environment, but you can rely on some basic conventions to help you get closer, if not all the way to "just right." For example, because Hadoop is a distributed system and is mostly accessed using client software on a Windows PC, it makes sense to start by securing the client. Next, you can think about securing the Hadoop cluster by adding strong authentication, and so on.

Before you can measure success, however, you need a yardstick. In this case, you need a vision of the ideal Hadoop security setup. You’ll find the details in the next section.

Pieces of the Security Puzzle

Figure 4-1 diagrams an example of an extensive security setup for Hadoop. It starts with a secure client. The SSH protocol secures the client using key pairs; the server uses a public key, and the client uses a private key. This counters spoofing (intercepting and redirecting a connection to an attacker's system) as well as a hacked or compromised password. You'll delve deeper into the details of secure client setup in the upcoming "Establishing Secure Client Access" section. Before the Hadoop system allows access, it authenticates a client using Kerberos (an open-source network authentication protocol). You'll learn how to set up Kerberos and make it work with Hadoop in the section "Building Secure User Authentication."

Once a user is connected, the focus is on limiting permissions as per the user’s role. The user in Figure 4-1 has access to all user data except sensitive salary data. You can easily implement this by splitting the data into multiple files and assigning appropriate permissions to them. Chapter 5 focuses on these authorization issues and more.


Figure 4-1. Ideal Hadoop Security, with all the required pieces in place

You will also observe that inter-process communication between various Hadoop processes (e.g., between NameNode and DataNodes) is secure, which is essential for a distributed computing environment. Such an environment involves a lot of communication between various hosts, and unsecured data is open to various types of malicious attacks. The final section of this chapter explores how to secure or encrypt the inter-process traffic in Hadoop.

These are the main pieces of the Hadoop security puzzle. One piece that’s missing is encryption for data at rest, but you’ll learn more about that in Chapter 8.

Establishing Secure Client Access

Access to a Hadoop cluster starts at the client you use, so start by securing the client. Unsecured data is open to malicious attacks that can result in data being destroyed or stolen for unlawful use. This danger is greater for distributed systems (such as Hadoop) that have data blocks spread over a large number of nodes. A client is like a gateway to the actual data. You need to secure the gate before you can think about securing the house.

The SSH protocol (commonly implemented by OpenSSH) is used to secure a client by using a login/password or keys for access. Keys are preferable because a password can be compromised, hacked, or spoofed. For both Windows-based and Linux-based clients, PuTTY (www.chiark.greenend.org.uk/~sgtatham/putty) is an excellent open-source client that supports the SSH protocol. Besides being free, a major advantage of PuTTY is its ability to allow access using keys and a passphrase instead of a password (more on the benefits of this coming up). Assistance in countering spoofing is a less obvious, yet equally important, additional benefit of PuTTY that deserves your attention.

Countering Spoofing with PuTTY’s Host Keys

Spoofing, as you remember, is a technique used to extract your personal information (such as a password) for possible misuse by redirecting your connection to the attacker's computer instead of the one you think you are connected to, so that you send your password directly to the attacker's machine. Using this technique, attackers get access to your password, log in, and use your account for their own malicious purposes.

To counter spoofing, a unique code (called a host key) is allocated to each server. The way these keys are created, it’s not possible for a server to forge another server’s key. So if you connect to a server and it sends you a different host key (compared to what you were expecting), SSH (or a secure client like PuTTY that is using SSH) can warn you that you are connected to a different server—which could mean a spoofing attack is in progress!

PuTTY stores the host key (for servers you successfully connect to) via entries in the Windows Registry. Then, the next time you connect to a server to which you previously connected, PuTTY compares the host key presented by the server with the one stored in the registry from the last time. If it does not match, you will see a warning and then have a chance to abandon your connection before you provide a password or any other private information.

However, when you connect to a server for the first time, PuTTY has no way of checking if the host key is the right one or not. So it issues a warning that asks whether you want to trust this host key or not:

The server's host key is not cached in the registry. You
have no guarantee that the server is the computer you
think it is.
The server's rsa2 key fingerprint is:
ssh-rsa 1024 5c:d4:6f:b7:f8:e9:57:32:3d:a3:3f:cf:6b:47:2c:2a
If you trust this host, hit Yes to add the key to
PuTTY's cache and carry on connecting.
If you want to carry on connecting just once, without
adding the key to the cache, hit No.
If you do not trust this host, hit Cancel to abandon the
connection.

If the host is not known to you or you have any doubts about whether the host is the one you want to connect to, you can cancel the connection and avoid being a victim of spoofing.

Key-Based Authentication Using PuTTY

Suppose a super hacker gets into your network and gains access to the communication from your client to the server you wish to connect to. Suppose also that this hacker captures the host authentication string that the real host sends to your client and returns it as his own to get you to connect to his server instead of the real one. Now he can easily get your password and can use that to access sensitive data.

How can you stop such an attack? The answer is to use key-based authentication instead of a password. Without your private key, the hacker can't authenticate and won't be able to get access!

One way to implement keys for authentication is to use SSH, a protocol for communicating securely over a public, unsecured network. The security of communication relies on a key pair used for encryption and decryption of data. SSH can be used (or implemented) in several ways. You can automatically generate a public/private key pair to encrypt a network connection and then use password authentication to log on. Another way to use SSH is to generate a public/private key pair manually to perform the authentication, which allows users or programs to log in without specifying a password.

For Windows-based clients, you can generate the key pair using PuTTYgen, which is open source and freely available. Key pairs consist of a public key, which is copied to the server, and a private key, which is located on the secure client.

The private key can be used to generate a new signature. A signature generated with a private key cannot be forged by anyone who does not have that key. However, someone who has the corresponding public key can check if a particular signature is genuine.

When using a key pair for authentication, PuTTY generates a signature using your private key (specified using a key file). The server checks whether the signature is genuine (using your public key) and allows you to log in. If your client is being spoofed, all the attacker intercepts is a signature that can't be reused; your private key and password are never transmitted. Figure 4-2 illustrates the authentication process.


Figure 4-2. Key-based authentication using PuTTY

To set up key-based authentication using PuTTY, you must first select the type of key you want. For the example, I’ll use RSA and set up a key pair that you can use with a Hadoop cluster. To set up a key pair, open the PuTTY Key Generator (PuTTYgen.exe). At the bottom of the window, select the parameters before generating the keys. For example, to generate an RSA key for use with the SSH-2 protocol, select SSH-2 RSA under Type of key to generate. The value for Number of bits in a generated key determines the size or strength of the key. For this example, 1024 is sufficient, but in a real-world scenario, you might need a longer key such as 2048 for better security. One important thing to remember is that a longer key is more secure, but the encryption/decryption processing time increases with the key length. Enter a key passphrase (to encrypt your private key for protection) and make a note of it since you will need to use it later for decryption.

Image Note  The most common public-key algorithms available for use with PuTTY are RSA and DSA. PuTTY developers strongly recommend you use RSA; DSA (also known as DSS, the United States’ federal Digital Signature Standard) has an intrinsic weakness that enables easy creation of a signature containing enough information to give away the private key. (To better understand why RSA is almost impossible to break, see Chapter 8.)

Next, click the Generate button. In response, PuTTYgen asks you to move the mouse around to generate randomness (that’s the PuTTYgen developers having fun with us!). Move the mouse in circles over the blank area in the Key window; the progress bar will gradually fill as PuTTYgen collects enough randomness and keys are generated as shown in Figure 4-3.


Figure 4-3. Generating a key pair for implementing secure client

Once the keys are generated, click the Save public key and Save private key buttons to save the keys.

Next, you need to copy the public key to the file authorized_keys, located in the .ssh directory under your home directory on the server you are trying to connect to. For that purpose, refer to the section "Public key for pasting into OpenSSH authorized_keys file" shown in Figure 4-3. Move your cursor to that section and copy all the text (as shown). Then, open a PuTTY session and connect using your login and password. Change to the .ssh directory and open the authorized_keys file using the editor of your choice. Paste the text of the public key that you created with PuTTYgen into the file, and save the file (Figure 4-4).
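If you prefer the command line to editing the file by hand, the same result can be achieved with a few shell commands on the server. This is a minimal sketch; it assumes the public-key text copied from PuTTYgen has been saved on the server in a file named putty_key.pub (a hypothetical name), and it sets the restrictive permissions most SSH servers expect:

mkdir -p ~/.ssh                               # create the .ssh directory if it does not exist
chmod 700 ~/.ssh                              # restrict it to the owner
cat putty_key.pub >> ~/.ssh/authorized_keys   # append the single-line public key
chmod 600 ~/.ssh/authorized_keys              # many sshd configurations reject a group/world-readable file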


Figure 4-4. Pasting the public key in authorized_keys file

Using Passphrases

What happens if someone gets access to your computer? They can generate signatures just as you would and can then easily connect to your Hadoop cluster using your credentials! This can, of course, be avoided by using a passphrase of your choice to encrypt your private key before storing it on your local machine. To generate a signature, PuTTY then has to decrypt the key, which requires your passphrase, thereby preventing any unauthorized access.

Now, the need to type a passphrase whenever you log in can be inconvenient. So, PuTTY provides Pageant, an authentication agent that holds decrypted private keys and uses them to generate signatures as requested. All you need to do is start Pageant, load your private key, and enter your passphrase once. You can then invoke PuTTY any number of times; Pageant will generate the signatures automatically. This arrangement works until you restart your Windows client. Another nice feature of Pageant is that it holds your decrypted private key only in memory and never writes it to your local disk, even when it shuts down.
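For Linux-based clients that use OpenSSH rather than PuTTY, the ssh-agent utility plays the same role as Pageant. A minimal sketch (assuming your OpenSSH private key is in the default location ~/.ssh/id_rsa; note that PuTTY's .ppk files must first be exported to OpenSSH format via PuTTYgen's Conversions menu):

eval "$(ssh-agent -s)"    # start the agent and set its environment variables for this shell
ssh-add ~/.ssh/id_rsa     # prompts for the passphrase once, then caches the decrypted key in memory
ssh-add -l                # list the keys the agent currently holds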

So, as a last step, configure your PuTTY client to use the private key file instead of a password for authentication (Figure 4-5). Click the + next to the SSH option to expand it, and then click the Auth (authentication) option under it. Browse to and select the private key file you saved earlier (generated through PuTTYgen). Click Open to open a new session.


Figure 4-5. Configuration options for private key authentication with PuTTY

Now you are ready to be authenticated by the server using login and passphrase as shown in Figure 4-6. Enter the login name at the login prompt (root in this case) and enter the passphrase to connect!


Figure 4-6. Secure authentication using login and a passphrase

In some situations (e.g., scheduled batch processing), it is impossible to type the passphrase interactively; at those times, you can start Pageant and load your private key into it by typing your passphrase once. Please refer to Appendix A for an example of Pageant use and implementation, and to Appendix B for PuTTY implementation for Linux-based clients.

Building Secure User Authentication

A secure client connection is vital, but that's only a good starting point. You also need to secure your Hadoop cluster when this secure client connects to it. The user security process starts with authenticating a user. Although Hadoop itself has no built-in means of authenticating a user, all the major Hadoop distributions currently ship with support for Kerberos, and Kerberos provides the authentication.

With earlier versions of Hadoop, when a user tried to access a Hadoop cluster, Hadoop simply checked the ACL to ensure that the underlying OS user was allowed access, and then provided this access. This was not a very secure option, nor did it limit access for a user (since a user could easily impersonate the Hadoop superuser). The user then had access to all the data within a Hadoop cluster and could modify or delete it if desired. Therefore, you need to configure Kerberos or another similar application to authenticate a user before allowing access to data—and then, of course, limit that access, too!

Kerberos is one of the most popular options used with Hadoop for authentication. Developed at MIT, Kerberos has been around since the 1980s and has been enhanced multiple times. The current version, Kerberos version 5, was designed in 1993 and is freely available as an open-source download. In this section you'll learn how Kerberos works, what its main components are, and how to install it. After installation, I will discuss a simple Kerberos implementation for Hadoop.

Kerberos Overview

Kerberos is an authentication protocol for "trusted hosts on untrusted networks." This simply means that Kerberos assumes the hosts it communicates with can be trusted, that no spoofing is involved, and that its secret keys are not compromised. To use Kerberos more effectively, consider a few other key facts:

  • Kerberos depends continuously on a central server. If the central server is unavailable, no one can log in. It is possible to use multiple "central" servers (to reduce the risk) or additional authentication mechanisms (as a fallback).
  • Kerberos is heavily time dependent, so the clocks of all the governed hosts must be synchronized within configured limits (5 minutes by default). Most of the time, Network Time Protocol (NTP) daemons keep the clocks of the governed hosts synchronized (see the example following this list).
  • Kerberos offers a single sign-on approach. A client needs to provide a password only once per session and can then transparently access all authorized services.
  • Passwords should not be saved on clients or any intermediate application servers. Kerberos stores them centrally, without any redundancy.
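As an illustration of the clock-synchronization point in the list above, the following commands (a sketch for the RHEL/CentOS vintage used in this chapter; other distributions use different tools) check a host's offset against a public NTP pool and ensure the NTP daemon is running:

ntpdate -q 0.pool.ntp.org    # query the time offset without changing the clock
service ntpd status          # confirm the NTP daemon is running
chkconfig ntpd on            # make sure it starts at boot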

Figure 4-7 provides an overview of Kerberos authentication architecture. As shown, the Authentication Server and Ticket Granting Server are major components of the Kerberos key distribution center.


Figure 4-7. Kerberos key distribution center with its main components (TGT = Ticket Granting Ticket)

A client requests access to a Kerberos-enabled service using Kerberos client libraries. The Kerberos client contacts the Key Distribution Center, or KDC (the central Kerberos server that hosts the credential database), and requests access. If the provided credentials are valid, the KDC provides the requested access. The KDC uses an internal database for storing credentials, along with two main components: the Authentication Server (AS) and the Ticket Granting Server (TGS).

Authentication

The Kerberos authentication process contains three main steps:

  1. The AS grants the user (and host) a Ticket Granting Ticket (TGT) as an authentication token. A TGT is valid for a specific time only (validity is configured by the administrator through the configuration file). When service principals (logins used to run services or background processes) request a TGT, their credentials are supplied to the AS through special files called keytabs.
  2. The client uses its credentials to decrypt the TGT and then uses the TGT to get a service ticket from the Ticket Granting Server to access a "kerberized" service. A client can use the same TGT for multiple TGS requests (until the TGT expires). See the kinit/klist sketch following this list.
  3. The user (and host) uses the service ticket to authenticate and access a specific Kerberos-enabled service.
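To make these steps concrete, here is what the flow looks like from a user's shell once Kerberos is running (a sketch using this chapter's EXAMPLE.COM realm and the example user alex):

kinit alex@EXAMPLE.COM    # step 1: authenticate to the AS and obtain a TGT (prompts for the password)
klist                     # the ticket cache now shows krbtgt/EXAMPLE.COM@EXAMPLE.COM, i.e., the TGT
                          # steps 2 and 3 happen transparently: a kerberized client uses the cached TGT
                          # to request a service ticket from the TGS and presents it to the service
kdestroy                  # discard cached tickets when done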

Important Terms

To fully understand Kerberos, you need to speak its language of realms, principals, tickets, and databases. In this chapter's example implementation, Kerberos is set up on a single-node cluster called pract_hdp_sec, using a virtual domain or realm called EXAMPLE.COM.

The term realm indicates an administrative domain (similar to a Windows domain) used for authentication. Its purpose is to establish the virtual boundary for use by an AS to authenticate a user, host, or service. This does not mean that the authentication between a user and a service forces them to be in the same realm! If the two objects belong to different realms but have a trust relationship between them, then the authentication can still proceed (called cross-authentication). For our implementation, I have created a single realm called EXAMPLE.COM (note that by convention a realm typically uses capital letters).

A principal is a user, host, or service associated with a realm and stored as an entry in the AS database, typically located on the KDC. A principal in Kerberos 5 is defined using the following format: Name[/Instance]@REALM. Common usage for users is username@REALM or username/role@REALM (e.g., alex/admin@REALM and alex@REALM are two different principals that might be defined). For service principals, the common format is service/hostname@REALM (e.g., hdfs/host1.myco.com). Note that Hadoop expects a specific format for its service principals. For our implementation, I have defined principals such as hdfs/pract_hdp_sec@EXAMPLE.COM (hdfs for NameNode and DataNode), mapred/pract_hdp_sec@EXAMPLE.COM (mapred for JobTracker and TaskTracker), and so on.

A ticket is a token generated by the AS when a client requests authentication. Information in a ticket includes: the requesting user’s principal (generally the username), the principal of the service it is intended for, the client’s IP address, validity date and time (in timestamp format), ticket's maximum lifetime, and session key (this has a fundamental role). Each ticket expires, generally after 24 hours, though this is configurable for a given Kerberos installation.

In addition, tickets may be renewed by user request until a configurable time period from issuance (e.g., 7 days from issue). Users either explicitly use the Kerberos client to obtain a ticket or are provided one automatically if the system administrator has configured the login client (e.g., SSH) to obtain the ticket automatically on login. Services typically use a keytab file (a protected file having the services’ password contained within) to run background threads that obtain and renew the TGT for the service as needed. All Hadoop services will need a keytab file placed on their respective hosts, with the location of this file being defined in the service site XML.
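For example, a service (or an administrator troubleshooting one) can obtain a TGT non-interactively from its keytab; a sketch using the principal and keytab location set up later in this chapter:

kinit -kt /etc/hadoop/conf/hdfs.keytab hdfs/pract_hdp_sec@EXAMPLE.COM   # no password prompt; the key comes from the keytab
klist                                                                   # verify that the TGT was obtained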

Kerberos uses an encrypted database to store all the principal entries associated with users and services. Each entry contains the following information: principal name, encryption key, maximum validity for a ticket associated with a principal, maximum renewal time for a ticket associated with a principal, password expiration date, and expiration date of the principal (after which no tickets will be issued).

There are further details associated with Kerberos architecture, but because this chapter focuses on installing and configuring Kerberos for Hadoop, basic understanding of Kerberos architecture will suffice for our purposes. So let’s start with Kerberos installation.

Installing and Configuring Kerberos

The first step for installing Kerberos is to install all the Kerberos services for your new KDC. For Red Hat Enterprise Linux (RHEL) or CentOS operating systems, use this command:

yum install krb5-server krb5-libs krb5-auth-dialog krb5-workstation
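If your KDC host runs a Debian-based distribution (such as Ubuntu) rather than RHEL or CentOS, the equivalent packages are installed with apt-get (package names can vary slightly between releases):

apt-get install krb5-kdc krb5-admin-server krb5-user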

When the server is installed, you must edit the two main configuration files, located by default at the following paths (if not, use the Linux find utility to locate them):

  • /etc/krb5.conf
  • /var/kerberos/krb5kdc/kdc.conf

The next phase is to specify your realm (EXAMPLE.COM for the example) and to change the KDC value to the name of the fully qualified Kerberos server host (here, pract_hdp_sec). You must also copy the updated version of /etc/krb5.conf to every node in your cluster. Here is /etc/krb5.conf for our example:

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = EXAMPLE.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[kdc]
profile = /var/kerberos/krb5kdc/kdc.conf

[realms]
 EXAMPLE.COM = {
  kdc = pract_hdp_sec
  admin_server = pract_hdp_sec
 }

[domain_realm]
 .example.com = EXAMPLE.COM
 example.com = EXAMPLE.COM

Please observe the changed values for the realm name and KDC name. The example tickets will be valid for up to 24 hours after creation, so ticket_lifetime is set to 24h. After 7 days those tickets can be renewed, because renew_lifetime is set to 7d. Following is the /var/kerberos/krb5kdc/kdc.conf I am using:

[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 EXAMPLE.COM = {
  profile = /etc/krb5.conf
  supported_enctypes = aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
  allow-null-ticket-addresses = true
  database_name = /var/kerberos/krb5kdc/principal
  #master_key_type = aes256-cts
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  dict_file = /usr/share/dict/words
  max_life = 2d 0h 0m 0s
  max_renewable_life = 7d 0h 0m 0s
  admin_database_lockfile = /var/kerberos/krb5kdc/kadm5_adb.lock
  key_stash_file = /var/kerberos/krb5kdc/.k5stash
  kdc_ports = 88
  kadmind_port = 749
  default_principal_flags = +renewable
 }

Included in the settings for realm EXAMPLE.COM, the acl_file parameter specifies the ACL (file /var/kerberos/krb5kdc/kadm5.acl in RHEL or CentOS) used to define the principals that have admin (modifying) access to the Kerberos database. The file can be as simple as a single entry:

*/admin@EXAMPLE.COM *

This entry specifies that all principals with the /admin instance extension have full access to the database. The Kerberos service kadmin needs to be restarted for the change to take effect.
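For example, on RHEL or CentOS:

/sbin/service kadmin restart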

Also, observe that the max_life (maximum ticket life) setting is 2d (2 days) for the realm EXAMPLE.COM. You can override configuration settings for specific realms. You can also specify these values for a principal.

Note in the [realms] section of the preceding code that I have disabled 256-bit encryption. If you want to use 256-bit encryption, you must download the Java Cryptography Extension (JCE) and follow the instructions to install it on any node running Java processes using Kerberos (for Hadoop, all cluster nodes). If you want to skip this and just use 128-bit encryption, remove the line #master_key_type = aes256-cts and remove the references to aes-256 before the generation of your KDC master key, as described  in the section “Creating a Database.”

This concludes installing and setting up Kerberos. Please note that it’s not possible to cover all the possible options (operating systems, versions, etc.) and nuances of Kerberos installation in a single section. For a more extensive discussion of Kerberos installation, please refer to MIT’s Kerberos installation guide at http://web.mit.edu/kerberos/krb5-1.6/krb5-1.6/doc/krb5-install.html. O’Reilly’s Kerberos: The Definitive Guide is also a good reference.

Getting back to Kerberos implementation, let me create a database and set up principals (for use with Hadoop).

Preparing for Kerberos Implementation

Kerberos uses an internal database (stored as a file) to save details of the principals that are set up for use. This database contains users (principals) and their private keys. Principals include internal users that Kerberos uses as well as those you define. The database file is stored at the location defined in the configuration file kdc.conf; for this example, /var/kerberos/krb5kdc/principal.

Creating a Database

To set up a database, use the utility kdb5_util:

kdb5_util create -r EXAMPLE.COM -s

You will see a response like:

Loading random data
Initializing database '/var/kerberos/krb5kdc/principal' for realm 'EXAMPLE.COM',
master key name 'K/M@EXAMPLE.COM'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key:
Re-enter KDC database master key to verify:

Please make a note of the master key. Also, note that the -s option saves the master server key for the database in a stash file (defined using the parameter key_stash_file in kdc.conf). If the stash file doesn't exist, you need to log into the KDC with the master password (specified during installation) each time it starts. This will automatically regenerate the master server key.

Now that the database is created, create the first user principal. This must be done on the KDC server itself, while you are logged in as root:

/usr/sbin/kadmin.local -q "addprinc root/admin"

You will be prompted for a password. Please make a note of the password for the principal root/admin@EXAMPLE.COM. You can create other principals later; now, it's time to start Kerberos. To do so for RHEL or CentOS operating systems, issue the following commands to start the Kerberos services (for other operating systems, please refer to the appropriate command reference):

/sbin/service kadmin start
/sbin/service krb5kdc start

Creating Service Principals

Next, I will create service principals for use with Hadoop using the kadmin utility. Principal name hdfs will be used for HDFS; mapred will be used for MapReduce, HTTP for HTTP, and yarn for YARN-related services (in this code, kadmin: is the prompt; commands are in bold):

[root@pract_hdp_sec]# kadmin
Authenticating as principal root/admin@EXAMPLE.COM with password.
Password for root/admin@EXAMPLE.COM:
kadmin:  addprinc -randkey hdfs/pract_hdp_sec@EXAMPLE.COM
Principal "hdfs/pract_hdp_sec@EXAMPLE.COM" created.
kadmin:  addprinc -randkey mapred/pract_hdp_sec@EXAMPLE.COM
Principal "mapred/pract_hdp_sec@EXAMPLE.COM" created.
kadmin:  addprinc -randkey HTTP/pract_hdp_sec@EXAMPLE.COM
Principal "HTTP/pract_hdp_sec@EXAMPLE.COM" created.
kadmin:  addprinc -randkey yarn/pract_hdp_sec@EXAMPLE.COM
Principal "yarn/pract_hdp_sec@EXAMPLE.COM" created.
kadmin:
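You can quickly verify that the principals were created by listing the contents of the Kerberos database (run as root on the KDC); the output should include the hdfs, mapred, HTTP, and yarn principals just added:

/usr/sbin/kadmin.local -q "listprincs"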

Creating Keytab Files

Keytab files are used for authenticating services non-interactively. Because services may be scheduled to run remotely or at specific times, with no user present to type a password, their authentication information is saved in a file so that it can be verified against the Kerberos internal database. Keytab files are used for this purpose.

Getting back to file creation, extract the related keytab file (using kadmin) and place it in the keytab directory (/etc/security/keytabs) of the respective components (kadmin: is the prompt; commands are in bold):

[root@pract_hdp_sec]# kadmin
Authenticating as principal root/admin@EXAMPLE.COM with password.
Password for root/admin@EXAMPLE.COM:
kadmin: xst -k hdfs.keytab hdfs/pract_hdp_sec@EXAMPLE.COM HTTP/pract_hdp_sec@EXAMPLE.COM
Entry for principal hdfs/pract_hdp_sec@EXAMPLE.COM with kvno 5, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:hdfs.keytab.
Entry for principal hdfs/pract_hdp_sec@EXAMPLE.COM with kvno 5, encryption type des3-cbc-sha1 added to keytab WRFILE:hdfs.keytab.
Entry for principal hdfs/pract_hdp_sec@EXAMPLE.COM with kvno 5, encryption type arcfour-hmac added to keytab WRFILE:hdfs.keytab.
Entry for principal hdfs/pract_hdp_sec@EXAMPLE.COM with kvno 5, encryption type des-hmac-sha1 added to keytab WRFILE:hdfs.keytab.
Entry for principal hdfs/pract_hdp_sec@EXAMPLE.COM with kvno 5, encryption type des-cbc-md5 added to keytab WRFILE:hdfs.keytab.
Entry for principal HTTP/pract_hdp_sec@EXAMPLE.COM with kvno 4, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:hdfs.keytab.
Entry for principal HTTP/pract_hdp_sec@EXAMPLE.COM with kvno 4, encryption type des3-cbc-sha1 added to keytab WRFILE:hdfs.keytab.
Entry for principal HTTP/pract_hdp_sec@EXAMPLE.COM with kvno 4, encryption type arcfour-hmac added to keytab WRFILE:hdfs.keytab.
Entry for principal HTTP/pract_hdp_sec@EXAMPLE.COM with kvno 4, encryption type des-hmac-sha1 added to keytab WRFILE:hdfs.keytab.
Entry for principal HTTP/pract_hdp_sec@EXAMPLE.COM with kvno 4, encryption type des-cbc-md5 added to keytab WRFILE:hdfs.keytab.

Please observe that key entries for all types of supported encryption (defined in configuration file kdc.conf as parameter supported_enctypes) are added to the keytab file for the principals.

Getting back to keytab creation, create keytab files for the other principals (at the kadmin prompt) as follows:

kadmin: xst -k mapred.keytab mapred/pract_hdp_sec@EXAMPLE.COM HTTP/pract_hdp_sec@EXAMPLE.COM
kadmin: xst -k yarn.keytab yarn/pract_hdp_sec@EXAMPLE.COM HTTP/pract_hdp_sec@EXAMPLE.COM

You can verify that the correct keytab files and principals are associated with the correct service using the klist command. For example, on the NameNode:

[root@pract_hdp_sec]# klist -kt hdfs.keytab
Keytab name: FILE:hdfs.keytab
KVNO Timestamp         Principal
---- ----------------- --------------------------------------------------------
   5 10/18/14 12:42:21 hdfs/pract_hdp_sec@EXAMPLE.COM
   5 10/18/14 12:42:21 hdfs/pract_hdp_sec@EXAMPLE.COM
   5 10/18/14 12:42:21 hdfs/pract_hdp_sec@EXAMPLE.COM
   5 10/18/14 12:42:21 hdfs/pract_hdp_sec@EXAMPLE.COM
   5 10/18/14 12:42:21 hdfs/pract_hdp_sec@EXAMPLE.COM
   4 10/18/14 12:42:21 HTTP/pract_hdp_sec@EXAMPLE.COM
   4 10/18/14 12:42:21 HTTP/pract_hdp_sec@EXAMPLE.COM
   4 10/18/14 12:42:21 HTTP/pract_hdp_sec@EXAMPLE.COM
   4 10/18/14 12:42:21 HTTP/pract_hdp_sec@EXAMPLE.COM
   4 10/18/14 12:42:21 HTTP/pract_hdp_sec@EXAMPLE.COM

So far, you have defined principals and extracted keytab files for HDFS, MapReduce, and YARN-related principals only. You will need to follow the same process and define principals for any other component services running on your Hadoop cluster such as Hive, HBase, Oozie, and so on. Note that the principals for web communication must be named HTTP as web-based protocol implementations for using Kerberos require this naming.

For deploying the keytab files to slave nodes, please copy (or move if newly created) the keytab files to the /etc/hadoop/conf folder. You need to secure the keytab files (only the owner can see this file). So, you need to change the owner to the service username accessing the keytab (e.g., if the HDFS process runs as user hdfs, then user hdfs should own the keytab file) and set file permission 400. Please remember, the service principals for hdfs, mapred, and http have a FQDN (fully qualified domain name) associated with the username. Also, service principals are host specific and unique for each node.

[root@pract_hdp_sec]# sudo mv hdfs.keytab mapred.keytab /etc/hadoop/conf/
[root@pract_hdp_sec]# sudo chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab
[root@pract_hdp_sec]# sudo chown mapred:hadoop /etc/hadoop/conf/mapred.keytab
[root@pract_hdp_sec]# sudo chmod 400 /etc/hadoop/conf/hdfs.keytab
[root@pract_hdp_sec]# sudo chmod 400 /etc/hadoop/conf/mapred.keytab

Implementing Kerberos for Hadoop

So far, I have installed and configured Kerberos and also created the database, principals, and keytab files. So, what's the next step for using this authentication with Hadoop? Well, I need to add the Kerberos setup information to the relevant Hadoop configuration files and also map the Kerberos principals set up earlier to operating system users (since operating system users will be used to actually run the Hadoop services). I will also assume that a Hadoop cluster in non-secured mode is configured and available. To summarize, configuring Hadoop for Kerberos will be achieved in two stages:

  • Mapping service principals to their OS usernames
  • Adding information to various Hadoop configuration files

Mapping Service Principals to Their OS Usernames

Rules are used to map service principals to their respective OS usernames. These rules are specified in the Hadoop configuration file core-site.xml as the value for the optional key hadoop.security.auth_to_local.

The default rule is simply named DEFAULT. It translates all principals in your default domain to their first component. For example, hdfs@EXAMPLE.COM and hdfs/pract_hdp_sec@EXAMPLE.COM both become hdfs, assuming your default domain or realm is EXAMPLE.COM. So if the service principal and the OS username are the same, the default rule is sufficient. If the two names are not identical, you have to create rules to do the mapping.

Each rule is divided into three parts: base, filter, and substitution. The base begins by specifying the number of components in the principal name (excluding the realm), followed by a colon and the pattern for building the username from the sections of the principal name. In the pattern section, $0 translates to the realm, $1 to the first component, and $2 to the second component. So, for example, [2:$1] translates hdfs/pract_hdp_sec@EXAMPLE.COM to hdfs.

The filter consists of a regular expression in parentheses that must match the generated string for the rule to apply. For example, (.*@EXAMPLE.COM) matches any string that ends in @EXAMPLE.COM.

The substitution is a sed (popular Linux stream editor) rule that translates a regular expression into a fixed string. For example: s/@[A-Z]*.COM// removes the first instance of @ followed by an uppercase alphabetic name, followed by .COM.

In my case, I am using the OS user hdfs to run the NameNode and DataNode services. So, if I had created the Kerberos principals nn/pract_hdp_sec@EXAMPLE.COM and dn/pract_hdp_sec@EXAMPLE.COM for use with Hadoop, then I would need to map these principals to the OS user hdfs. The rule for this purpose would be:

RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/hdfs/
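In core-site.xml, such rules are listed (one per line) in the value of hadoop.security.auth_to_local, usually followed by DEFAULT as a catch-all. A sketch based on the rule above:

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/hdfs/
    DEFAULT
  </value>
</property>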

Adding Information to Various Hadoop Configuration Files

To enable Kerberos to work with HDFS, you need to modify two configuration files:

  • core-site.xml
  • hdfs-site.xml

Table 4-1 shows modifications to properties within core-site.xml. Please remember to propagate these changes to all the hosts in your cluster.

Table 4-1. Modifications to Properties in Hadoop Configuration File core-site.xml

hadoop.security.authentication = kerberos
  Sets the authentication type for the cluster. Valid values are simple (the default) and kerberos.

hadoop.security.authorization = true
  Enables authorization for the different protocols.

hadoop.security.auth_to_local = [2:$1] DEFAULT
  The mapping from Kerberos principal names to local OS usernames, using the mapping rules described earlier.

hadoop.rpc.protection = privacy
  Possible values are authentication, integrity, and privacy. authentication = mutual client/server authentication; integrity = authentication plus a guarantee of the integrity of data exchanged between client and server; privacy = authentication, integrity, and confidentiality (data exchanged between client and server is encrypted).
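Expressed as XML, the entries in Table 4-1 correspond roughly to the following core-site.xml fragment (the hadoop.security.auth_to_local rules were shown in the previous section):

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>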

The hdfs-site.xml configuration file specifies the keytab locations as well as principal names for various HDFS daemons. Please remember, hdfs and http principals are specific to a particular node.

A Hadoop cluster may contain a large number of DataNodes, and it would be virtually impossible to configure the principals manually for each of them. Therefore, Hadoop provides a _HOST variable that resolves to the fully qualified domain name of the local host at runtime. This variable allows the site XML to remain consistent throughout the cluster. However, please note that the _HOST variable can't be used with all the Hadoop configuration files. For example, the jaas.conf file used by ZooKeeper (which provides resource synchronization across cluster nodes and can be used by applications to ensure that tasks across the cluster are serialized or synchronized) and Hive doesn't support the _HOST variable. Table 4-2 shows modifications to properties within hdfs-site.xml, some of which use the _HOST variable. Please remember to propagate these changes to all the hosts in your cluster.

Table 4-2. Modified Properties for Hadoop Configuration File hdfs-site.xml

dfs.block.access.token.enable = true
  If true, access tokens are used for accessing DataNodes.

dfs.namenode.kerberos.principal = hdfs/_HOST@EXAMPLE.COM
  Kerberos principal name for the NameNode.

dfs.secondary.namenode.kerberos.principal = hdfs/_HOST@EXAMPLE.COM
  Kerberos principal name for the Secondary NameNode.

*dfs.secondary.https.port = 50490
  The https port to which the Secondary NameNode binds.

dfs.web.authentication.kerberos.principal = HTTP/_HOST@EXAMPLE.COM
  The http Kerberos principal used by Hadoop.

dfs.namenode.kerberos.internal.spnego.principal = HTTP/_HOST@EXAMPLE.COM
  The http principal for the HTTP service on the NameNode.

dfs.secondary.namenode.kerberos.internal.spnego.principal = HTTP/_HOST@EXAMPLE.COM
  The http principal for the HTTP service on the Secondary NameNode.

*dfs.secondary.http.address = 192.168.142.135:50090
  IP address of your Secondary NameNode host, with port 50090.

dfs.web.authentication.kerberos.keytab = /etc/hadoop/conf/spnego.service.keytab
  Kerberos keytab file with credentials for the http principal.

dfs.datanode.kerberos.principal = hdfs/_HOST@EXAMPLE.COM
  The Kerberos principal under which the DataNode runs.

dfs.namenode.keytab.file = /etc/hadoop/conf/hdfs.keytab
  Keytab file containing the NameNode service and host principals.

dfs.secondary.namenode.keytab.file = /etc/hadoop/conf/hdfs.keytab
  Keytab file containing the Secondary NameNode service and host principals.

dfs.datanode.keytab.file = /etc/hadoop/conf/hdfs.keytab
  Keytab file for the DataNode.

*dfs.https.port = 50470
  The https port to which the NameNode binds.

*dfs.https.address = 192.168.142.135:50470
  The https address for the NameNode (IP address of the host plus port 50470).

dfs.datanode.address = 0.0.0.0:1019
  The DataNode server address and port for data transfer.

dfs.datanode.http.address = 0.0.0.0:1022
  The DataNode http server address and port.

*These values may change for your cluster
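To illustrate how the _HOST variable and keytab settings in Table 4-2 appear in practice, here is a short hdfs-site.xml fragment covering just the NameNode-related entries (a sketch; values are from this chapter's example):

<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>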

The files core-site.xml and hdfs-site.xml are included as downloads for your reference. They also contain Kerberos-related properties set up for other components such as Hive, Oozie, and HBase.

MapReduce-Related Configurations

For MapReduce (version 1), the mapred-site.xml file needs to be configured to work with Kerberos. It needs to specify the keytab file locations as well as principal names for the JobTracker and TaskTracker daemons. Use Table 4-3 as a guide, and remember that mapred principals are specific to a particular node.

Table 4-3. mapred Principals

mapreduce.jobtracker.kerberos.principal = mapred/_HOST@EXAMPLE.COM
  mapred principal used to start the JobTracker daemon.

mapreduce.jobtracker.keytab.file = /etc/hadoop/conf/mapred.keytab
  Location of the keytab file for the mapred user.

mapreduce.tasktracker.kerberos.principal = mapred/_HOST@EXAMPLE.COM
  mapred principal used to start the TaskTracker daemon.

mapreduce.tasktracker.keytab.file = /etc/hadoop/conf/mapred.keytab
  Location of the keytab file for the mapred user.

mapred.task.tracker.task-controller = org.apache.hadoop.mapred.LinuxTaskController
  TaskController class used to launch the child JVM.

mapreduce.tasktracker.group = mapred
  Group for running the TaskTracker.

mapreduce.jobhistory.keytab = /etc/hadoop/conf/mapred.keytab
  Location of the keytab file for the mapred user.

mapreduce.jobhistory.principal = mapred/_HOST@EXAMPLE.COM
  mapred principal used to start the JobHistory daemon.

For YARN, the yarn-site.xml file needs to be configured for specifying the keytab and principal details; Table 4-4 holds the details.

Table 4-4. YARN Principals

yarn.resourcemanager.principal = yarn/_HOST@EXAMPLE.COM
  yarn principal used to start the ResourceManager daemon.

yarn.resourcemanager.keytab = /etc/hadoop/conf/yarn.keytab
  Location of the keytab file for the yarn user.

yarn.nodemanager.principal = yarn/_HOST@EXAMPLE.COM
  yarn principal used to start the NodeManager daemon.

yarn.nodemanager.keytab = /etc/hadoop/conf/yarn.keytab
  Location of the keytab file for the yarn user.

yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
  Executor class for launching applications in YARN.

yarn.nodemanager.linux-container-executor.group = yarn
  Group for executing Linux containers.

For MapReduce (version 1), the TaskController class defines how Map or Reduce tasks are launched and controlled, and it uses a configuration file called task-controller.cfg. This configuration file is present in the Hadoop configuration folder (/etc/hadoop/conf/) and should have the configurations listed in Table 4-5.

Table 4-5. TaskController Configurations

hadoop.log.dir = /var/log/hadoop-0.20-mapreduce
  Hadoop log directory (will vary as per your Hadoop distribution). This location is used to make sure that proper permissions exist for writing to log files.

mapreduce.tasktracker.group = mapred
  Group that the TaskTracker belongs to.

banned.users = mapred, hdfs, and bin
  Users who should be prevented from running MapReduce.

min.user.id = 1000
  User ID above which MapReduce tasks are allowed to run.

Here’s a sample task-controller.cfg:

hadoop.log.dir=/var/log/hadoop-0.20-mapreduce/
mapred.local.dir=/opt/hadoop/hdfs/mapred/local
mapreduce.tasktracker.group=mapred
banned.users=mapred,hdfs,bin
min.user.id=500

Please note that the value for min.user.id may change depending on the operating system. Some of the operating systems use a value of 0 instead of 500.

For YARN, you need to define container-executor.cfg with the configurations in Table 4-6.

Table 4-6. YARN container-executor.cfg Configurations

yarn.nodemanager.log-dirs = /var/log/yarn
  Hadoop log directory (will vary as per your Hadoop distribution). This location is used to make sure that proper permissions exist for writing to log files.

yarn.nodemanager.linux-container-executor.group = yarn
  Group that the container belongs to.

banned.users = hdfs, yarn, mapred, and bin
  Users who should be prevented from running MapReduce.

min.user.id = 1000
  User ID above which MapReduce tasks are allowed to run.

As a last step, you have to set the following variables on all DataNodes in file /etc/default/hadoop-hdfs-datanode. These variables provide necessary information to Jsvc, a set of libraries and applications for making Java applications run on Unix more easily, so it can run the DataNode in secure mode.

export HADOOP_SECURE_DN_USER=hdfs
export HADOOP_SECURE_DN_PID_DIR=/var/lib/hadoop-hdfs
export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop-hdfs
export JSVC_HOME=/usr/lib/bigtop-utils/

If the directory /usr/lib/bigtop-utils doesn't exist, set the JSVC_HOME variable to /usr/libexec/bigtop-utils, as follows:

export JSVC_HOME=/usr/libexec/bigtop-utils

So, finally, having installed, configured, and implemented Kerberos and modified various Hadoop configuration files (with Kerberos implementation information), you are ready to start NameNode and DataNode services with authentication!

Starting Hadoop Services with Authentication

Start the NameNode first. Execute the following command as root and substitute the correct path (to where your Hadoop startup scripts are located):

su -l hdfs -c "export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode";

After the NameNode starts, you can see Kerberos-related messages in the NameNode log file indicating successful authentication (for the principals hdfs and HTTP) using keytab files:

2013-12-10 14:47:22,605 INFO  security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(844)) - Login successful for user hdfs/pract_hdp_sec@EXAMPLE.COM using keytab file /etc/hadoop/conf/hdfs.keytab

2013-12-10 14:47:24,288 INFO  server.KerberosAuthenticationHandler (KerberosAuthenticationHandler.java:init(185)) - Login using keytab /etc/hadoop/conf/hdfs.keytab, for principal HTTP/pract_hdp_sec@EXAMPLE.COM

Now start the DataNode. Execute the following command as root and substitute the correct path (to where your Hadoop startup scripts are located):

su -l hdfs -c  "export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode"

After the DataNode starts, you can see the following Kerberos-related messages in the DataNode log file indicating successful authentication (for principal hdfs) using keytab file:

2013-12-08 10:34:33,791 INFO security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(844)) - Login successful for user hdfs/pract_hdp_sec@EXAMPLE.COM using keytab file /etc/hadoop/conf/hdfs.keytab

2013-12-08 10:34:34,587 INFO  http.HttpServer (HttpServer.java:addGlobalFilter(525)) - Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)

2013-12-08 10:34:35,502 INFO  datanode.DataNode (BlockPoolManager.java:doRefreshNamenodes(193)) - Starting BPOfferServices for nameservices: <default>

2013-12-08 10:34:35,554 INFO  datanode.DataNode (BPServiceActor.java:run(658)) - Block pool <registering> (storage id unknown) service to pract_hdp_sec/192.168.142.135:8020 starting to offer service

Last, start the SecondaryNameNode. Execute the following command as root and substitute the correct path (to where your Hadoop startup scripts are located):

su -l hdfs -c "export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start secondarynamenode";

Congratulations, you have successfully “kerberized” HDFS services! You can now start MapReduce services as well (you have already set up the necessary principals and configuration in MapReduce configuration files).
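A quick way to confirm that authentication is actually being enforced is to try an HDFS operation with and without a Kerberos ticket (a sketch; the exact error text varies by Hadoop version):

kdestroy                                  # make sure no ticket is cached
hadoop fs -ls /                           # fails with a GSS/Kerberos error because no credentials are available
kinit -kt /etc/hadoop/conf/hdfs.keytab hdfs/pract_hdp_sec@EXAMPLE.COM
hadoop fs -ls /                           # succeeds, using a service ticket obtained with the cached TGT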

Please understand that the commands I have used in this section may vary with the version of the operating system (and the Hadoop distribution). It is always best to consult your operating system and Hadoop distributor’s manual in case of any errors or unexpected behavior.

Securing Client-Server Communications

With earlier Hadoop versions, when daemons (or services) communicated with each other, they didn't verify that the other service was really what it claimed to be. So, it was easily possible to start a rogue TaskTracker to get access to data blocks. Impersonating services could easily get access to sensitive data, destroy data, or bring the cluster down! Even now, unless you have Kerberos installed and configured and also have the right communication protocols encrypted, the situation is not very different. It is very important to secure inter-process communication for Hadoop. Just using an authentication mechanism (like Kerberos) is not enough. You also have to secure all the means of communication Hadoop uses to transfer data between its daemons, as well as communication between clients and the Hadoop cluster.

Inter-node communication in Hadoop uses the RPC, TCP/IP, and HTTP protocols. Specifically, RPC (remote procedure call) is used for communication between NameNode, JobTracker, DataNodes, and Hadoop clients. Also, the actual reading and writing of file data between clients and DataNodes uses TCP/IP protocol, which is not secured by default, leaving the communication open to attacks. Last, HTTP protocol is used for communication by web consoles, for communication between NameNode/Secondary NameNode, and also for MapReduce shuffle data transfers. This HTTP communication is also open to attacks unless secured.

Therefore, you must secure all these Hadoop communications in order to secure the data stored within a Hadoop cluster. Your best option is to use encryption. Encrypted data can't be used by malicious attackers unless they have a means of decrypting it. The method of encryption you employ depends on the protocol involved. To encrypt TCP/IP communication, for example, a SASL wrapper is required on top of the Hadoop data transfer protocol to ensure secured data transfer between the Hadoop client and DataNode. The current version of Hadoop allows network encryption (in conjunction with Kerberos) by setting explicit values in the configuration files core-site.xml and hdfs-site.xml. To secure inter-process communications between Hadoop daemons, which use the RPC protocol, you need to use the SASL framework. The next sections will take a closer look at encryption, starting with RPC-based communications.

Safe Inter-process Communication

Inter-process communication in Hadoop is achieved through RPC calls. That includes communication between a Hadoop client and HDFS and also among Hadoop services (e.g., between JobTracker and TaskTrackers or NameNode and DataNodes).

SASL (Simple Authentication and Security Layer) is the authentication framework that can be used to guarantee that data exchanged between the client and servers is encrypted and not vulnerable to “man-in-the-middle” attacks (please refer to Chapter 1 for details of this type of attack). SASL supports multiple authentication mechanisms (e.g., MD5-DIGEST, GSSAPI, SASL PLAIN, CRAM-MD5) that can be used for different contexts.

For example, if you are using Kerberos for authentication, then SASL uses the GSSAPI (Generic Security Service Application Program Interface) mechanism to authenticate any communication between Hadoop clients and Hadoop daemons. For a secure Hadoop client (authenticated using Kerberos) submitting jobs, delegation token authentication is used, which is based on the SASL MD5-DIGEST protocol. The client requests a token from the NameNode and passes the received token on to the TaskTracker; it can use that token for any subsequent communication with the NameNode.

When you set the hadoop.rpc.protection property in Hadoop configuration file core-site.xml to privacy, the data over RPC will be encrypted with symmetric keys. Here’s the XML:

<property>
<name>hadoop.rpc.protection</name>
<value>privacy</value>
<description>authentication, integrity & confidentiality guarantees that data exchanged between client and server is encrypted
</description>
</property>

Encryption comes at a price, however. As mentioned in Table 4-1, setting hadoop.rpc.protection to privacy means Hadoop performs integrity checks, encryption, and authentication, and all of this additional processing will degrade performance.

Encrypting HTTP Communication

Hadoop uses HTTP communication for web consoles, for communication between the NameNode and Secondary NameNode, and for MapReduce (shuffle data). For a MapReduce job, the data moves between the Mappers and the Reducers via the HTTP protocol in a process called a shuffle. The Reducer initiates a connection to the Mapper, requesting data, and acts as an SSL client. The steps for enabling HTTPS to encrypt shuffle traffic are detailed next.

Certificates are used to secure the communication that uses HTTP protocol. You can use the Java utility keytool to create and store certificates. Certificates are stored within KeyStores (files) and contain keys (private key and identity) or certificates (public keys and identity). For additional details about KeyStores, please refer to Chapter 8 and Appendix C. A TrustStore file contains certificates from trusted sources and is used by the secure HTTP (https) clients. Hadoop HttpServer uses the KeyStore files.
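As a sketch of how such certificates are created with keytool (the alias, filenames, and validity here are placeholders; Appendix C walks through a complete setup), you would generate a key pair on a node, export its certificate, and import that certificate into a TrustStore:

keytool -genkey -alias hadoop-node -keyalg RSA -keysize 2048 -keystore keystore.jks -validity 365
        # creates (or updates) the KeyStore with a key pair; prompts for passwords and identity details
keytool -export -alias hadoop-node -keystore keystore.jks -file hadoop-node.cer
        # exports the public certificate
keytool -import -alias hadoop-node -file hadoop-node.cer -keystore truststore.jks
        # adds the certificate to a TrustStore used by clients that must trust this node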

After you create the HTTPS certificates and distribute them to all the nodes, you can configure Hadoop for HTTP encryption. Specifically, you need to configure SSL on the NameNode and all DataNodes by setting property dfs.https.enable to true in the Hadoop configuration file hdfs-site.xml.

Most of the time, SSL is configured to authenticate the server only, a mode called one-way SSL. For one-way SSL, you only need to configure the KeyStore on the NameNode (and each DataNode), using the properties shown in Table 4-7. These parameters are set in the ssl-server.xml file on the NameNode and each of the DataNodes.

You can also configure SSL to authenticate the client; this mode is called mutual authentication or two-way SSL. To configure two-way SSL, set the property dfs.client.https.need-auth to true in the Hadoop configuration file hdfs-site.xml (on the NameNode and each DataNode), in addition to setting the property dfs.https.enable to true.
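In hdfs-site.xml, these two settings look like the following fragment (dfs.client.https.need-auth is needed only for two-way SSL):

<property>
  <name>dfs.https.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.https.need-auth</name>
  <value>true</value>
</property>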

Table 4-7. SSL Properties to Encrypt HTTP Communication

ssl.server.keystore.type (default: jks)
  KeyStore file type.

ssl.server.keystore.location (default: NONE)
  KeyStore file location. The mapred user should own this file and have exclusive read access to it.

ssl.server.keystore.password (default: NONE)
  KeyStore file password.

ssl.server.truststore.type (default: jks)
  TrustStore file type.

ssl.server.truststore.location (default: NONE)
  TrustStore file location. The mapred user must be the file owner, with exclusive read access.

ssl.server.truststore.password (default: NONE)
  TrustStore file password.

ssl.server.truststore.reload.interval (default: 10000)
  TrustStore reload interval, in milliseconds.

Appendix C has details of setting up KeyStore and TrustStore to use for HTTP encryption.

To configure an encrypted shuffle, you need to set the properties listed in Table 4-8 in the core-site.xml files of all nodes in the cluster.

Table 4-8. core-site.xml Properties for Enabling Encrypted Shuffle (for MapReduce)

hadoop.ssl.enabled = true
  For MRv1, setting this value to true enables both the Encrypted Shuffle and the Encrypted Web UI features. For MRv2, this property only enables the Encrypted Web UI; Encrypted Shuffle is enabled with a property in the mapred-site.xml file, as described later in this section.

hadoop.ssl.require.client.cert = true
  When set to true, client certificates are required for all shuffle operations and all browsers used to access Web UIs.

hadoop.ssl.hostname.verifier = DEFAULT
  The hostname verifier to provide for HttpsURLConnections. Valid values are DEFAULT, STRICT, STRICT_IE6, DEFAULT_AND_LOCALHOST, and ALLOW_ALL.

hadoop.ssl.keystores.factory.class = org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory
  The KeyStoresFactory implementation to use.

hadoop.ssl.server.conf = ssl-server.xml
  Resource file from which SSL server KeyStore information is extracted. This file is looked up in the classpath; typically it should be in the /etc/hadoop/conf/ directory.

hadoop.ssl.client.conf = ssl-client.xml
  Resource file from which SSL client KeyStore information is extracted. This file is looked up in the classpath; typically it should be in the /etc/hadoop/conf/ directory.

To enable Encrypted Shuffle for MRv2, set the property mapreduce.shuffle.ssl.enabled in the mapred-site.xml file to true on every node in the cluster.
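The corresponding mapred-site.xml entry looks like this:

<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>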

To summarize, for configuring Encrypted Shuffle (for MapReduce jobs) and Encrypted Web UIs, the following configuration files need to be used/modified:

  • core-site.xml/hdfs-site.xml: for enabling HTTP encryption and defining implementation
  • mapred-site.xml: enabling Encrypted Shuffle for MRv2
  • ssl-server.xml: storing KeyStore and TrustStore settings for server
  • ssl-client.xml: storing KeyStore and TrustStore settings for the client

Securing Data Communication

Data transfer (read/write) between clients and DataNodes uses the Hadoop Data Transfer Protocol. Because the SASL framework is not used here for authentication, a SASL handshake or wrapper is required if this data transfer needs to be secured or encrypted. This wrapper can be enabled by setting the property dfs.encrypt.data.transfer to true in configuration file hdfs-site.xml. When the SASL wrapper is enabled, a data encryption key is generated by NameNode and communicated to DataNodes and the client. The client uses the key as a credential for any subsequent communication. NameNode and DataNodes use it for verifying the client communication.

If you have a preference regarding the actual algorithm you want to use for encryption, you can specify it using the property dfs.encrypt.data.transfer.algorithm. The possible values are 3des or rc4 (the default is usually 3DES). 3DES, or "triple DES," is a variation of the popular symmetric key algorithm DES that uses three keys (instead of the single key DES uses) to add strength to the protocol. You encrypt with one key, decrypt with the second, and encrypt with a third. This process gives a strength equivalent to a 112-bit key (instead of DES's 56-bit key) and makes the encryption stronger, but it is slow (due to the multiple iterations of encryption). Please refer to Chapter 8 for additional details on the DES protocol. RC4 is another symmetric key algorithm that performs encryption much faster than 3DES but is potentially unsafe (Microsoft and Cisco are both phasing out this algorithm and have issued clear guidelines to their users to avoid any usage of it).
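In hdfs-site.xml, the two settings discussed above take the following form (the algorithm property is optional; omit it to accept the default):

<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <value>rc4</value>
</property>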

Please note that since RPC protocol is used to send the Data Encryption Keys to the clients, it is necessary to configure the hadoop.rpc.protection setting to privacy in the configuration file core-site.xml (for client and server both), to ensure that the transfer of keys themselves is encrypted and secure.

Summary

In this chapter you learned how to establish overall security, or a "fence," for your Hadoop cluster, starting with the client. Currently, PuTTY offers the best open-source options for securing your client. I discussed using a key pair and passphrase instead of the familiar login/password alternative. The reason is simple: to make it harder for malicious attacks to break through your security. Nearly everyone has used PuTTY, but many people don't think about the underlying technology or the reasons for using some of the available options. I have tried to shed some light on those aspects of PuTTY.

I am not sure if MIT had Hadoop in mind when they developed Kerberos; but the current usage of Kerberos with Hadoop might make you think otherwise! Again, it is (by far) the most popular alternative for Hadoop authentication.

Dealing with KeyStores and TrustStores is always a little harder for non-Java personnel. If you need another example, Appendix C will help further your understanding of those concepts.

The use of SASL protocol for RPC encryption and the underlying technology for encrypting data transfer protocol are complex topics. This chapter’s example of implementing a secure cluster was merely intended to introduce the topic.

Where do you go from here? Is the job finished now that the outer perimeter of your cluster is secure? Certainly not! This is where it begins—and it goes on to secure your cluster further by specifying finer details of authorization. That’s the subject of the next chapter.
