Hadoop and Kerberos

As you saw in the previous sections, Hadoop provides all the components to restrict access to various resources and services. There is still one piece of the puzzle missing, though. Since Hadoop doesn't maintain any internal user database, it has to completely trust users' identities as provided by the operating system. While Linux-based operating systems authenticate users with passwords or public/private key pairs, once a user is logged in, there is no way for Hadoop to verify that user's identity. In the early versions of Hadoop, the HDFS and MapReduce clients executed the equivalent of the whoami shell command to determine the identity of the current user.

This was a very insecure way of doing things, because it allowed a rogue user to simply replace the whoami command with a custom script that returned any username they liked.

In the latest versions of Hadoop, the code that retrieves the user identity was changed to use the Java SecurityManager API, but this approach is still open to various security issues. One could modify the client source code to assume any identity and use the altered program to connect to the cluster. There are other ways of gaining unauthorized access to the cluster as well; for example, an attacker might intercept and alter traffic between the client and Hadoop services, since it is not encrypted.

To address this problem, Hadoop supports authentication via the external Kerberos protocol. Kerberos is a protocol designed to allow participants to securely identify and authenticate themselves over an insecure network. There are different implementations of this protocol available, but we will focus on MIT Kerberos.

Kerberos overview

Before we go over the steps required to implement Kerberos authentication with Hadoop, it is worth giving a brief overview of this protocol. Kerberos is a client-server protocol. It consists of a Key Distribution Center (KDC) and client programs.

The KDC, in turn, consists of several components. The Authentication Server (AS) is responsible for verifying the user's identity and issuing a Ticket-Granting Ticket (TGT). The AS has a local copy of the user's password, and each TGT is encrypted with this password. When a client receives a TGT, it tries to decrypt it using the password that the user provides. If this password matches the one that the AS stores, the TGT can be successfully decrypted and used. The decrypted TGT is then used to obtain a service ticket from the Ticket-Granting Service (TGS), and it is this ticket that is used to authenticate the user to all the required services.

In Kerberos terminology, a user is called a principal. A principal consists of the following three components:

  • Primary component: This is essentially a username.
  • Instance component: This can be used to identify different roles for the same user or, in the Hadoop case, to identify the same user on different servers.
  • Realm component: This can be thought of as being analogous to a domain in DNS.

Here is an example of a Kerberos principal:

alice/dn1.hadoop.test.com@HADOOP.TEST.COM

This is how the user alice, connecting from one of the DataNodes, would present herself to the KDC.

Here is how the user would authenticate with the KDC and receive a TGT:

[alice@dn1]$ kinit
Password for alice@HADOOP.TEST.COM:

The ticket obtained this way will be cached on the local filesystem and will be valid for the duration specified by the KDC administrator. Normally, this time frame is 8 to 12 hours, so users don't have to enter their passwords for every single operation. To be able to properly identify the realm, Kerberos client programs need to be installed on the server and the configuration needs to be provided in the /etc/krb5.conf file.
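
For example, you can inspect the cached ticket with the klist command. The output will look roughly like the following (the cache location, timestamps, and principal shown here are only illustrative):

[alice@dn1]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: alice@HADOOP.TEST.COM

Valid starting     Expires            Service principal
09/16/13 09:10:22  09/16/13 19:10:22  krbtgt/HADOOP.TEST.COM@HADOOP.TEST.COM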

Kerberos in Hadoop

When Hadoop is configured with Kerberos support, all cluster interactions need to be authenticated with the KDC first. This applies not only to cluster users, but to all Hadoop services as well. For example, when Kerberos support is enabled, a DataNode needs to have a proper ticket before it can communicate with the NameNode.

This complicates the initial deployment, since you will need to generate principals for every service on every Hadoop node, as well as create principals for every cluster user. Since Hadoop services cannot provide passwords interactively, they use pregenerated keytab files, which are placed on each server.

After all principals are created and keytab files are distributed on all the servers, you will need to adjust the Hadoop configuration file to specify the principal and keytab file locations.

Note

At this point, you should decide whether implementing Kerberos on your cluster is required. Depending on the environment and the type of data stored in the cluster, you may find that the basic authentication provided by the OS is enough in your case. If you have strict security requirements, implementing Kerberos support is the only solution available right now. Keep in mind that, when enabled, Kerberos affects all the services in the cluster. It is not possible to implement partial support, say, for external users only.

Configuring Kerberos clients

We will not review the installation and configuration of the KDC, since it is a vast topic in itself. We will assume that you have a dedicated MIT Kerberos Version 5 KDC installed and configured, and that you have KDC administrator account privileges.

The first task is to install and configure the Kerberos client on all the servers. To install the client programs, run the following command:

# yum install krb5-workstation.x86_64

After the client is installed, you need to edit the /etc/krb5.conf file and provide the Hadoop realm that was configured on the KDC. We will use the HADOOP.TEST.COM realm in all the following examples. The name of the realm doesn't matter much in this case, and you can choose a different one if you'd like. In a production setup, you may want to use different realms for different clusters, such as production and QA.
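
As a rough sketch, the relevant sections of /etc/krb5.conf could look like the following; the KDC hostname kdc.hadoop.test.com is only an assumption used for illustration:

[libdefaults]
  default_realm = HADOOP.TEST.COM

[realms]
  HADOOP.TEST.COM = {
    kdc = kdc.hadoop.test.com
    admin_server = kdc.hadoop.test.com
  }

[domain_realm]
  .hadoop.test.com = HADOOP.TEST.COM
  hadoop.test.com = HADOOP.TEST.COM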

Generating Kerberos principals

We will generate principals and keytab files for HDFS, MapReduce, and HTTP services. The HTTP principal is required to support built-in web services that are part of the HDFS and MapReduce daemons, and expose some status information to the users.

We will demonstrate how to generate these principals for one DataNode, because DataNodes will require HDFS, MapReduce, and HTTP principals to be specified. You will need to repeat this procedure for all the hosts in your cluster.

Tip

Automating principals generation

You can easily script the commands that create Kerberos principals and generate keytab files, and apply them to all servers. This will help you avoid typos and mistakes.
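
A minimal sketch of such a script, run as root on the KDC, is shown below; the list of hostnames and the keytab file naming are assumptions that you would adapt to your own cluster:

#!/bin/bash
# Create the service principals and keytab files for every node in the cluster.
REALM=HADOOP.TEST.COM
for host in nn1.hadoop.test.com dn1.hadoop.test.com dn2.hadoop.test.com; do
  kadmin.local -q "addprinc -randkey hdfs/${host}@${REALM}"
  kadmin.local -q "addprinc -randkey mapred/${host}@${REALM}"
  kadmin.local -q "addprinc -randkey HTTP/${host}@${REALM}"
  kadmin.local -q "xst -norandkey -k hdfs-${host}.keytab hdfs/${host}@${REALM} HTTP/${host}@${REALM}"
  kadmin.local -q "xst -norandkey -k mapred-${host}.keytab mapred/${host}@${REALM} HTTP/${host}@${REALM}"
done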

Log in to the KDC server, switch to the root user, and execute the following commands:

# kadmin.local
Authenticating as principal root/admin@HADOOP.TEST.COM with password.

Some command-line output has been omitted for brevity. In the kadmin.local console, run the following commands:

addprinc -randkey HTTP/dn1.hadoop.test.com@HADOOP.TEST.COM
addprinc -randkey hdfs/dn1.hadoop.test.com@HADOOP.TEST.COM
addprinc -randkey mapred/dn1.hadoop.test.com@HADOOP.TEST.COM

The preceding commands will generate three principals with random passwords. We also need to generate keytab files for the hdfs and mapred principals. To do this, execute the following commands in the kadmin.local console:

xst -norandkey -k hdfs.keytab hdfs/dn1.hadoop.test.com@HADOOP.TEST.COM HTTP/dn1.hadoop.test.com@HADOOP.TEST.COM
xst -norandkey -k mapred.keytab mapred/dn1.hadoop.test.com@HADOOP.TEST.COM HTTP/dn1.hadoop.test.com@HADOOP.TEST.COM

The preceding commands will generate two files: hdfs.keytab and mapred.keytab. Copy these files to the appropriate server and place them in the /etc/hadoop/conf directory. To secure the keytab files, change the ownership of the files to hdfs:hdfs and mapred:mapred respectively, and make sure that only these users are allowed to read the contents of the files.
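
For example, on one of the DataNodes, the commands could look like this (assuming the keytab files have already been copied into /etc/hadoop/conf):

# chown hdfs:hdfs /etc/hadoop/conf/hdfs.keytab
# chown mapred:mapred /etc/hadoop/conf/mapred.keytab
# chmod 400 /etc/hadoop/conf/hdfs.keytab /etc/hadoop/conf/mapred.keytab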

Before you move on to the next step, make sure that the principals for all the nodes have been generated and that the keytab files have been copied to all the servers.
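
A quick way to verify a keytab file is to list the principals it contains with klist; the output should include the service and HTTP principals for that particular host:

# klist -k -t /etc/hadoop/conf/hdfs.keytab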

Enabling Kerberos for HDFS

To enable Kerberos security, add the following option to the core-site.xml configuration file:

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>

The default value for this variable is simple, which disables Kerberos support. Make sure you propagate the changes to core-site.xml to all the servers in the cluster.
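
One simple way to do this is to copy the file from a single node; a rough sketch, assuming root SSH access and the hostnames used earlier:

# for host in nn1.hadoop.test.com dn1.hadoop.test.com dn2.hadoop.test.com; do scp /etc/hadoop/conf/core-site.xml ${host}:/etc/hadoop/conf/; done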

To configure Kerberos support for HDFS, you need to add the following options to the hdfs-site.xml file. It is important that this file is copied to all the HDFS servers in the cluster, because Kerberos authentication is bi-directional; the DataNodes, for example, need to know the NameNode's principal to be able to communicate with it.

<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>

<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@HADOOP.TEST.COM</value>
</property>

<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>hdfs/_HOST@HADOOP.TEST.COM</value>
</property>

<property>
  <name>dfs.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@HADOOP.TEST.COM</value>
</property>

<property>
  <name>dfs.datanode.kerberos.http.principal</name>
  <value>HTTP/_HOST@HADOOP.TEST.COM</value>
</property>

<property>
  <name>dfs.journalnode.kerberos.principal</name>
  <value>hdfs/_HOST@HADOOP.TEST.COM</value>
</property>

<property>
  <name>dfs.journalnode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@HADOOP.TEST.COM</value>
</property>

The preceding options specify all the HDFS-related principals. Additionally, since we have configured NameNode High Availability, we have specified the principal for the JournalNode as well. The _HOST token in these options will be replaced by the fully qualified hostname of the server at runtime; for example, on the dn1 DataNode, hdfs/_HOST@HADOOP.TEST.COM resolves to hdfs/dn1.hadoop.test.com@HADOOP.TEST.COM.

Next, we need to provide the location of keytab files for HDFS principals:

<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>

<property>
  <name>dfs.datanode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>

<property>
  <name>dfs.journalnode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>

One of the security requirements, not directly related to Kerberos, is to run the DataNode services on privileged ports, that is, ports with numbers below 1024. This is done to prevent a scenario in which a rogue user writes a sophisticated MapReduce job that presents itself to the cluster as a valid DataNode. When security is enabled, you must make the following changes in the configuration file:

<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:1004</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:1006</value>
</property>

Finally, you need to create a /etc/default/hadoop-hdfs-datanode file with the following content:

export HADOOP_SECURE_DN_USER=hdfs
export HADOOP_SECURE_DN_PID_DIR=/var/lib/hadoop-hdfs
export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop-hdfs
export JSVC_HOME=/usr/lib/bigtop-utils/

Enabling Kerberos for MapReduce

The changes that need to be applied to mapred-site.xml are very similar to what we have already done for HDFS. We need to provide principals and keytab file locations for the JobTracker, the TaskTrackers, and their embedded web servers:

<property>
  <name>mapreduce.jobtracker.kerberos.principal</name>
  <value>mapred/_HOST@HADOOP.TEST.COM</value>
</property>

<property>
  <name>mapreduce.jobtracker.kerberos.http.principal</name>
  <value>HTTP/_HOST@HADOOP.TEST.COM</value>
</property>

<property>
  <name>mapreduce.tasktracker.kerberos.principal</name>
  <value>mapred/_HOST@HADOOP.TEST.COM</value>
</property>

<property>
  <name>mapreduce.tasktracker.kerberos.http.principal</name>
  <value>HTTP/_HOST@HADOOP.TEST.COM</value>
</property>

<property>
  <name>mapreduce.jobtracker.keytab.file</name>
  <value>/etc/hadoop/conf/mapred.keytab</value>
</property>

<property>
  <name>mapreduce.tasktracker.keytab.file</name>
  <value>/etc/hadoop/conf/mapred.keytab</value>
</property>

One thing that is specific to the MapReduce part of Hadoop, when it comes to security, is the fact that user code is launched by the TaskTracker in a separate JVM. By default, this separate process runs as the user who started the TaskTracker itself, which could grant the job more permissions than its owner needs. When security is enabled, the TaskTracker instead changes the ownership of the process to the user who launched the job. To support this, the following options need to be added:

<property>
  <name>mapred.task.tracker.task-controller</name>
  <value>org.apache.hadoop.mapred.LinuxTaskController</value>
</property>
<property>
  <name>mapreduce.tasktracker.group</name>
  <value>mapred</value>
</property>

Additionally, a separate taskcontroller.cfg file needs to be created in /etc/hadoop/conf. This file will specify the users who are allowed to launch tasks on this cluster. The following is the content of this file for our cluster:

mapred.local.dir=/dfs/data1/mapred,/dfs/data2/mapred,/dfs/data3/mapred,/dfs/data4/mapred,/dfs/data5/mapred,/dfs/data6/mapred  
hadoop.log.dir=/var/log/hadoop-0.20-mapreduce
mapreduce.tasktracker.group=mapred
banned.users=mapred,hdfs,bin
min.user.id=500

When running in secure mode, the TaskTracker will launch different jobs under different users. We need to specify the locations of the local directories in taskcontroller.cfg to allow the TaskTracker to set permissions on them properly. We have also used the banned.users option to specify users that are not allowed to execute MapReduce tasks; this is required to prevent privileged users from bypassing security checks and accessing local data. For the same reason, the min.user.id option disallows job submission by any system user with a user ID below 500 (the boundary for system accounts on CentOS).

After you have propagated these configuration files to all the nodes, you will need to restart all the services in the cluster and pay close attention to the messages in the logfiles. As you can see, configuring a secure Hadoop cluster is not a simple task and involves a lot of steps, so it is important to double-check that all the services are working properly.
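
A quick way to confirm that authentication is now enforced is to access HDFS with and without a valid Kerberos ticket; the user alice is, again, just an illustration:

[alice@dn1]$ kdestroy
[alice@dn1]$ hadoop fs -ls /     # fails with an authentication error
[alice@dn1]$ kinit
Password for alice@HADOOP.TEST.COM:
[alice@dn1]$ hadoop fs -ls /     # succeeds now that a ticket is cached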
