MapReduce security

MapReduce security is focused on job submission and administration. By default, it is wide open: any user who has access to the JobTracker service can submit, view, and kill jobs. Such behavior may be acceptable for development or POC clusters, but it is clearly inadequate for a multitenant production environment.

To address these problems, Hadoop supports the notions of cluster administrators and queue administrators. Cluster and queue administrators are Linux users and groups that have permissions to see and manipulate running jobs. Administrators can, for example, change a job's priority or kill any running job.
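
A queue or cluster administrator can then manage other users' jobs with the standard job commands. For example, the following raises a job's priority and then kills it (<job_id> is a placeholder for a real job ID):

# mapred job -set-priority <job_id> HIGH
# mapred job -kill <job_id>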

If you recall, in Chapter 2, Installing and Configuring Hadoop, we configured our JobTracker to use FairScheduler. With this scheduler, you can define a fixed set of job queues and allow specific users and groups to submit jobs to them. Each queue can also be configured with its own list of administrators.

To enable the permissions model, you need to make some changes to your mapred-site.xml file:

<property>
  <name>mapred.acls.enabled</name>
  <value>true</value>
</property>

Next, you need to set the cluster level mapred administrators:

<property>
  <name>mapred.cluster.administrators</name>
  <value>alice,bob admin</value>
</property>

In the preceding example, we have assigned administrator access to the Linux users alice and bob, and to all users in the admin Linux group. You can specify multiple users and groups as comma-separated lists; the users list must be separated from the groups list by a space. The "*" symbol means everyone can perform administrative tasks on MapReduce jobs.
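
For instance, to grant administrative rights to alice and bob plus everyone in the admin and ops groups, or to open administration up to everybody, the value could be written in one of the following ways (ops is just an illustrative group name):

<!-- ops is an illustrative group name -->
<value>alice,bob admin,ops</value>
<value>*</value>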

Very often, production Hadoop clusters execute jobs submitted by different groups within an organization. Such groups can have different priorities, and their jobs can be of different importance. For example, there can be a production group whose jobs provide data for business-critical applications and an analytics group performing background data mining. You can define which users can access which queues, as well as assign separate administrators for each queue.

First of all, you need to create a list of named queues in mapred-site.xml:

<property>
  <name>mapred.queue.names</name>
  <value>production,analytics</value>
</property>

Permissions for each job queue are defined in a separate file called mapred-queue-acls.xml. This file needs to be placed in the /etc/hadoop/conf directory on the JobTracker node. CDH provides a template file, mapred-queues.xml.template, which you can use as a baseline.

The format of this file is a little bit different from the other Hadoop configuration files. The following is an example of what it may look like:

<queues>
 <queue>
   <name>production</name>
    <acl-submit-job> prodgroup</acl-submit-job>
    <acl-administer-jobs>alice </acl-administer-jobs>
 </queue>
 <queue>
   <name>analytics</name>
    <acl-submit-job> datascience</acl-submit-job>
    <acl-administer-jobs>bob </acl-administer-jobs>
 </queue>
</queues>

In the preceding example, we have defined two queues: production and analytics. Each queue supports a list of users and groups who can submit jobs to it, as well as a list of administrators. For the production queue, we have limited submission rights to the prodgroup Linux group using the acl-submit-job option. Note that there are no individual users listed and that there is a leading space character before the group name. We have chosen alice as the production queue administrator and specified her using the acl-administer-jobs option. This particular configuration does not have a group in the list of administrators, so a space character follows the username.
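
If a queue should accept jobs both from specific users and from a group, both lists go in the same value, separated by a space. As a purely illustrative example, the following would let the users carol and dave, plus everyone in the datascience group, submit jobs to the analytics queue:

<!-- carol and dave are illustrative user names -->
<acl-submit-job>carol,dave datascience</acl-submit-job>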

After you have made all the changes to mapred-site.xml, you need to restart the JobTracker service. Changes to mapred-queue-acls.xml are picked up automatically and no restart is required.
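
On a CDH cluster running the MRv1 packages, restarting the JobTracker typically looks like the following (the exact service name depends on your distribution and version):

# service hadoop-0.20-mapreduce-jobtracker restart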

To submit a job to a given queue, you can use the mapred.job.queue.name option. For example, to submit a WordCount job into the analytics queue, you can use the following command:

# hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount -Dmapred.job.queue.name=analytics /tmp/word_in /tmp/word_out 

You can monitor the list of active queues and the jobs that are assigned to a particular queue by running the following mapred commands:

# mapred queue -list
# mapred queue -info analytics -showJobs