CHAPTER 5


Implementing Granular Authorization

Designing fine-grained authorization reminds me of a story of a renowned bank manager who was very disturbed by a robbery attempt made on his safe deposit vault. The bank manager was so perturbed that he immediately implemented multiple layers of security and passwords for the vault. The next day, a customer request required that he open the vault. The manager, in all his excitement, forgot the combination, and the vault had to be forced open (legally, of course).

As you may gather, designing fine-grained security is a tricky proposition. Too much security can be as counterproductive as too little. There is no magic to getting it just right. If you analyze all your processes (both manual and automated) and classify your data well, you can determine who needs access to which specific resources and what level of access is required. That’s the definition of fine-grained authorization: every user has the correct level of access to necessary resources. Fine-tuning Hadoop security to allow access driven by functional need will make your Hadoop cluster less vulnerable to hackers and unauthorized access—without sacrificing usability.

In this chapter, you will learn how to determine security needs (based on application) and then examine ways to design high-level security and fine-grained authorization for applications, using directory and file-level permissions. To illustrate, I’ll walk you through a modified real-world example involving traffic ticket data and access to that data by police, the courts, and reporting agencies. The chapter wraps up with a discussion of implementing fine-grained authorization using Apache Sentry, revisiting the traffic ticket example to highlight Sentry usage with Hive, a database that works with HDFS. By the end of this chapter, you will have a good understanding of how to design fine-grained authorization.

Designing User Authorization

Defining the details of fine-grained authorization is a multistep process. Those steps are:

  1. Analyze your environment,
  2. Classify data for access,
  3. Determine who needs access to what data,
  4. Determine the level of necessary access, and
  5. Implement your designed security model.

The following sections work through this complete process to define fine-grained authorization for a real-world scenario.

Call the Cops: A Real-World Security Example

I did some work for the Chicago Police Department a few years back involving the department’s ticketing system. The system essentially has three parts: mobile consoles in police cars, a local database at the local police station, and a central database at police headquarters in downtown Chicago. Why is fine-tuned authorization important in this scenario? Consider the potential for abuse without it: if the IT department has modification permissions for the data, for example, someone with a vested interest could modify data for a particular ticket. The original system was developed using Microsoft SQL Server, but for my example, I will redesign it for a Hadoop environment. Along the way, you’ll also learn how a Hadoop implementation is different from a relational database–based implementation.

Analyze and Classify Data

The first step is inspecting and analyzing the system (or application) involved. In addition, reviewing the high-level objective and use cases for the system helps clarify access needs. Don’t forget maintenance, backup, and disaster recovery when considering use cases. A system overview is a good starting point, as is reviewing the manual processes involved (in their logical order). In both cases, your goals are to understand the functional requirements within each process, how the processes interact with each other, what data is generated within each process, and how that data is communicated to the next process. Figure 5-1 illustrates the analysis of a system.


Figure 5-1. Analyzing a system or an application

In my example, the first process is the generation of ticket data by a police officer (who issues the ticket). That data gets transferred to the database at the local police station, and modification rights to it are obviously needed by the ticketing officer, his or her supervisor at the station, and of course upper management at police headquarters.

Other police officers at the local station need read permissions for this data, as they might want to have a look at all the tickets issued on a particular day or at a person’s driving history while deciding whether to issue a ticket or only a warning. Thus, a police officer looks up the ticket data (using the driver’s Social Security number, or SSN) at the local police station database (for the current day) as well as at the central database located at police headquarters.

As a second process, the ticket data from local police stations (from all over the Chicago area) gets transmitted to the central database at police headquarters on a nightly basis.

The third and final process is automated generation of daily reports every night for supervisors at all police stations. These reports summarize the day’s ticketing activity and are run by a reporting user (created by IT).

Details of Ticket Data

This ticket data is not a single entity, but rather a group of related entities that hold all the data. Understanding the design of the database holding this data will help in designing a detailed level of security.

Two tables, or files in Hadoop terms, hold all the ticket data. Just as tables are used to store data in a relational database, HDFS uses files. In this case, assume Hadoop stores the data as comma-delimited text files. (Of course, Hadoop supports a wide range of formats for storing data, but a simple example facilitates better understanding of the concepts.) The table and file details are summarized in Figure 5-2.


Figure 5-2. Details of ticket data: classification of information and storage in tables versus files

The first table, Driver_details, holds all the personal details of the driver: full legal name, SSN, current address, phone number, and so on. The second table, Ticket_details, has details of the ticket: ticket number, driver’s SSN, offense code, issuing officer, and so forth.

Also, these tables are “related” to each other. The relational notation indicates that every driver (appearing in Driver_details) may have one or more tickets in his or her name, the details of which are in Ticket_details. How can a ticket be related to a driver? By using the SSN. The SSN is a primary key (indicated as PK), or unique identifier, for the Driver_details table because an SSN identifies a driver uniquely. Since a driver may have multiple tickets, however, the SSN is not a unique identifier for the Ticket_details table and is only used to relate the tickets to a driver (indicated as FK, or foreign key).
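To make the relationship concrete, here are a few purely hypothetical records (every value below is invented for illustration) as they might appear in the two comma-delimited files. Note how the SSN in each Ticket_details record points back to exactly one Driver_details record:

Driver_details (SSN is the PK):
123-45-6789,Smith,John,350 N State St Chicago IL 60654,3125550147

Ticket_details (SSN is the FK; one driver, many tickets):
500001,123-45-6789,Speeding,Officer Kelly
500002,123-45-6789,Illegal parking,Officer Ruiz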

Please understand that the ticket data example is simplistic and just demonstrates how granular permissions can be used. In addition, it makes these assumptions:

  • The example uses Hadoop 2.x, since we will need to append data to all our data files and earlier versions didn’t support appends. All the day’s tickets from local police stations will be appended every night to appropriate data files located at police headquarters.
  • Records won’t be updated, but a status flag will be used to indicate the active record (the most recent record being flagged active and earlier ones inactive).
  • There is no concurrent modification to records.
  • There are no failures while writing data that will compromise the integrity of the system.
  • The functional need is that only the police officer (who issues the ticket), that officer’s supervisor, and higher management should have rights to modify a ticket—but this desired granularity is not possible with HDFS! Hive or another NoSQL database needs to be used for that purpose. For now, I have just provided modification rights to all police officers. In the next section, however, you will learn how to reimplement this example using Hive and Sentry to achieve the desired granularity.

A production system would likely be much more complex and need a more involved design for effective security.

Getting back to our example, how do we design roles for securing the ticket data and providing access based on need? Our design must satisfy all of the functional needs (for all processes within the system) without providing too much access (due to sensitivity of data). The next part of this section explains how.

Determine Access Groups and Their Access Levels

Based on the functional requirements of the three processes, read and write (modify) access permissions are required for the ticket data. The next question is, what groups require which permissions (Figure 5-3)? Three subgroups need partial read and write access to this data; call them Group 1:

  • Ticket-issuing police officer
  • Local police supervisor
  • Higher management at headquarters


Figure 5-3. Group access to ticket data with detailed access permissions

Group 2, the IT department at police headquarters, needs read access. Figure 5-3 illustrates this access.

Table 5-1 lists the permissions.

Table 5-1. Permission Details for Groups and Entities (Tables)

Table               Group 1       Group 2
Driver_details      Read/write    Read
Ticket_details      Read/write    Read

So, to summarize, analyze and classify your data, then determine the logical groups that need access to appropriate parts of the data. With those insights in hand, you can design roles for defining fine-grained authorization and determine the groups of permissions that are needed for these roles.

Logical design (even a very high-level example like the ticketing system) has to result in a physical implementation. Only then can you have a working system. The next section focuses on details of implementing the example’s design.

Implement the Security Model

Implementing a security model is a multistep process. Once you have a good understanding of the roles and their permissions needs, you can begin. These are the steps to follow:

  1. Secure data storage.
  2. Create users and groups as necessary.
  3. Assign owners, groups, and permissions.

Understanding a few basic facts about Hadoop file permissions will help you in this process.

Ticket Data Storage in Hadoop

For the ticketing system example, I will start with implementation of data storage within HDFS. As you saw earlier, data in Hadoop is stored in the files Driver_details and Ticket_details. These files are located within the root data directory of Hadoop, as shown in Figure 5-4. To better understand the figure, consider some basic facts about HDFS file permissions.


Figure 5-4. HDFS directory and file permissions

  • HDFS files have three sets of permissions: for owner, group, and others. The permissions are specified using a ten-character string, such as -rwxrwxrwx.
  • The first character indicates directory or file (- for a file, d for a directory); the next three characters indicate permissions for the file’s owner, the next three for the owner’s group, and the last three for others.
  • Possible values for any position are r (read permission), w (write permission), x (permission to execute), or - (a placeholder indicating no permission). Note that x is valid only for executable files or directories.

In Figure 5-4, the owner of the files Driver_details and Ticket_details (root) has rw- permissions, meaning read and write permissions. The next three characters are the permissions for the group (all the users who belong to the group that owns this file, in this case hdfs). The group permissions are rw-, indicating that all group members have read and write permissions for this file. The last three characters indicate permissions for others (users who neither own the file nor belong to the group that owns it). For this example, others have read permission only (r--).
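As a concrete illustration of how to read such a listing (the sizes and timestamps below are representative, not taken from a real cluster), an HDFS file listing for the two files would look something like this, with the permission string followed by the replication factor, owner, group, size, modification time, and path:

> hdfs dfs -ls /
-rw-rw-r--   1 root hdfs         19 2014-09-19 18:40 /Driver_details
-rw-rw-r--   1 root hdfs         19 2014-09-19 18:40 /Ticket_details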

Adding Hadoop Users and Groups to Implement File Permissions

As a final step in implementing basic authorization for this system, I need to define appropriate users and groups within Hadoop and adjust file permissions.

First, I create groups for this server corresponding to the example’s two groups: Group 1 is called POfficers and Group 2 is ITD.

[root@sandbox ~]# groupadd POfficers
[root@sandbox ~]# groupadd ITD

Listing and verifying the groups is a good idea:

[root@sandbox ~]# cut -d: -f1 /etc/group | grep POfficers

I also create user Puser for group POfficers and user Iuser for group ITD:

[root]# useradd Puser -g POfficers
[root]# useradd Iuser -g ITD

Next, I set up passwords:

[root]# passwd Puser
Changing password for user Puser.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.

Now, as a final step, I allocate owners and groups to implement the permissions. As you can see in Figure 5-5, ownership of the files Driver_details and Ticket_details is changed to the dummy user Puser, the group is changed to POfficers, and the group permissions are set to read/write; so users from group POfficers (all police officers) will have read/write permissions, while users from other groups (e.g., the IT department) will have read permission only.
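A minimal sketch of the commands behind this change (captured in Figure 5-5), assuming the files sit at the HDFS root as shown in Figure 5-4, would be:

[root@sandbox ~]# sudo -u hdfs hdfs dfs -chown Puser:POfficers /Driver_details /Ticket_details
[root@sandbox ~]# sudo -u hdfs hdfs dfs -chmod 664 /Driver_details /Ticket_details

The chmod value 664 translates to -rw-rw-r--: read/write for the owner and group, and read-only for everyone else.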


Figure 5-5. Changing owner and group for HDFS files

Comparing Table 5-1 to the final permissions for all the entities (same-named files in HDFS), you will see that the objective has been achieved: Puser owns the files Driver_details and Ticket_details and belongs to group POfficers (Group 1). The permissions -rw-rw-r-- indicate that anyone from Group 1 has read/write permissions, while users belonging to any other group (e.g., Group 2) have read permission only.

This example gave you a basic idea about fine-tuning authorization for your Hadoop cluster. Unfortunately, the real world is complex, and so are the systems we have to work with! So, to make things a little more real-world, I’ll extend the example, and you can see what happens next to the ticket.

Extending Ticket Data

Tickets only originate with the police. Eventually, the courts get involved to further process the ticket. Thus, some of the ticket data needs to be shared with the judicial system. This group needs read as well as modifying rights on certain parts of the data, but only after the ticket is processed through traffic court. In addition, certain parts of the ticket data need to be shared with reporting agencies who provide this data to insurance companies, credit bureaus, and other national entities as required.

These assumptions won’t change the basic groups, but will require two new ones: one for the judiciary (Group 3) and another for reporting agencies (Group 4). Now the permissions structure looks like Figure 5-6.


Figure 5-6. Group access to ticket data with detailed access permissions, showing new groups

With the added functionality and groups, data will have to be added as well. For example, the table Judgement_details will contain the ticket’s judicial history, such as case date, final judgment, ticket payment details, and more. Hadoop will store this table in a file by the same name (Figure 5-7).


Figure 5-7. Details of ticket data—classification of information—with added table for legal details

Like Figure 5-2, Figure 5-7 also illustrates how the data would be held in tables if a relational database were used for storage. This is just to compare data storage in Hadoop with data storage in a relational database system. As I discussed earlier, data stored within a relational database system is related: driver data (the Driver_details table) is related to ticket data (the Ticket_details table) using the SSN to relate or link the data. With the additional table (Judgement_details), the court judgment for a ticket is again related or linked with driver and ticket details using the SSN.

Hadoop, as you know, uses files for data storage. So, as far as Hadoop is concerned, there is one additional data file for storing data related to judiciary—Judgement_details. There is no concept of relating or linking data stored within multiple files. You can, of course, link the data programmatically, but Hadoop doesn’t do that automatically for you. It is important to understand this difference when you store data in HDFS.

The addition of a table will change the permissions structure as well, as you can see in Table 5-2.

Table 5-2. Permission Details for Groups and Entities

Table               Group 1       Group 2   Group 3       Group 4
Driver_details      Read/write    Read      Read          None
Ticket_details      Read/write    Read      Read          None
Judgement_details   Read          Read      Read/write    Read

Adding new groups increases the permutations of possible permissions, but isn’t helpful in addressing complex permission needs (please refer to the section “Role-Based Authorization with Apache Sentry” to learn about implementing granular permissions). For example, what if the police department wanted only the ticket-issuing officer and the station superintendent to have write permission for a ticket? The groups defined in Figure 5-7 and Table 5-2 clearly could not be used to implement this requirement. For such complex needs, Hadoop provides access control lists (ACLs), which are very similar to the ACLs used by Unix and Linux.

Access Control Lists for HDFS

As per the HDFS permission model, for any file access request HDFS enforces the permissions of the most specific user class that applies. For example, if the requester is the file owner, the owner class permissions are checked. If the requester is a member of the group owning the file, the group class permissions are checked. If the requester is neither the file owner nor a member of the file owner’s group, the others class permissions are checked.

This permission model works well for most situations, but not all. For instance, if all police officers, the manager of the IT department, and the system analyst responsible for managing the ticketing system need write permission to the Ticket_details and Driver_details files, the four existing groups would not be sufficient to implement these security requirements. You could create a new owner group called Ticket_modifiers, but keeping the group’s membership up to date could be problematic due to personnel turnover (people changing jobs), as well as wrong or inadequate permissions caused by manual errors or oversights.

Used for restricting access to data, ACLs provide a very good alternative in situations where your permission needs are complex and specific. Because HDFS uses the same (POSIX-based) permission model as Linux, HDFS ACLs are modeled after the POSIX ACLs that Unix and Linux have used for a long time. ACLs are available in Apache Hadoop 2.4.0 and later, as well as in all the major vendor distributions.

You can use the HDFS ACLs to define file permissions for specific users or groups in addition to the file’s owner and group. ACL usage for a file does result in additional memory usage for NameNode, however, so your best practice is to reserve ACLs for exceptional circumstances and use individual and group ownerships for regular security implementation.

To use ACLs, you must first enable them on the NameNode by adding the following configuration property to hdfs-site.xml and restarting the NameNode:

<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>

Once you enable ACLs, two new commands are added to the HDFS CLI (command line interface): setfacl and getfacl. The setfacl command assigns permissions. With it, I can set up write and read permissions for the ticketing example’s IT Manager (ITMgr) and Analyst (ITAnalyst):

> sudo -u hdfs hdfs dfs -setfacl -m user:ITMgr:rw- /Driver_details
> sudo -u hdfs hdfs dfs -setfacl -m user:ITAnalyst:rw- /Driver_details

With getfacl, I can verify the permissions:

> hdfs dfs -getfacl /Driver_details
# file: /Driver_details
# owner: Puser
# group: POfficers
user::rw-
user:ITAnalyst:rw-
user:ITMgr:rw-
group::r--
mask::rw-
other::r--

When a file has an ACL, the file listing shows a + at the end of the permissions string:

> hdfs dfs -ls /Driver_details
-rw-rw-r--+ 1 Puser POfficers         19 2014-09-19 18:42 /Driver_details

You might have situations where specific permissions need to be applied to all the files in a directory or to all the subdirectories and files for a directory. In such cases, you can specify a default ACL for a directory, which will be automatically applied to all the newly created child files and subdirectories within that directory:

> sudo -u hdfs hdfs dfs -setfacl -m default:group:POfficers:rwx /user

Verifying the permissions shows the default settings were applied:

> hdfs dfs -getfacl /user

# file: /user
# owner: hdfs
# group: hdfs
user::rwx
group::r-x
other::r-x
default:user::rwx
default:group::r-x
default:group:POfficers:rwx
default:mask::rwx
default:other::r-x

Note that in our simple example I left rw- access for all users from group POfficers, so the ACLs really do not restrict anything. In a real-world application, I would most likely have restricted the group POfficers to have less access (probably just read access) than the approved ACL-defined users.

Be aware that HDFS applies the default ACL only to newly created subdirectories or files; applying a default ACL to a parent directory, or changing it later, does not automatically change the ACLs of existing subdirectories or files.

You can also use ACLs to block access to a directory or a file for a specific user without accidentally revoking permissions for any other users. Suppose an analyst has been transferred to another department and therefore should no longer have access to ticketing information:

> sudo -u hdfs hdfs dfs -setfacl -m user:ITAnalyst:--- /Driver_details

Verify the changes:

> hdfs dfs -getfacl /Driver_details

# file: /Driver_details
# owner: Puser
# group: POfficers
user::rw-
user:ITAnalyst:---
user:ITMgr:rw-
group::r--
mask::rw-
other::r--
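Setting the entry to --- keeps the analyst listed in the ACL with no permissions. If you would rather remove the entry altogether, setfacl’s -x option deletes a named entry (a sketch; substitute the path used in your cluster):

> sudo -u hdfs hdfs dfs -setfacl -x user:ITAnalyst /Driver_details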

The key to effectively using ACLs is to understand the order of evaluation for ACL entries when a user accesses an HDFS file. The permissions are evaluated and enforced in the following order:

  1. If the user owns the file, then the owner permissions are enforced.
  2. If the user has a named user ACL entry, then those permissions are enforced.
  3. If the user is a member of the file’s group, then those permissions are used.
  4. If there is a named group ACL entry and the user is a member of that group, then those permissions are used.
  5. If the user is a member of several matching groups (the file’s group and/or groups with named ACL entries), then the union of the permissions from all matching entries is enforced; if none of the matching entries grants the requested access, access is denied.
  6. Last, if no other permissions are applicable, then the permissions for the group others are used.

To summarize, HDFS ACLs are useful for implementing complex permission needs or to provide permissions to a specific user or group different from the file ownership. Remember, however, to use ACLs judiciously, because files with ACLs result in higher memory usage for NameNode. If you do plan to use ACLs, make sure to take this into account when sizing your NameNode memory.

Role-Based Authorization with Apache Sentry

Sentry is an application that provides role-based authorization for data stored in HDFS; it was developed by Cloudera and committed to the Apache Hadoop community. It provides granular authorization that’s very similar to that of a relational database. As of this writing, Sentry is the most mature open source product that offers RBAC (role-based access control) for data stored within HDFS, although another project committed by Hortonworks (Argus) is a challenger. Sentry currently works in conjunction with Hive (a database/data warehouse made available by the Apache Software Foundation) and Impala (a query engine developed by Cloudera and inspired by Google’s Dremel).

Hive Architecture and Authorization Issues

Hive is a database that works with HDFS. Its query language is syntactically very similar to SQL and is one of the reasons for its popularity. The main aspects of the database to remember are the following:

  • Hive structures data into familiar database concepts such as tables, rows, columns, and partitions.
  • Hive supports primitive data types: integers, floats, doubles, and strings.
  • Hive tables are HDFS directories (and files within).
  • Partitions (for a Hive table) are subdirectories within the “table” HDFS directory.
  • Hive privileges exist at the database or table level and can be granted to a user, group, or role.
  • Hive privileges are select (read), update (modify data), and alter (modify metadata).

Hive isn’t perfect, however. It uses a repository (the Metastore) to store metadata related to tables, so you can potentially have a mismatch between table metadata and HDFS permissions if permissions for the underlying HDFS objects are changed directly. Hive doesn’t have the capability to prevent or identify such a situation. Therefore, it’s possible for a user to be granted select permission on a table while having update or write permission on the corresponding directory and files within HDFS through that user’s operating system user or group. Also, Hive has no way of providing permissions for specific parts of table data or partial table data. There is no way to provide column-level permissions, define views (for finer data access control), or define server-level roles.
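To see how easily such a mismatch can arise, consider that nothing stops an HDFS administrator from loosening permissions on a table’s directory directly; Hive’s Metastore never learns about the change. The following sketch assumes the table’s data lives under the default warehouse path (the directory name here is illustrative):

> sudo -u hdfs hdfs dfs -chmod -R 777 /user/hive/warehouse/ticket_details

After this command, any operating system user can overwrite the table’s files, regardless of the select-only privileges Hive itself has granted.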

Sentry addresses some of these issues. It provides roles at the server, database, and table level and can work with Hive external tables, which you can use for partial data access control for users.

Figure 5-8 illustrates Hive’s architecture and where it fits in respect to HDFS.


Figure 5-8. Hive architecture and its authorization

Sentry Architecture

A security module that integrates with Hive and Impala, Sentry offers advanced authorization controls that enable more secure access to HDFS data. We will focus on Sentry integration with Hive (since it is used more extensively). Sentry uses rules to specify precise permissions for a database object and roles to combine or consolidate the rules, thereby making it easy to group permissions for different database objects while offering flexibility to create rules for various types of permissions (such as select or insert).

Creating Rules and Roles for Sentry

Using a rule, Sentry gives you precise control over user access to subsets of data within a database, a schema, or a table. For example, if a database db1 has a table called Employee, then a rule providing access for Insert can be:

server=MyServer->db=db1->table=Employee->action=Insert

A role is a set of rules to access Hive objects. Individual rules are comma separated and grouped to form a role.

For example, the Employee_Maint role can be specified as:

Employee_Maint = server=MyServer->db=db1->table=Employee->action=Insert,
server=MyServer->db=db1->table=Employee_Dept->action=Insert,
server=MyServer->db=db1->table=Employee_salary->action=Insert

Here, the Employee_Maint role enables any user (who has the role) to insert rows within tables Employee, Employee_Dept, and Employee_salary.

Role-based authorization simplifies managing permissions since administrators can create templates for groupings of privileges based on functional roles within their organizations.

Multidepartment administration empowers central administrators to deputize individual administrators to manage security settings for each separate database or schema using database-level roles. For example, in the following code, the DB2_Admin role authorizes all permissions for database db2 and Svr_Admin authorizes all permissions for server MyServer:

DB2_Admin = server=MyServer->db=db2
Svr_Admin = server=MyServer

Creating rules and roles within Sentry is only the first step. Roles need to be assigned to users and groups if you want to use them. How does Sentry identify users and groups? The next section explains this.

Understanding Users and Groups within Sentry

A user is someone authenticated by the authentication subsystem and permitted to access the Hive service. Because the example assumes Kerberos is being used, a user will be a Kerberos principal. A group is a set of one or more users that have been granted one or more authorization roles. Sentry currently supports HDFS-backed groups and locally configured groups (in the configuration file policy.xml). For example, consider the following entry in policy.xml:

Supervisor = Employee_Maint, DB2_Admin

If Supervisor is a HDFS-backed group, then all the users belonging to this group can execute any HiveQL statements permitted by the roles Employee_Maint and DB2_Admin. However, if Supervisor is a local group, then users belonging to this group (call them ARoberts and MHolding) have to be defined in the file policy.xml:

[users]
ARoberts = Supervisor
MHolding = Supervisor

Figure 5-9 demonstrates where Sentry fits in the Hadoop architecture with Kerberos authentication.


Figure 5-9. Hadoop authorization with Sentry

To summarize, after reviewing Hive and Sentry architectures, you gained an understanding of the scope of security that each offers. You had a brief look at setting up rules, roles, users, and groups. So, you are now ready to reimplement the ticketing system (using Sentry) defined in the earlier sections of this chapter.

Implementing Roles

Before reimplementing the ticketing system with the appropriate rules, roles, users, and groups, take a moment to review its functional requirements. A ticket is created by the police officer who issues the ticket. Ticket data is stored in a database at a local police station and needs to have modification rights for all police officers. The IT department located at police headquarters needs read permission on this data for reporting purposes. Some of the ticket data is shared by the judicial system, and they need read as well as modifying rights to parts of data, because data is modified after a ticket is processed through traffic court. Last, certain parts of this data need to be shared with reporting agencies that provide this data to insurance companies, credit bureaus, and other national agencies as required. Table 5-3 summarizes the requirements; for additional detail, consult Figure 5-7.

Table 5-3. Permission Details for Groups and Entities

Table               Group 1       Group 2   Group 3       Group 4
Driver_details      Read/write    Read      Read          None
Ticket_details      Read/write    Read      Read          None
Judgement_details   Read          Read      Read/write    Read

The original implementation using HDFS file permissions was easy but did not consider the following issues:

  • When a ticket gets created, a judiciary record (a case) is created automatically with the parent ticket_id (indicating what ticket this case is for) and case details. The police officer should have rights to insert this record in the Judgement_details table with ticket details, but shouldn’t be allowed to modify columns for judgment and other case details. File permissions aren’t flexible enough to implement this requirement.
  • The judge (assigned for a case) should have modification rights for columns with case details, but shouldn’t have modification rights to columns with ticket details. Again, file permissions can’t handle this.

To implement these requirements, you need Sentry (or its equivalent). Then, using Hive, you need to create external tables with relevant columns (the columns where judiciary staff or police officers need write access) and provide write access for the appropriate departments to those external tables instead of Ticket_details and Judgement_details tables.

For this example, assume that the cluster (used for implementation) is running CDH4.3.0 (Cloudera Hadoop distribution 4.3.0) or later and has HiveServer2 with Kerberos authentication installed.

As a first step, you need to make a few configuration changes. Change ownership of the Hive warehouse directory (/user/hive/warehouse or any other path specified as value for property hive.metastore.warehouse.dir in Hive configuration file hive-site.xml) to the user hive and group hive. Set permissions on the warehouse directory as 770 (rwxrwx---), meaning read, write, and execute permissions for owner and group; but no permissions for others or users not belonging to the group hive. You can set the property hive.warehouse.subdir.inherit.perms to true in hive-site.xml, to make sure that permissions on the subdirectories will be set to 770 as well. Next, change the property hive.server2.enable.doAs to false. This will execute all queries as the user running service Hiveserver2. Last, set the property min.user.id to 0 in configuration file taskcontroller.cfg. This is to ensure that the hive user can submit MapReduce jobs.
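A minimal sketch of the ownership and permission changes, assuming the default warehouse path, looks like this (the remaining properties are edited in hive-site.xml and taskcontroller.cfg as just described):

[root@sandbox ~]# sudo -u hdfs hdfs dfs -chown -R hive:hive /user/hive/warehouse
[root@sandbox ~]# sudo -u hdfs hdfs dfs -chmod -R 770 /user/hive/warehouse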

Having made these configuration changes, you’re ready to design the necessary tables, rules, roles, users, and groups.

Designing Tables

You will need to create the tables Driver_details, Ticket_details, and Judgement_details, as well as an external table, Judgement_details_PO, as follows:

CREATE TABLE Driver_details (SocialSecNum STRING,
LastName STRING,
FirstName STRING,
Address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
Phone BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION "/Driver_details";

CREATE TABLE Ticket_details (TicketId BIGINT,
DriverSSN STRING,
Offense STRING,
IssuingOfficer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION "/Ticket_details";

CREATE TABLE  Judgement_details (CaseID BIGINT,
TicketId BIGINT,
DriverSSN STRING,
CaseDate STRING,
Judge STRING,
Judgement STRING,
TPaymentDetails STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION "/Judgement_details";

CREATE EXTERNAL TABLE  Judgement_details_PO (CaseID BIGINT,
TicketId BIGINT,
DriverSSN STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION "/user/hive/warehouse/Judgement_details";

If you refer to Figure 5-7, you will observe that I am using the same columns (as we have in the Hadoop files, or tables) to create these tables and just substituting the data type as necessary (e.g., LastName is a character string, or data type STRING, but TicketId is a big integer, or BIGINT). The last table, Judgement_details_PO, is created as a Hive external table, meaning Hive only manages metadata for this table and not the actual data file. I created this table as an external table with the first three columns of the table Judgement_details because I need certain users to have permission to modify these columns only, not the other columns in that table.

Designing Rules

I need to design rules to provide the security required to implement the ticketing system. The example has four tables, and various roles are going to need Read (Select) or Modify (Insert) rights, because there are no “updates” for Hive or HDFS data. I will simply append (or Insert) the new version of a record. So, here are the rules:

server=MyServer->db=db1->table=Driver_details->action=Insert
server=MyServer->db=db1->table=Ticket_details->action=Insert
server=MyServer->db=db1->table=Judgement_details->action=Insert
server=MyServer->db=db1->table=Judgement_details_PO->action=Insert
server=MyServer->db=db1->table=Driver_details->action=Select
server=MyServer->db=db1->table=Ticket_details->action=Select
server=MyServer->db=db1->table=Judgement_details->action=Select

These rules simply allow Select or Insert actions on the appropriate tables.

Designing Roles

Let’s design roles using the rules we created. The first role is for all police officers:

PO_role = server=MyServer->db=db1->table=Driver_details->action=Insert,
server=MyServer->db=db1->table=Driver_details->action=Select,
server=MyServer->db=db1->table=Ticket_details->action=Insert,
server=MyServer->db=db1->table=Ticket_details->action=Select,
server=MyServer->db=db1->table=Judgement_details->action=Select,
server=MyServer->db=db1->table=Judgement_details_PO->action=Insert

Notice that this role allows all the police officers to have read/write permissions to the tables Driver_details and Ticket_details, but only read permission to Judgement_details. The reason is that police officers shouldn’t have permission to change the details of a judgment. You will also observe that police officers have write permission to Judgement_details_PO; that is so they can correct the first three columns (which don’t hold any judicial information) in case there is an error.

The next role is for employees working at the IT department:

IT_role = server=MyServer->db=db1->table=Driver_details->action=Select,
server=MyServer->db=db1->table=Ticket_details->action=Select,
server=MyServer->db=db1->table=Judgement_details->action=Select

The IT employees have only read permissions on all the tables because they are not allowed to modify any data.

The role for Judiciary is as follows:

JU_role = server=MyServer->db=db1->table=Judgement_details->action=Insert,
server=MyServer->db=db1->table=Driver_details->action=Select,
server=MyServer->db=db1->table=Ticket_details->action=Select

The judiciary has read permissions for driver and ticket data (because they are not supposed to modify it) but write permission to enter the judicial data because only they are allowed to modify it.

Last, for the Reporting agencies the role is simple:

RP_role = server=MyServer->db=db1->table=Judgement_details->action=Select

The Reporting agencies have read permissions on the Judgement_details table only because they are allowed to report the judgement. All other data is confidential and they don’t have any permissions on it.

Setting Up Configuration Files

I have to set up the various configuration files for Sentry to incorporate the roles set up earlier. The first file is sentry-provider.ini, which defines per-database policy files (with their locations), any server-level or database-level roles, and the Hadoop groups with their assigned (server-level or database-level) roles. Here’s how sentry-provider.ini looks for our example:

[databases]
# Defines the location of the per DB policy file for the db1 DB/schema
db1 = hdfs://Master:8020/etc/sentry/db1.ini

[groups]
# Assigns each Hadoop group to its set of roles
db1_admin = db1_admin_role
admin = admin_role

[roles]
# Implies everything on MyServer -> db1. Privileges for
# db1 can be defined in the global policy file even though
# db1 has its own policy file. Note that the privileges from
# both the global policy file and the per-DB policy file
# are merged. There is no overriding.
db1_admin_role = server=MyServer->db=db1

# Implies everything on MyServer
admin_role = server=MyServer

In the example’s case, there is a specific policy file for database db1 (db1.ini), defined along with its location. Administrator roles for the server and for database db1 are defined (admin_role, db1_admin_role), and the appropriate Hadoop groups (admin, db1_admin) are assigned to those administrator roles.

The next file is db1.ini. It is the per-database policy file for database db1:

[groups]
POfficers = PO_role
ITD = IT_role
Judiciary = JU_role
Reporting = RP_role

[roles]
PO_role = server=MyServer->db=db1->table=Driver_details->action=Insert,
server=MyServer->db=db1->table=Driver_details->action=Select,
server=MyServer->db=db1->table=Ticket_details->action=Insert,
server=MyServer->db=db1->table=Ticket_details->action=Select,
server=MyServer->db=db1->table=Judgement_details->action=Select,
server=MyServer->db=db1->table=Judgement_details_PO->action=Insert

IT_role = server=MyServer->db=db1->table=Driver_details->action=Select,
server=MyServer->db=db1->table=Ticket_details->action=Select,
server=MyServer->db=db1->table=Judgement_details->action=Select

JU_role = server=MyServer->db=db1->table=Judgement_details->action=Insert,
server=MyServer->db=db1->table=Driver_details->action=Select,
server=MyServer->db=db1->table=Ticket_details->action=Select

RP_role = server=MyServer->db=db1->table=Judgement_details->action=Select

Notice that I have defined all the roles (designed earlier) in the roles section, while the groups section maps Hadoop groups to the defined roles. I previously set up the Hadoop groups POfficers and ITD; I will need to set up two additional groups (Judiciary and Reporting) because I mapped roles to them in the db1.ini file.
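Creating the two new groups mirrors the earlier groupadd commands; for example, on the host whose group mappings Sentry consults (the relevant judiciary and reporting user accounts would then be added to these groups):

[root@sandbox ~]# groupadd Judiciary
[root@sandbox ~]# groupadd Reporting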

The last step is setting up Sentry configuration file sentry-site.xml:

<configuration>
  <property>
    <name>hive.sentry.provider</name>
    <value>org.apache.sentry.provider.file.HadoopGroupResourceAuthorizationProvider</value>
  </property>

  <property>
    <name>hive.sentry.provider.resource</name>
    <value>hdfs://Master:8020/etc/sentry/sentry-provider.ini</value>
  </property>

  <property>
    <name>hive.sentry.server</name>
    <value>MyServer</value>
  </property>
</configuration>

Last, to enable Sentry, we need to add the following properties to hive-site.xml:

<property>
<name>hive.server2.session.hook</name>
<value>org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook</value>
</property>

<property>
<name>hive.sentry.conf.url</name>
<value>hdfs://Master:8020/etc/sentry-site.xml</value>
</property>

This concludes reimplementation of the ticketing system example using Apache Sentry. It was possible to specify the correct level of authorization for our ticketing system because Sentry allows us to define rules and roles that limit access to data as necessary. Without this flexibility, either too much access would be assigned or no access would be possible.

Summary

One of the few applications that offers role-based authorization for Hadoop data, Sentry is a relatively new release and still in its nascent state. Even so, it offers a good start in implementing role-based security, albeit nowhere close to the type of security an established relational database technology offers. True, Sentry has a long way to go in offering anything comparable to Oracle or Microsoft SQL Server, but currently it’s one of the few options available. That’s also the reason why the best practice is to supplement Sentry capabilities with some of Hive’s features!

You can use Hive to supplement and extend Sentry’s functionality. For example, in the ticketing example, I used the external table feature of Hive to create a role that provided write permission on only some columns of the table. Sentry by itself is not capable of offering partial write permission on a table, but you can use it in combination with Hive to offer such a permission. I encourage you to study other useful Hive features and create your own roles that can extend Sentry’s functionality. The Apache documentation at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL provides many useful suggestions.

Last, the chapter’s ticketing example proved that you can provide partial data access (a number of columns, starting from the first column) to a role by defining an external table in Hive. Interestingly, you can’t provide access to only some columns in the middle of a table (e.g., columns four to eight) using Sentry. Of course, there are other ways of implementing such a request using features that Hive provides!
