CHAPTER 3


Introducing Hadoop Security

We live in a very insecure world. From the key to your home's front door to those all-important virtual keys, your passwords, everything needs to be secured, and secured well. In the world of Big Data, where humongous amounts of data are processed, transformed, and stored, securing your data is all the more important.

A few years back, the London Police arrested a group of young people for fraud and theft of digital assets worth $30 million. Their 20-year-old leader had used the Zeus Trojan, software designed to steal banking information, from his laptop to commit the crime. Incidents like these are commonplace because of the large amount of information and the myriad systems involved in even simple business transactions. In the past, there were probably only thousands of people who could potentially access your data to commit a crime against you; now, with the advent of the Internet, there are potentially billions! Likewise, before Big Data existed, only direct access to specific data on specific systems was a danger; now, Big Data multiplies the places such information is stored and hence provides more ways to compromise your privacy, or worse. Everything in the new technology-driven, Internet-powered world has been scaled up and scaled out, crime and the potential for crime included.

Imagine that your company spent a couple of million dollars installing a Hadoop cluster to gather and analyze your customers' spending habits for a product category using a Big Data solution. Because that solution was not secure, your competitor got access to that data, and your sales for that product category dropped 20%. How did the system allow unauthorized access to the data? Wasn't there any authentication mechanism in place? Why were there no alerts? This scenario should make you think about the importance of security, especially where sensitive data is involved.

Although Hadoop does have inherent security concerns due to its distributed architecture (as you saw in Chapter 2), the situation described is extremely unlikely to occur on a Hadoop installation that’s managed securely. A Hadoop installation that has clearly defined user roles and multiple levels of authentication (and encryption) for sensitive data will not let any unauthorized access go through.

This chapter serves as a roadmap for the rest of the book. It provides a brief overview of each of the techniques you need to implement to secure your Hadoop installation; later chapters will then cover the topics in more detail. The purpose is to provide a quick overview of the security options and also help you locate relevant techniques quickly as needed. I start with authentication (using Kerberos), move on to authorization (using Hadoop ACLs and Apache Sentry), and then discuss secure administration (audit logging and monitoring). Last, the chapter examines encryption for Hadoop and available options. I have used open source software wherever possible, so you can easily build your own Hadoop cluster to try out some of the techniques described in this book.

As the foundation for all that, however, you need to understand the way Hadoop was developed and also a little about the Hadoop architecture. Armed with this background information, you will better understand the authentication and authorization techniques discussed later in the chapter.

Starting with Hadoop Security

When talking about Hadoop security, you have to consider how Hadoop was conceptualized. When Doug Cutting and Mike Cafarella started developing Hadoop, security was not exactly the priority. I am certain it was not even considered as part of the initial design. Hadoop was meant to process large amounts of web data in the public domain, and hence security was not the focus of development. That’s why it lacked a security model and only provided basic authentication for HDFS—which was not very useful, since it was extremely easy to impersonate another user.

Another issue is that Hadoop was not designed and developed as a cohesive system with predefined modules, but rather as a collage of modules that either correspond to various open source projects or are (proprietary) extensions developed by various vendors to supplement functionality lacking within the Hadoop ecosystem.

Therefore, Hadoop assumes that its cluster operates in the cocoon of an isolated, trusted environment, free of security violations, and that isolation is lacking most of the time. Right now, Hadoop is transitioning from an experimental or emerging technology to enterprise-level, corporate use, and these new users need a way to secure sensitive business data.

Currently, the standard community-supported way of securing a Hadoop cluster is to use Kerberos. Hadoop and its major components now fully support Kerberos authentication, but that merely adds a layer of authentication. With only Kerberos in place, there is still no consistent, built-in way to define user roles for finer control across components, to secure access to Hadoop processes (daemons), or to encrypt data in transit (or at rest). A secure system needs to address all of these issues and also offer more features and ways to customize security for specific needs. Throughout the book, you will learn how to use these techniques with Hadoop. For now, let's start with a brief look at a popular solution to Hadoop's authentication problem.

Introducing Authentication and Authorization for HDFS

The first and most important consideration for security is authentication. A user needs to be authenticated before being allowed to access the Hadoop cluster. Because Hadoop doesn't perform any secure authentication of its own, Kerberos is often used with Hadoop to provide it.

When Kerberos is implemented for security, a client (who is trying to access the Hadoop cluster) contacts the KDC (the central Kerberos server that hosts the credential database) and requests access. If the provided credentials are valid, the KDC grants the requested access. The Kerberos authentication process can be divided into three main steps (a brief configuration sketch follows the list):

  1. TGT generation, where the Authentication Server (AS) grants the client a Ticket Granting Ticket (TGT) as an authentication token. A client can use the same TGT for multiple TGS requests (until the TGT expires).
  2. TGS session ticket generation, where the client uses its credentials to decrypt the TGT and then uses the TGT to obtain a service ticket from the Ticket Granting Server (TGS), which grants access to the Hadoop cluster.
  3. Service access, where the client uses the service ticket to authenticate to and access the Hadoop cluster.
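To make the flow concrete, here is a minimal sketch of what the client side looks like once Kerberos is enabled. It assumes a working KDC, a principal already created for the user, and placeholder names (alice, EXAMPLE.COM, /user/alice) that you would replace with your own; Chapter 4 covers the real setup.

# Step 1: obtain a TGT from the KDC; steps 2 and 3 happen transparently
# when a Hadoop command requests and presents a service ticket.
kinit alice@EXAMPLE.COM
klist                      # displays the cached TGT (and any service tickets)
hdfs dfs -ls /user/alice   # succeeds only with a valid ticket once security is on

On the server side, Hadoop is switched from its default "simple" authentication to Kerberos in core-site.xml (the property names below follow the Apache documentation; your distribution's management tooling may set them for you):

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>  <!-- also enable service-level authorization checks -->
</property>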

Chapter 4 discusses the details of Kerberos architecture and also how Kerberos can be configured to be used with Hadoop. In addition, you’ll find a step-by-step tutorial that will help you in setting up Kerberos to provide authentication for your Hadoop cluster.

Authorization

When implementing security, your next step is authorization. Specifically, how can you implement fine-grained authorization and roles in Hadoop? The biggest issue is that all information is stored in files, just as on a Linux host (HDFS is, after all, a file system). There is no concept of a table (as in a relational database), and that makes it harder to authorize a user for partial access to the stored data.

Whether you call it defining details of authorization, designing fine-grained authorization, or “fine tuning” security, it’s a multistep process. The steps are:

  1. Analyze your environment,
  2. Classify data for access,
  3. Determine who needs access to what data,
  4. Determine the level of necessary access, and
  5. Implement your designed security model.

However, you have to remember that Hadoop (and its distributed file system) stores all its data in files, and hence there are limits to the granularity of security you can design. Like Unix or Linux, Hadoop uses a permissions model very similar to the POSIX (Portable Operating System Interface) model, and it's easy to mistake those permissions for Linux permissions, so the permission granularity is limited to read or write permissions on files or directories. You might say, "What's the problem? My Oracle or PostgreSQL database stores data on disk in files; why is Hadoop different?" Well, with the traditional database security model, all access is managed through clearly defined roles and channeled through a central server process. In contrast, with data files stored within HDFS, there is no such central process, and multiple services such as Hive or HBase can access HDFS files directly.

To give you a detailed understanding of the authorization possible using file/directory-based permissions, Chapter 5 discusses the concepts, explains the logical process, and also provides a detailed real-world example. For now, another real-world example, this one of authorization, will help you understand the concept better.

Real-World Example for Designing Hadoop Authorization

Suppose you are designing security for an insurance company’s claims management system, and you have to assign roles and design fine-grained access for all the departments accessing this data. For this example, consider the functional requirements of two departments: the call center and claims adjustors.

Call center representatives answer calls from customers and then file or record claims if they satisfy all the stipulated conditions (e.g. damages resulting from “acts of God” do not qualify for claims and hence a claim can’t be filed for them).

A claims adjustor looks at the filed claims and rejects those that violate any regulatory conditions. That adjustor then submits the rest of the claims for investigation, assigning them to specialist adjustors. These adjustors evaluate the claims based on company regulations and their specific functional knowledge to decide the final outcome.

Automated reporting programs pick up claims tagged with a final status “adjusted” and generate appropriate letters to be mailed to the customers, informing them of the claim outcome. Figure 3-1 summarizes the system.


Figure 3-1. Claims data and access needed by various departments

As you can see, call center representatives will need to append claim data and adjustors will need to modify data. Since HDFS doesn’t have a provision for updates or deletes, adjustors will simply need to append a new record or row (for a claim and its data) with updated data and a new version number. A scheduled process will need to generate a report to look for adjusted claims and mail the final claim outcome to the customers. That process, therefore, will need read access to the claims data.

In Hadoop, remember, data is stored in files. For this example, data is stored in a file called Claims. Daily data is stored temporarily in a file called Claims_today and appended to the Claims file on a nightly basis. The call center representatives use the group ccenter, while the claims adjustors use the group claims, meaning the HDFS permissions on Claims and Claims_today look like those shown in Figure 3-2.


Figure 3-2. HDFS file permissions

The first file, Claims_today, has write permissions for the owner and the group ccenter. So, all the representatives belonging to this group can write or append to this file.

The second file, Claims, has read and write permissions for the owner and the group claims. So, all the claims adjustors can read the Claims data and append new rows for the claims they have completed work on and for which they are providing a final outcome. Also, notice that you will need to create a user named Reports within the group claims to access the data for reporting.
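As a rough sketch of how an administrator might set this up from the command line, assume the files live under a hypothetical /data/claims directory, the ccenter and claims groups already exist, and the files are owned by a hypothetical claimsapp user; the exact mode bits should match whatever Figure 3-2 prescribes for your environment.

# Call center representatives (group ccenter) can write/append to today's file
hdfs dfs -chown claimsapp:ccenter /data/claims/Claims_today
hdfs dfs -chmod 660 /data/claims/Claims_today     # rw for owner and group only

# Claims adjustors (group claims, which also contains the Reports user)
# can read the master file and append adjusted claims to it
hdfs dfs -chown claimsapp:claims /data/claims/Claims
hdfs dfs -chmod 660 /data/claims/Claims           # rw for owner and group only

hdfs dfs -ls /data/claims                         # verify the resulting permissions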

Note  The permissions discussed in the example are HDFS permissions and not the operating system permissions. Hadoop follows a separate permissions model that appears to be the same as Linux, but the preceding permissions exist within HDFS—not Linux.

So, do these permissions satisfy all the functional needs for this system? You can verify easily that they do. Of course the user Reports has write permissions that he doesn’t need; but other than that, all functional requirements are satisfied.

We will discuss this topic with a more detailed example in Chapter 5. As you have observed, the permissions you assigned were limited to complete data files. However, in the real world, you may need your permissions granular enough to access only parts of data files. How do you achieve that? The next section previews how.

Fine-Grained Authorization for Hadoop

Sometimes the necessary permissions for data don’t match the existing group structure for an organization. For example, a bank may need a backup supervisor to have the same set of permissions as a supervisor, just in case the supervisor is on vacation or out sick. Because the backup supervisor might only need a subset of the supervisor’s permissions, it is not practical to design a new group for him or her. Also, consider another situation where corporate accounts are being moved to a different department, and the group that’s responsible for migration needs temporary access.

Newer versions of HDFS support access control list (ACL) functionality, which is very useful in such situations. With ACLs you can specify read/write permissions for specific users or groups as needed. In the bank example, if the backup supervisor needs write permission to a specific "personal accounts" file, the HDFS ACL feature can provide that write permission without any other changes to the file's permissions. For the migration scenario, the group performing the migration can be assigned read/write permissions using an HDFS ACL. If you are familiar with POSIX ACLs, you'll find that HDFS ACLs work the same way. Chapter 5 discusses Hadoop ACLs in detail in the "Access Control Lists for HDFS" section.
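For instance, granting the backup supervisor write access to a single file without touching its base permissions could look like the following sketch. The user, group, and file names are hypothetical, and ACLs must first be enabled on the NameNode by setting dfs.namenode.acls.enabled to true.

# Give user bsmith (the backup supervisor) read/write access to one file only
hdfs dfs -setfacl -m user:bsmith:rw- /bank/personal_accounts

# Give the migration team's group temporary read/write access
hdfs dfs -setfacl -m group:migration:rw- /bank/corporate_accounts

# Inspect the resulting ACL entries
hdfs dfs -getfacl /bank/personal_accounts

# Remove the extra entry once it is no longer needed
hdfs dfs -setfacl -x user:bsmith /bank/personal_accounts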

Last, how do you configure permissions for only part of a data file or a certain portion of the data? Maybe a user needs access to nonsensitive information only. The only way to configure further granularity (for authorization) is to overlay table structure on the data with a framework such as Hive and use specialized software such as Apache Sentry. You can define parts of file data as tables within Hive and then use Sentry to configure permissions on those tables. Sentry works with users and groups of users (called groups) and lets you define rules (possible actions on tables, such as read or write) and roles (a group of rules). A user or group can have one or multiple roles assigned to them. Chapter 5 provides a real-world example using Hive and Sentry that explains how fine-tuned authorization can be defined for Hadoop; the section "Role-Based Authorization with Apache Sentry" in Chapter 5 also covers Apache Sentry's architecture.
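As a taste of what that looks like, roles are typically defined with SQL-like statements issued through Hive (or Impala) with Sentry enabled. In the sketch below, the database, table, role, and group names are made up, and exact syntax varies slightly between Sentry versions.

-- Create a role and grant it read-only access to one Hive table
CREATE ROLE claims_readonly;
GRANT SELECT ON TABLE claims_db.claims TO ROLE claims_readonly;

-- Attach the role to an existing group; every user in that group inherits it
GRANT ROLE claims_readonly TO GROUP reports;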

Securely Administering HDFS

Chapters 4 and 5 will walk you through various techniques of authentication and authorization, which help secure your system but are not a total solution. What if authorized users access resources that they are not authorized to use, or unauthorized users access resources on a Hadoop cluster using unforeseen methods (read: hacking)? Secure administration helps you deal with these scenarios by monitoring or auditing all access to your cluster. If you can’t stop this type of access, you at least need to know it occurred! Hadoop offers extensive logging for all its processes (also called daemons), and several open source tools can assist in monitoring a cluster. (Chapters 6 and 7 discuss audit logging and monitoring in detail.)

Securely administering HDFS presents a number of challenges, due to the design of HDFS and the way it is structured. Monitoring can help with security by alerting you of unauthorized access to any Hadoop cluster resources. You then can design countermeasures for malicious attacks based on the severity of these alerts. Although Hadoop provides metrics for this monitoring, they are cumbersome to use. Monitoring is much easier when you use specialized software such as Nagios or Ganglia. Also, standard Hadoop distributions by Cloudera and Hortonworks provide their own monitoring modules. Last, you can capture and monitor MapReduce counters.

Audit logs supplement security by recording all access that flows through to your Hadoop cluster. You can decide the level of logging (such as errors only, or errors and warnings, etc.), and the advanced log management offered by modules like Log4j gives you a lot of control and flexibility over the logging process. Chapter 6 provides a detailed overview (with an example) of the audit logging available with Hadoop. As a preview, the next section offers a brief overview of Hadoop logging.

Using Hadoop Logging for Security

When a security issue occurs, having extensive activity logs available can help you investigate the problem. Before a breach occurs, therefore, you should enable audit logging to track all access to your system. You can always filter out information that’s not needed. Even if you have enabled authentication and authorization, auditing cluster activity still has benefits. After all, even authorized users may perform tasks they are not authorized to do; for example, a user with update permissions could update an entry without appropriate approvals. You have to remember, however, that Hadoop logs are raw output. So, to make them useful to a security administrator, tools to ingest and process these logs are required (note that some installations use Hadoop itself to analyze the audit logs, so you can use Hadoop to protect Hadoop!).

Just capturing the auditing data is not enough; you need to capture Hadoop daemon data as well. Businesses subject to federal oversight laws, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Sarbanes-Oxley Act (SOX), illustrate this need. For example, US law requires that all businesses covered by HIPAA prevent unauthorized access to "Protected Health Information" (patients' names, addresses, and all information pertaining to the patients' health and payment records) and audit access to it. Businesses that must comply with SOX (a 2002 US federal law that requires the top management of any US public company to individually certify the accuracy of their firm's financial information) must audit all access to any data object (e.g., table) within an application. They also must monitor who submitted, managed, or viewed a job that can change any data within an audited application. For business cases like these, you need to capture the following (a configuration sketch follows the list):

  • HDFS audit logs (to record all HDFS access activity within Hadoop),
  • MapReduce audit logs (record all submitted job activity), and
  • Hadoop daemon log files for NameNode, DataNode, JobTracker and TaskTracker.
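As a preview of Chapter 6, the sketch below shows roughly how HDFS audit logging is enabled in log4j.properties. The property and appender names follow the stock Apache Hadoop configuration, but your distribution's file may differ, so treat this as illustrative rather than definitive.

# Route HDFS audit events to their own rolling log file instead of discarding them
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false

log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.MaxFileSize=256MB
log4j.appender.RFAAUDIT.MaxBackupIndex=20
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n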

The Log4j API is at the heart of Hadoop logging, be it audit logs or Hadoop daemon logs. The Log4j module provides extensive logging capabilities and defines several logging levels that you can use to limit, or suppress, messages based on their severity. For example, if the Log4j logging level is defined as INFO for NameNode logging, then an event is written to the NameNode log for any file access request the NameNode receives (i.e., all informational messages are written to the NameNode log file).

You can easily change the logging level for a Hadoop daemon through its web interface. For example, the page http://jobtracker-host:50030/logLevel lets you change a logger's level while that daemon is running, but the change is reset when the daemon is restarted. If you encounter a problem, you can temporarily change the logging level for the appropriate daemon to facilitate debugging and then reset it when the problem is resolved. For a permanent change to a daemon's log level, you need to change the corresponding property in the Log4j configuration file (log4j.properties).
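The same temporary change can also be made from the command line with the hadoop daemonlog utility, as in the sketch below; the host name is a placeholder, and the logger (class) names and ports vary with your Hadoop version.

# Check, then temporarily change, the level of the JobTracker's logger
hadoop daemonlog -getlevel jobtracker-host:50030 org.apache.hadoop.mapred.JobTracker
hadoop daemonlog -setlevel jobtracker-host:50030 org.apache.hadoop.mapred.JobTracker DEBUG

# Permanent equivalent: add this line to log4j.properties and restart the daemon
# log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG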

The Log4j architecture uses a logger (a named channel for log events, such as NameNode or JobTracker), an appender (to which a log event is forwarded and which is responsible for writing it to the console or a file), and a layout (a formatter for log events). The logging levels—FATAL, ERROR, WARN, INFO, DEBUG, and TRACE—indicate the severity of events in descending order. The minimum log level acts as a filter: log events with a level greater than or equal to the specified level are accepted, while less severe events are simply discarded.

Figure 3-3 demonstrates how level filtering works. The columns show the logging levels, while the rows show the level associated with the appropriate logger configuration. The intersection identifies whether the event would be allowed to pass for further processing (YES) or discarded (NO). Using Figure 3-3 you can easily determine what category of events will be included in the logs, depending on the logging level configured. For example, if the logging level for the NameNode is set to INFO, then all messages belonging to the categories INFO, WARN, ERROR, and FATAL will be written to the NameNode log file. You can verify this by looking at the INFO column and observing the event levels marked YES; the levels TRACE and DEBUG are marked NO and will be filtered out. If the logging level for the JobTracker is set to FATAL, then only FATAL errors will be logged, as is obvious from the values in the FATAL column.


Figure 3-3. Log4j logging levels and inclusions based on event levels

Chapter 6 will cover Hadoop logging (as well as its use in investigating security issues) comprehensively. You’ll get to know the main features of monitoring in the next section.

Monitoring for Security

When you think of monitoring, you probably think about possible performance issues that need troubleshooting or, perhaps, alerts that can be generated if a system resource (such as CPU, memory, disk space) hits a threshold value. You can, however, use monitoring for security purposes as well. For example, you can generate alerts if a user tries to access cluster metadata or reads/writes a file that contains sensitive data, or if a job tries to access data it shouldn’t. More importantly, you can monitor a number of metrics to gain useful security information.

It is more challenging to monitor a distributed system like Hadoop because the monitoring software has to monitor individual hosts and then consolidate that data in the context of the whole system. For example, CPU consumption on a DataNode is not as important as CPU consumption on the NameNode. So, how will the system process CPU consumption alerts, or identify separate threshold levels for hosts with different roles within the distributed system? Chapter 7 answers these questions in detail, but for now let's have a look at the Hadoop metrics you can use for security purposes (a sample configuration for exporting them follows the list):

  • Activity statistics on the NameNode
  • Activity statistics for a DataNode
  • Detailed RPC information for a service
  • Health monitoring for sudden change in system resources
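Hadoop exposes these metrics through its metrics2 subsystem, and they can be pushed to a monitoring system such as Ganglia (introduced below). A minimal hadoop-metrics2.properties sketch might look like the following; the collector host name is a placeholder, and the sink class shown is the one intended for Ganglia 3.1 and later.

# hadoop-metrics2.properties: publish NameNode and DataNode metrics every 10 seconds
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
namenode.sink.ganglia.servers=gmetad-host:8649
datanode.sink.ganglia.servers=gmetad-host:8649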

Tools of the Trade

The leading monitoring tools are Ganglia (http://ganglia.sourceforge.net) and Nagios (www.nagios.org). These popular open source tools complement each other, and each has different strengths. Ganglia focuses on gathering metrics and tracking them over time, while Nagios focuses more on being an alerting mechanism. Because gathering metrics and alerting are both essential aspects of monitoring, the tools work best in conjunction. Both Ganglia and Nagios run agents on every host in the cluster to gather information.

Ganglia

Conceptualized at the University of California, Berkeley, Ganglia is an open source monitoring project meant to be used with large distributed systems. Each host that’s part of the cluster runs a daemon process called gmond that collects and sends the metrics (like CPU usage, memory usage, etc.) from the operating system to a central host. After receiving all the metrics, the central host can display, aggregate, or summarize them for further use.

Ganglia is designed to integrate easily with other applications and gather statistics about their operations. For example, Ganglia can easily receive output data from Hadoop metrics and use it effectively. Gmond (which Ganglia has running on every host) has a very small footprint and hence can easily be run on every machine in the cluster without affecting user performance.
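As a small illustration of that layout (a sketch only, with placeholder cluster and host names), each node's gmond.conf tells the daemon which cluster it belongs to and where to send its metrics:

# gmond.conf (excerpt): identify the cluster and forward metrics to the collector
cluster {
  name  = "hadoop-prod"
  owner = "ops"
}
udp_send_channel {
  host = gmetad-host.example.com
  port = 8649
  ttl  = 1
}
udp_recv_channel {
  port = 8649
}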

Ganglia's web interface (Figure 3-4) shows you the hardware used for the cluster, the cluster load in the last hour, CPU and memory resource consumption, and so on. You can look at summary usage for the last hour, day, week, or month as you need, and you can drill down into the details of any of these resource usages as necessary. Chapter 7 will discuss Ganglia in greater detail.


Figure 3-4. Ganglia monitoring system: Cluster overview

Nagios

Nagios provides a very good alerting mechanism and can use the metrics gathered by Ganglia. Earlier versions of Nagios polled information from their target hosts, but current versions use plug-ins that run as agents on the hosts that are part of the cluster. Nagios has an excellent built-in notification system and can be used to deliver alerts via pages or e-mails for certain events (e.g., NameNode failure or a full disk). Nagios can monitor applications, services, servers, and network infrastructure. Figure 3-5 shows the Nagios web interface, which can easily manage status (of monitored resources), alerts (defined on resources), notifications, history, and so forth.


Figure 3-5. Nagios web interface for monitoring

The real strength of Nagios is the hundreds of user-developed plug-ins that are freely available to use. Plug-ins are available in all categories. For example, the System Metrics category contains the subcategory Users, which contains plug-ins such as Show Users that can alert you when certain users either log in or don’t. Using these plug-ins can cut down valuable customization time, which is a major issue for all open source (and non–open source) software. Chapter 7 discusses the details of setting up Nagios.
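To make the alerting side concrete, the sketch below shows what a Nagios service definition for a Hadoop-related check might look like. The host name, contact group, and the check_hdfs_capacity plug-in are all hypothetical stand-ins for whatever checks you actually deploy.

define service {
    use                   generic-service
    host_name             namenode1
    service_description   HDFS capacity
    check_command         check_nrpe!check_hdfs_capacity
    contact_groups        hadoop-admins     ; who receives the e-mail or page
    notification_options  w,c,r             ; alert on warning, critical, recovery
}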

Encryption: Relevance and Implementation for Hadoop

Being a distributed system, Hadoop has data spread across a large number of hosts and stored locally. There is a large amount of data communication between these hosts; hence data is vulnerable in transit as well as at rest on local storage. Hadoop started as a data store for collecting web usage data and other forms of nonsensitive, large-volume data, which is why Hadoop has no built-in provision for encrypting data.

Today, the situation is changing and Hadoop is increasingly being used to store sensitive warehoused data in the corporate world. This has created a need for the data to be encrypted in transit and at rest. Now there are a number of alternatives available to help you encrypt your data.

Encryption for Data in Transit

Internode communication in Hadoop uses protocols such as RPC, TCP/IP, and HTTP. RPC, which is used for communication between the NameNode, JobTracker, DataNodes, and Hadoop clients, can be encrypted using a simple Hadoop configuration option. That leaves the actual read/write of file data between clients and DataNodes (TCP/IP) and HTTP communication (web consoles, communication between the NameNode and Secondary NameNode, and MapReduce shuffle data) unencrypted.

It is possible to encrypt TCP/IP or HTTP communication, but that requires use of Kerberos or SASL frameworks. The current version of Hadoop allows network encryption (in conjunction with Kerberos) by setting explicit values in the configuration files core-site.xml and hdfs-site.xml. Chapter 4 will revisit this detailed setup and discuss network encryption at length.
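As a preview of that setup (a hedged sketch; the property names below are the common Apache ones, but exact names and valid values can vary across Hadoop versions, and all of it presupposes a working Kerberos configuration), the key settings look roughly like this:

<!-- core-site.xml: "privacy" adds encryption on top of RPC authentication and integrity -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the block data transferred between clients and DataNodes -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>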

Encryption for Data at Rest

There are a number of choices for implementing encryption at rest with Hadoop, but they are offered by different vendors and rely on their distributions to implement encryption. Most notable are the Intel Project Rhino (committed to the Apache Software Foundation and open source) and AWS (Amazon Web Services) offerings, which provide encryption for data stored on disk.

Because Hadoop usually deals with large volumes of data and encryption/decryption takes time, it is important that the framework used performs the encryption/decryption fast enough that it doesn’t impact performance. The Intel solution (shortly to be offered through the Cloudera distribution) claims to perform these operations with great speed—provided that Intel CPUs are used along with Intel disk drives and all the other related hardware. Let’s have a quick look at some details of Amazon’s encryption “at rest” option.

AWS encrypts data stored within HDFS and also supports encrypted data manipulation by other components such as Hive or HBase. This encryption can be transparent to users (if the necessary passwords are stored in configuration files) or can prompt them for passwords before allowing access to sensitive data, can be applied on a file-by-file basis, and can work in combination with external key management applications. This encryption can use symmetric as well as asymmetric keys. To use this encryption, sensitive files must be encrypted using a symmetric or asymmetric key before they are stored in HDFS.

When an encrypted file is stored within HDFS, it remains encrypted. It is decrypted as needed for processing and re-encrypted before it is moved back into storage. The results of the analysis are also encrypted, including intermediate results. Data and results are neither stored nor transmitted in unencrypted form. Figure 3-6 provides an overview of the process. Data stored in HDFS is encrypted using symmetric keys, while MapReduce jobs use symmetric keys (with certificates) for transferring encrypted data.


Figure 3-6. Details of at-rest encryption provided by Intel’s Hadoop distribution (now Project Rhino)
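Chapter 8 covers transparent, distribution-level encryption; as a simple illustration of the encrypt-before-storing idea described above (a sketch only, not how Project Rhino or AWS actually implement it; the key file and paths are hypothetical), a sensitive file can be symmetrically encrypted before it is loaded into HDFS:

# Encrypt the sensitive file with AES-256 using a locally held symmetric key
openssl enc -aes-256-cbc -salt -pass file:/secure/keys/claims.key \
        -in claims_2014.csv -out claims_2014.csv.enc

# Store only the encrypted copy in HDFS
hdfs dfs -put claims_2014.csv.enc /data/claims/

# Retrieve and decrypt when clear text is needed for processing
hdfs dfs -get /data/claims/claims_2014.csv.enc .
openssl enc -d -aes-256-cbc -pass file:/secure/keys/claims.key \
        -in claims_2014.csv.enc -out claims_2014.csv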

Chapter 8 will cover encryption in greater detail. It provides an overview of encryption concepts and protocols and then briefly discusses two options for implementing encryption: using Intel’s distribution (now available as Project Rhino) and using AWS to provide transparent encryption.

Summary

With a roadmap in hand, finding where you want to go and planning how to get there is much easier. This chapter has been your roadmap to techniques for designing and implementing security for Hadoop. After an overview of Hadoop architecture, you investigated authentication using Kerberos to provide secure access. You then learned how authorization is used to specify the level of access, and that you need to follow a multistep process of analyzing data and needs to define an effective authorization strategy.

To supplement your security through authentication and authorization, you need to monitor for unauthorized access or unforeseen malicious attacks continuously; tools like Ganglia or Nagios can help. You also learned the importance of logging all access to Hadoop daemons using the Log4j logging system and Hadoop daemon logs as well as audit logs.

Last, you learned about encryption of data in transit (as well as at rest) and why it is important as an additional level of security: it is the only way to stop hackers who have bypassed the authentication and authorization layers from reading your data. To implement encryption for Hadoop, you can use solutions from AWS (Amazon Web Services) or Intel's Project Rhino.

For the remainder of the book, you’ll follow this roadmap, digging deeper into each of the topics presented in this chapter. We’ll start in Chapter 4 with authentication.
