Introduction

Last year, I was designing security for a client who was looking for a reference book that talked about security implementations in the Hadoop arena, simply so he could avoid known issues and pitfalls. To my chagrin, I couldn’t locate a single book for him that covered the security aspect of Hadoop in detail or provided options for people who were planning to secure their clusters holding sensitive data! I was disappointed and surprised. Everyone planning to secure their Hadoop cluster must have been going through similar frustration. So I decided to put my security design experience to broader use and write the book myself.

As Hadoop gains more corporate support and usage by the day, we all need to recognize and focus on the security aspects of Hadoop. Corporate implementations also involve following regulations and laws for data protection and confidentiality, and such security issues are a driving force for making Hadoop “corporation ready.”

Open-source software usually lacks organized documentation and consensus on a single way to perform a particular functional task, and Hadoop is no different in that regard. The various distributions that mushroomed in the last few years vary in their implementation of various Hadoop functions, and some functions, such as authorization or encryption, are not even provided by all the vendor distributions. So, in this way, Hadoop is like Unix of the ’80s or ’90s: Open source development has led to a large number of variations and in some cases deviations from functionality. Because of these variations, devising a common strategy to secure your Hadoop installation is difficult. In this book, I have tried to provide a strategy and solution (an open source solution when possible) that will apply in most cases, but exceptions may exist, especially if you use a Hadoop distribution that’s not well-known.

It’s been a great and exciting journey developing this book, and I deliberately say “developing,” because I believe that authoring a technical book is very similar to working on a software project. There are challenges, rewards, exciting developments, and of course, unforeseen obstacles—not to mention deadlines!

Who This Book Is For

This book is an excellent resource for IT managers planning a production Hadoop environment or Hadoop administrators who want to secure their environment. This book is also for Hadoop developers who wish to implement security in their environments, as well as students who wish to learn about Hadoop security. This book assumes a basic understanding of Hadoop (although the first chapter revisits many basic concepts), Kerberos, relational databases, and Hive, plus an intermediate-level understanding of Linux.

How This Book Is Structured

The book is divided into five parts: Part I, “Introducing Hadoop and Its Security,” contains Chapters 1, 2, and 3; Part II, “Authenticating and Authorizing Within Your Hadoop Cluster,” spans Chapters 4 and 5; Part III, “Audit Logging and Security Monitoring,” houses Chapters 6 and 7; Part IV, “Encryption for Hadoop,” contains Chapter 8; and Part V holds the four appendices.

Here’s a preview of each chapter in more detail:

  • Chapter 1, “Understanding Security Concepts,” offers an overview of security, the security engineering framework, security protocols (including Kerberos), and possible security attacks. This chapter also explains how to secure a distributed system and discusses Microsoft SQL Server as an example of a secure system.
  • Chapter 2, “Introducing Hadoop,” introduces the Hadoop architecture and Hadoop Distributed File System (HDFS), and explains the security issues inherent to HDFS and why it’s easy to break into an HDFS installation. It also introduces Hadoop’s MapReduce framework and discusses its security shortcomings. Last, it discusses the Hadoop Stack.
  • Chapter 3, “Introducing Hadoop Security,” serves as a roadmap to techniques for designing and implementing security for Hadoop. It introduces authentication (using Kerberos) for providing secure access, authorization to specify the level of access, and monitoring for unauthorized access or unforeseen malicious attacks (using tools like Ganglia or Nagios). You’ll also learn the importance of logging all access to Hadoop daemons (using the Log4j logging system) and the importance of data encryption (both in transit and at rest).
  • Chapter 4, “Open Source Authentication in Hadoop,” discusses how to secure your Hadoop cluster using open source solutions. It starts by securing a client using PuTTY, then describes the Kerberos architecture and details a Kerberos implementation for Hadoop step by step. In addition, you’ll learn how to secure interprocess communication that uses the RPC (remote procedure call) protocol, how to encrypt HTTP communication, and how to secure the data communication that uses DTP (data transfer protocol).
  • Chapter 5, “Implementing Granular Authorization,” starts with ways to determine security needs (based on application) and then examines methods to design fine-grained authorization for applications. Directory- and file-level permissions are demonstrated using a real-world example, and then the same example is re-implemented using HDFS Access Control Lists and Apache Sentry with Hive.
  • Chapter 6, “Hadoop Logs: Relating and Interpretation,” discusses the use of logging for security. After a high-level discussion of the Log4j API and how to use it for audit logging, the chapter examines the Log4j logging levels and their purposes. You’ll learn how to correlate Hadoop logs to implement security effectively, and you’ll get a look at Hadoop analytics and a possible implementation using Splunk.
  • Chapter 7, “Monitoring in Hadoop,” discusses monitoring for security. It starts by discussing features that a monitoring system needs, with an emphasis on monitoring distributed clusters. Thereafter, it discusses the Hadoop metrics you can use for security purposes and examines the use of Ganglia and Nagios, the two most popular monitoring applications for Hadoop. It concludes by discussing some helpful plug-ins for Ganglia and Nagios that provide security-related functionality and also discusses Ganglia integration with Nagios.
  • Chapter 8, “Encryption in Hadoop,” begins with some data encryption basics, discusses popular encryption algorithms and their applications (certificates, keys, hash functions, digital signatures), defines what can be encrypted for a Hadoop cluster, and lists some of the popular vendor options for encryption. A detailed implementation of encryption for HDFS and Hive data at rest follows, showing Intel’s distribution in action. The chapter concludes with a step-by-step implementation of encryption at rest using Elastic MapReduce VM (EMR) from Amazon Web Services.

Downloading the Code

The source code for this book is available in ZIP file format in the Downloads section of the Apress web site (www.apress.com).

Contacting the Author

You can reach Bhushan Lakhe at [email protected] or [email protected].
