INTRODUCTION

IN THIS FAST-PACED WORLD of ever-changing technology, we have been drowning in information. We are generating and storing massive quantities of data. With the proliferation of devices on our networks, we have seen amazing growth in both the diversity of information formats and the sheer volume of data — Big Data.

But let’s face it — most of our organizations haven’t been able to manage these massive quantities of data effectively, and we haven’t been able to use this information to make better decisions and to do business smarter. We have been overwhelmed with vast amounts of data, while at the same time we have been starved for knowledge. The result for companies is lost productivity, lost opportunities, and lost revenue.

Over the course of the past decade, many technologies have promised to help us process and analyze the vast amounts of information we have, and most of these technologies have come up short. We know this because, as programmers focused on data, we have tried them all. Many approaches have been proprietary, resulting in vendor lock-in. Some were promising but couldn’t scale to handle large data sets, and many were hyped up so much that they couldn’t meet expectations, or they simply were not ready for prime time.

When Apache Hadoop entered the scene, however, everything was different. Certainly there was hype, but this was an open source project that had already found incredible success in massively scalable commercial applications. Although the learning curve was sometimes steep, for the first time we were able to easily write programs and perform data analytics on a massive scale — in a way that we hadn’t been able to do before. Based on the MapReduce paradigm, which enables developers to bring processing to the data distributed on a scalable cluster of machines, Hadoop let us perform complex data analysis that simply wasn’t possible in the past.

It’s not that there is a lack of books about Hadoop. Quite a few have been written, and many of them are very good. So, why this one? Well, when the authors started working with Hadoop, we wished there were a book that went beyond APIs and explained how the many parts of the Hadoop ecosystem work together and can be used to build enterprise-grade solutions. We were looking for a book that walks the reader through data design and its impact on implementation, explains how MapReduce works, and shows how to reformulate specific business problems as MapReduce jobs. We were looking for answers to the following questions:

  • What are MapReduce’s strengths and weaknesses, and how can you customize it to better suit your needs?
  • Why do you need an additional orchestration layer on top of MapReduce, and how does Oozie fit the bill?
  • How can you simplify MapReduce development using domain-specific languages (DSLs)?
  • What is this real-time Hadoop that everyone is talking about, what can it do, and what can it not do? How does it work?
  • How do you secure your Hadoop applications, what security vulnerabilities must you consider, and what are the approaches for dealing with them?
  • How do you transition your Hadoop application to the cloud, and what are important considerations when doing so?

When the authors started their Hadoop adventure, we had to spend long days (and often nights) browsing all over the Internet and through the Hadoop source code, talking to people, and experimenting with the code to find answers to these questions. We then decided to share our findings and experience by writing this book, with the goal of giving you, the reader, a head start in understanding and using Hadoop.

WHO THIS BOOK IS FOR

This book was written by programmers for programmers. The authors are technologists who develop enterprise solutions, and our goal with this book is to provide solid, practical advice for other developers using Hadoop. The book is targeted at software architects and developers trying to better understand and leverage Hadoop, not only for performing simple data analysis, but also as a foundation for enterprise applications.

Because Hadoop is a Java-based framework, this book contains a wealth of code samples that require fluency in Java. Additionally, the authors assume that the readers are somewhat familiar with Hadoop, and have some initial MapReduce knowledge.

Although this book was designed to be read from cover to cover in a building-block approach, some sections may be more applicable to certain groups of people. Data designers who want to understand Hadoop’s data storage capabilities will likely benefit from Chapter 2. Programmers getting started with MapReduce will most likely focus on Chapters 3 through 5, and Chapter 13. Developers who have experienced the complexity of coordinating multiple MapReduce applications without a workflow system like Oozie will most likely want to focus on Chapters 6 through 8. Those interested in real-time Hadoop will want to focus on Chapter 9. People interested in using the Amazon cloud for their implementations might focus on Chapter 11, and security-minded individuals may want to focus on Chapters 10 and 12.

WHAT THIS BOOK COVERS

Right now, everyone’s doing Big Data. Organizations are making the most of massively scalable analytics, and most of them are trying to use Hadoop for this purpose. This book concentrates on the architecture and approaches for building Hadoop-based advanced enterprise applications, and covers the following main Hadoop components used for this purpose:

  • Blueprint architecture for Hadoop-based enterprise applications
  • Base Hadoop data storage and organization systems
  • Hadoop’s main execution framework (MapReduce)
  • Hadoop’s Workflow/Coordinator server (Oozie)
  • Technologies for implementing Hadoop-based real-time systems
  • Ways to run Hadoop in the cloud environment
  • Technologies and architecture for securing Hadoop applications

HOW THIS BOOK IS STRUCTURED

The book is organized into 13 chapters.

Chapter 1 (“Big Data and the Hadoop Ecosystem”) provides an introduction to Big Data, and the ways Hadoop can be used for Big Data implementations. Here you learn how Hadoop solves Big Data challenges, and which core Hadoop components can work together to create a rich Hadoop ecosystem applicable for solving many real-world problems. You also learn about available Hadoop distributions, and emerging architecture patterns for Big Data applications.

The foundation of any Big Data implementation is data storage design. Chapter 2 (“Storing Data in Hadoop”) covers distributed data storage provided by Hadoop. It discusses both the architecture and APIs of the two main Hadoop data storage mechanisms — HDFS and HBase — and provides some recommendations on when to use each one. Here you learn about the latest developments in both HDFS (federation) and HBase (new file formats and coprocessors). This chapter also covers HCatalog (the Hadoop metadata management solution) and Avro (a serialization/marshaling framework), as well as the roles they play in Hadoop data storage.
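To give you an early taste of the HDFS API discussed there, the following is a minimal sketch (it is not the book’s sample code) that writes a small file to HDFS and reads it back. A default configuration is assumed, and the path is just a placeholder:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);       // the configured (for example, HDFS) file system
            Path file = new Path("/tmp/hello.txt");     // placeholder path

            // Write a small text file, overwriting it if it already exists
            FSDataOutputStream out = fs.create(file, true);
            out.writeBytes("Hello, HDFS\n");
            out.close();

            // Read the file back and print its first line
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
            System.out.println(in.readLine());
            in.close();
        }
    }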

As the main Hadoop execution framework, MapReduce is one of the main topics of this book and is covered in Chapters 3, 4, and 5.

Chapter 3 (“Processing Your Data with MapReduce”) provides an introduction to the MapReduce framework. It covers the MapReduce architecture, its main components, and the MapReduce programming model. This chapter also focuses on MapReduce application design, design patterns, and general MapReduce “dos” and “don’ts.”
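To give you a feel for that programming model before you get there, here is a condensed sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API (illustrative only; it is not the book’s sample code, and the input and output paths are taken from the command line):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every word in its input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer (also used as a combiner here): sums the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }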

Chapter 4 (“Customizing MapReduce Execution”) builds on Chapter 3 by covering important approaches for customizing MapReduce execution. You learn about the aspects of MapReduce execution that can be customized, and use the working code examples to discover how this can be done.

Finally, in Chapter 5 (“Building Reliable MapReduce Apps”) you learn about approaches for building reliable MapReduce applications, including testing and debugging, as well as using built-in MapReduce facilities (for example, logging and counters) for gaining insight into MapReduce execution.

Despite the power of MapReduce itself, practical solutions typically require bringing multiple MapReduce applications together, which involves quite a bit of complexity. This complexity can be significantly reduced by using the Hadoop Workflow/Coordinator engine — Oozie — which is described in Chapters 6, 7, and 8.

Chapter 6 (“Automating Data Processing with Oozie”) provides an introduction to Oozie. Here you learn about Oozie’s overall architecture, its main components, and the programming language for each component. You also learn about Oozie’s overall execution model, and the ways you can interact with the Oozie server.
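As a small preview of that last point, the following sketch uses the Oozie Java client API to submit a workflow and poll it to completion; the server URL, HDFS paths, and port numbers are placeholders:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder URL of the Oozie server
            OozieClient client = new OozieClient("http://localhost:11000/oozie");

            // Job properties: where the deployed workflow application lives in HDFS,
            // plus parameters referenced inside workflow.xml
            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/me/my-workflow");
            conf.setProperty("nameNode", "hdfs://localhost:8020");
            conf.setProperty("jobTracker", "localhost:8021");

            // Submit and start the workflow, then poll until it stops running
            String jobId = client.run(conf);
            while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
                Thread.sleep(10 * 1000);
            }
            System.out.println("Workflow " + jobId + " finished with status "
                    + client.getJobInfo(jobId).getStatus());
        }
    }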

Chapter 7 (“Using Oozie”) builds on the knowledge you gain in Chapter 6 and presents a practical end-to-end example of using Oozie to develop a real-world application. This example demonstrates how different Oozie components are used in a solution, and shows both design and implementation approaches.

Finally, Chapter 8 (“Advanced Oozie Features”) discusses advanced features, and shows approaches to extending Oozie and integrating it with other enterprise applications. In this chapter, you learn some tips and tricks that developers need to know — for example, how dynamic generation of Oozie code allows developers to overcome some existing Oozie shortcomings that can’t be resolved in any other way.

One of the hottest trends related to Big Data today is the capability to perform “real-time analytics.” This topic is discussed in Chapter 9 (“Real-Time Hadoop”). The chapter begins by providing examples of real-time Hadoop applications used today, and presents the overall architectural requirements for such implementations. You learn about three main approaches to building such implementations — HBase-based applications, real-time queries, and stream-based processing.

This chapter provides two examples of HBase-based, real-time applications — a fictitious picture-management system, and a Lucene-based search engine using HBase as its back end. You also learn about the overall architecture for implementation of a real-time query, and the way two concrete products — Apache Drill and Cloudera’s Impala — implement it. This chapter also covers another type of real-time application — complex event processing — including its overall architecture, and the way HFlame and Storm implement this architecture. Finally, this chapter provides a comparison between real-time queries, complex event processing, and MapReduce.
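To hint at what the HBase-based approach looks like in code, here is a minimal sketch of the low-latency reads and writes such applications rely on, using the HBase client API (the table, row key, and column names are hypothetical; this is not the book’s sample code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseLowLatencySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
            HTable table = new HTable(conf, "pictures");       // hypothetical table

            // Low-latency write: one row, one column
            Put put = new Put(Bytes.toBytes("user1#pic42"));
            put.add(Bytes.toBytes("meta"), Bytes.toBytes("title"), Bytes.toBytes("Sunset"));
            table.put(put);

            // Low-latency read of the same row
            Result result = table.get(new Get(Bytes.toBytes("user1#pic42")));
            String title = Bytes.toString(
                    result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("title")));
            System.out.println("title = " + title);

            table.close();
        }
    }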

An often skipped topic in Hadoop application development — but one that is crucial to understand — is Hadoop security. Chapter 10 (“Hadoop Security”) provides an in-depth discussion about security concerns related to Big Data analytics and Hadoop — specifically, Hadoop’s security model and best practices. Here you learn about Project Rhino — a framework that enables developers to extend Hadoop’s security capabilities, including encryption, authentication, authorization, Single Sign-On (SSO), and auditing.

Cloud-based usage of Hadoop raises interesting architectural challenges. Chapter 11 (“Running Hadoop Applications on AWS”) describes these challenges, and covers different approaches to running Hadoop on the Amazon Web Services (AWS) cloud. This chapter also discusses trade-offs and examines best practices. You learn about Elastic MapReduce (EMR) and additional AWS services (such as S3, CloudWatch, Simple Workflow, and so on) that can be used to supplement Hadoop’s functionality.
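As a flavor of what programmatic use of EMR looks like, here is a minimal sketch that starts a job flow running a prebuilt MapReduce JAR via the AWS SDK for Java; the credentials, bucket names, and instance types are all placeholders:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class EmrSubmitSketch {
        public static void main(String[] args) {
            // Placeholder credentials; use your own access/secret keys
            AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

            // A single step that runs a MapReduce JAR stored in S3
            StepConfig step = new StepConfig()
                    .withName("word count")
                    .withHadoopJarStep(new HadoopJarStepConfig()
                            .withJar("s3://my-bucket/wordcount.jar")
                            .withArgs("s3://my-bucket/input", "s3://my-bucket/output"));

            RunJobFlowRequest request = new RunJobFlowRequest()
                    .withName("sample job flow")
                    .withLogUri("s3://my-bucket/logs")
                    .withSteps(step)
                    .withInstances(new JobFlowInstancesConfig()
                            .withInstanceCount(3)
                            .withMasterInstanceType("m1.large")
                            .withSlaveInstanceType("m1.large")
                            .withKeepJobFlowAliveWhenNoSteps(false));

            RunJobFlowResult result = emr.runJobFlow(request);
            System.out.println("Started job flow: " + result.getJobFlowId());
        }
    }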

Apart from securing Hadoop itself, Hadoop implementations often must integrate with other enterprise components — data is often imported into Hadoop and also exported to other systems. Chapter 12 (“Building Enterprise Security Solutions for Hadoop Implementations”) covers how enterprise applications that use Hadoop are best secured, and provides examples and best practices.

The last chapter of the book, Chapter 13 (“Hadoop’s Future”), provides a look at current and emerging industry trends and initiatives around Hadoop. Here you learn about the availability and use of Hadoop DSLs that simplify MapReduce development, about a new MapReduce resource management system (YARN) and MapReduce runtime extension (Tez), and about the most significant directions in which Hadoop is heading.

WHAT YOU NEED TO USE THIS BOOK

All of the code presented in the book is implemented in Java, so to use it, you will need a Java compiler and development environment. All development was done in Eclipse, but because every project has a Maven pom file, it should be simple enough to import the projects into any development environment of your choice.

All the data access and MapReduce code has been tested on both Hadoop 1 (the Cloudera CDH 3 distribution and Amazon EMR) and Hadoop 2 (the Cloudera CDH 4 distribution). As a result, it should work with any Hadoop distribution. The Oozie code was tested on the latest version of Oozie (available, for example, as part of the Cloudera CDH 4.1 distribution).

The source code for the samples is organized in Eclipse projects (one per chapter), and is available for download from the Wrox website at:

www.wrox.com/go/prohadoopsolutions

CONVENTIONS

To help you get the most from the text and keep track of what’s happening, we’ve used a number of conventions throughout the book.


NOTE This indicates notes, tips, hints, tricks, and/or asides to the current discussion.

As for styles in the text:

  • We highlight new terms and important words when we introduce them.
  • We show keyboard strokes like this: Ctrl+A.
  • We show filenames, URLs, and code within the text like so: persistence.properties.
  • We present code in two different ways:
        We use a monofont type with no highlighting for most code examples.
        We use bold to emphasize code that is particularly important in the present context or to show changes from a previous code snippet.

SOURCE CODE

As you work through the examples in this book, you may choose either to type in all the code manually, or to use the source code files that accompany the book. Source code for this book is available for download at www.wrox.com. Specifically, for this book, the code download is on the Download Code tab at:

www.wrox.com/go/prohadoopsolutions

You can also search for the book at www.wrox.com by ISBN (the ISBN for this book is 978-1-118-61193-7) to find the code. And a complete list of code downloads for all current Wrox books is available at www.wrox.com/dynamic/books/download.aspx.

Throughout selected chapters, you’ll also find references to the names of code files as needed in listing titles and text.

Most of the code on www.wrox.com is compressed in a .ZIP, .RAR, or similar archive format appropriate to the platform. Once you download the code, just decompress it with an appropriate decompression tool.


NOTE Because many books have similar titles, you may find it easiest to search by ISBN; this book’s ISBN is 978-1-118-61193-7.

Alternatively, you can go to the main Wrox code download page at www.wrox.com/dynamic/books/download.aspx to see the code available for this book and all other Wrox books.

ERRATA

We make every effort to ensure that there are no errors in the text or in the code. However, no one is perfect, and mistakes do occur. If you find an error in one of our books, like a spelling mistake or faulty piece of code, we would be very grateful for your feedback. By sending in errata, you may save another reader hours of frustration, and at the same time, you will be helping us provide even higher quality information.

To find the errata page for this book, go to:

www.wrox.com/go/prohadoopsolutions

Click the Errata link. On this page, you can view all errata that have been submitted for this book and posted by Wrox editors.

If you don’t spot “your” error on the Book Errata page, go to www.wrox.com/contact/techsupport.shtml and complete the form there to send us the error you have found. We’ll check the information and, if appropriate, post a message to the book’s errata page and fix the problem in subsequent editions of the book.

P2P.WROX.COM

For author and peer discussion, join the P2P forums at http://p2p.wrox.com. The forums are a web-based system for you to post messages relating to Wrox books and related technologies, and to interact with other readers and technology users. The forums offer a subscription feature that e-mails you about topics of your choosing when new posts are made. Wrox authors, editors, other industry experts, and your fellow readers are present on these forums.

At http://p2p.wrox.com, you will find a number of different forums that will help you, not only as you read this book, but also as you develop your own applications. To join the forums, just follow these steps:

1. Go to http://p2p.wrox.com and click the Register link.
2. Read the terms of use and click Agree.
3. Complete the required information to join, as well as any optional information you wish to provide, and click Submit.
4. You will receive an e-mail with information describing how to verify your account and complete the joining process.

NOTE You can read messages in the forums without joining P2P, but in order to post your own messages, you must join.

Once you join, you can post new messages and respond to messages other users post. You can read messages at any time on the web. If you would like to have new messages from a particular forum e-mailed to you, click the Subscribe to this Forum icon by the forum name in the forum listing.

For more information about how to use the Wrox P2P, be sure to read the P2P FAQs for answers to questions about how the forum software works, as well as many common questions specific to P2P and Wrox books. To read the FAQs, click the FAQ link on any P2P page.
