CHAPTER 1


Introducing Giraph

This chapter covers

  • Using large amounts of graph data to obtain deeper insights
  • Understanding the specific objectives and design goals of Giraph
  • Positioning Giraph in the Hadoop ecosystem
  • Identifying the differences between Giraph and other graph-processing tools

This chapter discusses the importance of data-driven decision making and how to execute it with Big Data. You see examples of how using large amounts of data can bring added value. You also learn how Giraph fits among the myriad of tools you can use to analyze large datasets. The chapter positions Giraph in the Hadoop ecosystem and introduces how it plays with its teammates. In addition, the chapter positions Giraph in contrast with other graph-based technologies, such as graph databases, and explains when and why you should use each technology. This is an introductory chapter, and as such it introduces concepts and notions that are explored more deeply in the following chapters.

Data, Data, Data

A couple of years ago, the New York Times published a fascinating article about data analytics.1 According to the story, a father near Minneapolis complained to a Target employee about his teenage daughter receiving coupons for pregnancy-related items. “My daughter got this in the mail!” he said, pointing at the printed coupon. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” The Target representative apologized, and then called the father a few days later to apologize further—but the conversation took an unexpected turn. “I had a talk with my daughter,” the father said. “It turns out there have been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

Target, like other retailers, assigns each customer a unique ID (such as their credit card number) and uses these IDs to record customers’ purchases in stores and online. By analyzing this historical data, companies are able to profile customers and their shopping behavior. One of the things they can recognize is pregnancy. Not only that, but they can estimate the due date within a small time window. Apparently lotions are good indicators; many people buy lotion, but pregnant women tend to buy large quantities of unscented lotions around their second trimester. Moreover, in their first trimester, they buy supplements like calcium, magnesium, and zinc. These products are not specific to pregnancy, but a sudden spike in consumption, along with the purchase of other products such as hand sanitizers and washcloths, acts as a signal of pregnancy and closeness to the delivery date. Target’s statisticians analyzed customer purchases to figure out these patterns.

It is likely that in the near future, when you walk through shops and malls, your mobile phone will notify you of offers that fit your current needs and tastes, like those shoes you always wanted or the spaghetti you are running out of. But purchase data is only one example of data that can be used to learn more about customer behavior. We are surrounded by new devices and sensors that empower us to measure and record the world around us with increasing precision. As Mark Weiser put it, “The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it.”2 It is probably time for our ubiquitous computing devices to disappear into the fabric of everyday life, with their Internet connections and their set of full-fledged sensors. When this happens, we will be provided with a volume of data that we have not had to manage before. To allow these technologies to effectively integrate and disappear, the infrastructure that manages the data they produce must evolve with them.

The abilities provided by these technologies affect us deeply and broadly, including the way we communicate and socialize. The more we can connect and share, the more we do. As illustrated in Figure 1-1, the total amount of data created and shared—from documents to pictures to tweets—in 2011 was 2 zettabytes,3 around nine times the amount in 2005; and the predicted amount for 2015 is 8 zettabytes. This data includes information about social behavior, finances, weather, biology, and so on. It is not only valuable for the sciences but can also be used effectively by enterprises to understand their customers and provide better services.


Figure 1-1. Past and predicted amount of data created and shared

Now consider this. We act differently depending on our mood. When we feel good and happy, we tend to open up to the world, try new things, and take risks. On the other hand, a negative mood can make us pessimistic and less prone to new adventures. Recently, scientists have taken this idea to an extreme and have tried to see if there is a connection between people’s moods and their behavior on the stock market. Do they invest differently, buying or selling certain stocks, depending on how they feel? Data exists about stock-market trends, but it is much more difficult to collect data about the mood of millions of people every second. So the scientists looked at Twitter. People post on Twitter about all sorts of things: what they are doing, a movie they liked, something that made them happy or angry, jokes and quotes, and so on. Tweets are simple and contain more emotional content than blog posts or articles. As such, the text in tweets can be quite representative of the mood of the people who wrote them—and they write thousands every second.

Sentiment analysis is an application of natural language processing (NLP) techniques to determine the attitude of a writer toward a particular topic. Simply put, by looking at the words and characters used, such as adjectives, verbs, emoticons, and punctuation, it is possible to assign that text a positive or negative classification. Sometimes it is possible to assign an evaluation of the affective state of the writer, such as “happy,” “sad,” or “angry.” These techniques are applied to social media to measure customer attitudes toward certain brands, or voter attitudes regarding candidates during political campaigns.
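
To make the technique concrete, the following is a minimal sketch of lexicon-based scoring in Java: count the positive and negative words a text contains and compare the totals. The word lists and class name are made up for illustration; real sentiment analysis relies on much larger lexicons and proper NLP pipelines.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NaiveSentiment {
  // Tiny, made-up lexicons; real sentiment lexicons contain thousands of entries.
  private static final Set<String> POSITIVE =
      new HashSet<>(Arrays.asList("happy", "great", "love", "awesome", ":)"));
  private static final Set<String> NEGATIVE =
      new HashSet<>(Arrays.asList("sad", "angry", "hate", "terrible", ":("));

  // Returns a score > 0 for positive text, < 0 for negative text, 0 for neutral text.
  public static int score(String text) {
    int score = 0;
    for (String token : text.toLowerCase().split("\\s+")) {
      if (POSITIVE.contains(token)) score++;
      if (NEGATIVE.contains(token)) score--;
    }
    return score;
  }

  public static void main(String[] args) {
    System.out.println(score("i love this movie it is awesome :)")); // prints 3 (positive)
  }
}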

Returning to the connection between moods and the stock market, scientists were able to show that after a number of days of overall calm, as measured on Twitter, the Dow Jones index rose, whereas the reverse happened after an anxious period. Although the researchers claimed an accuracy of around 87%, it is still under discussion whether Twitter data alone can really produce such precise predictions; but it is generally accepted that it can act reliably as an indicator. Whether for predicting the behavior of financial markets or trends in social media, this is a great example of how small pieces of information, such as 140-character tweets, can be very valuable when looked at together.

Practitioners of business intelligence are looking at data analytics to discover patterns in their data and support their decision-making. Looking at how customers use products allows designers and engineers to understand how to improve those products. Data about user behavior provides insights into how to adapt products to users with new or better features. The study of product usage has been traditionally performed via surveys and usability studies. Users are asked to describe how they perform certain actions and tasks—How do you search for a lost e-mail in your client software, choose a hiking backpack on an e-commerce site, organize your appointments in your calendar?—and to list features that are missing. They may also be asked to use a product in a controlled environment, such as a lab, giving the product designers the opportunity to measure the effectiveness of their design decisions. But these approaches have one big drawback: they cannot be performed on a large scale. How many users can you test, and how realistic are these evaluations? Or, how much data can be collected, and how reliable is it?

Analyzing data is not only useful for analysts and scientists. Machine-learning algorithms can discover patterns and classify and categorize data as part of an application. For example, consider users and books. If you find out that Mark and Sharon have similar tastes in books, you can advise Mark to read the books that Sharon has read but Mark has not (yet), and vice versa. Or, to turn it around, if you know that two books are similar, you can advise the readers of the first to read the second, and so on. What these algorithms learn automatically from the data can be integrated directly into an application, making it more personalized and proactive. These concepts are at the core of recommender systems, such as those used by Amazon and Facebook to recommend books and friends. Or think of a search engine like Google. Google Search receives queries from users asking for specific content. It also collects the entries users click on the results page, not to mention clicks on content provided by other Google products such as Gmail, Google Maps, and so on. Putting this information together, Google Search can provide different search results for the same query, depending on the user who performs it. If I search for a music shop, Google will probably give more relevance to shops that sell jazz music in Amsterdam, because it knows where I live and what kind of music I like. This is called personalized search.

One interesting fact about data is that the more you have, the better you can understand and model the phenomena you are looking at. In fact, studies show something even more striking. Recently, scientists have tested “naive” machine-learning algorithms against more sophisticated ones. They have discovered that often, the naive algorithms outperform the more sophisticated ones if they are fed large enough volumes of data. Consider the Netflix challenge. Netflix provided a large dataset containing the ratings of nearly half a million users for 18,000 movies. The challenge was to predict user ratings for movies they have not seen yet. These studies show that a simpler algorithm that takes into account additional data about the movies, such as information from the Internet Movie Database (IMDb), can outperform a more sophisticated algorithm that uses only the Netflix data. This example is tricky, because IMDb is a different (independent) dataset; but in general, machine-learning algorithms benefit from more data because it reduces noise and helps build better models.

Another counterintuitive point is that, given a large enough dataset, a sophisticated algorithm may perform more poorly than a naive one. Basically, more sophisticated heuristics do not help when algorithms are given a bigger picture; they may even achieve worse results. It is as if the additional complexity of the sophisticated algorithms compensates for an absence of evidence, which is why they work better with small datasets; but that same complexity biases the results that can be achieved when more evidence is available. One way to put it is that sophisticated algorithms can be more opinionated, instead of basing their conclusions on facts. This is good news. It means you can mine data with a fairly simple and well-understood toolset (the basic algorithms that have been studied extensively over the last 30 years), provided you can execute them at scale.

Important  The ability to create models based on data is the basis of creating better products and services. The volume of data being produced is already huge and is constantly growing. Unfortunately, there is no shortcut that will let you avoid facing these large volumes of data. You have to design algorithms and build platforms that can manage data at this scale. Giraph is one of these platforms.

From Big Data to Big Graphs

The previous section discussed the importance of looking at large volumes of data to extract useful intelligence. The examples presented use many small pieces of data that are aggregated to extract trends. What is the average age of people buying a product? In what month of the year does that product sell best? Which other product has a similar trend? The more data you have, the more accurate the answers you can give to questions like these. However, for certain kinds of data, you can also take advantage of the connections between the entities in your data. In those cases, more important than the pieces of information (and their aggregations) is the way those pieces are connected—more precisely, the structure of the network created by these connections. Unfortunately, the power that comes from analyzing connected data has a cost, because analyzing graphs is usually computationally expensive and introduces new challenges that call for specialized tools. Giraph is a framework you can use to run graph analytics on very large graphs across many commodity machines. It was specifically designed to solve problems that take the connectedness of data into account. In other words, it was designed to process graphs.

A graph is a structure that represents abstract entities (or vertices) and their relationships (or edges). Graphs are used in computer science in a variety of domains to represent data and express problems. Many common data structures like trees and linked lists are graphs: their nodes are connected with links according to the criteria that define each specific data structure. Graphs are also used in other disciplines; for example, in the social sciences, a graph is usually called a network. Networks are used to represent relationships between individuals, such as in a social network, as illustrated in Figure 1-2. As in the case of graphs in computer science, social networks were used in the social sciences long before social networking sites like Facebook became popular. And in biology, graphs are used to map interactions between proteins. Having such a connected structural model lets researchers look at data in a more cohesive way than as just a collection of single items. Chapter 2 looks at graphs in more depth and explains how to model data and problems through them.


Figure 1-2. A social network of individuals represented with a graph
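
To make the abstraction tangible, here is a minimal sketch (in Java, with made-up names) of how a small social network like the one in Figure 1-2 can be represented as an adjacency list, one of the most common in-memory representations of a graph:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A tiny undirected graph stored as an adjacency list (illustrative only).
public class SimpleGraph {
  private final Map<String, List<String>> adjacency = new HashMap<>();

  // Adds an undirected edge by registering each endpoint as a neighbor of the other.
  public void addEdge(String a, String b) {
    adjacency.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
    adjacency.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
  }

  public List<String> neighbors(String vertex) {
    return adjacency.getOrDefault(vertex, new ArrayList<>());
  }

  public static void main(String[] args) {
    SimpleGraph network = new SimpleGraph();
    network.addEdge("Mark", "Sharon");
    network.addEdge("Mark", "John");
    System.out.println(network.neighbors("Mark")); // [Sharon, John]
  }
}

A structure like this is trivial to build for a graph that fits in the memory of one machine; the interesting questions arise when it does not, which is exactly where Giraph comes in.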

To give you a better idea of how looking at data connections as a graph can give you added value, and the types of problems for which Giraph may be helpful, let’s consider some examples. Recent research has shown that by looking at the social network of 1.3 million individuals on Facebook, it was possible to predict which people were partners and whether they were going to break up in the next two months (of course, without looking at their relationship status). The researchers used only the network structure—the friendship relationships between individuals. The concept behind this research is social dispersion. Basically, just looking at the number of friends two individuals have in common—called embeddedness—does not provide enough information and is a weak predictor. Dispersion, on the other hand, measures how much these common friends are not connected to each other. In other words, high dispersion between two people means they have friends in common but only a few of those friends are friends with each other. According to the data, couples with long-lasting relationships tend to present high dispersion. Intuitively, the results suggest that strong romantic relationships are those in which people participate in different social groups, which they share with their partners but which remain separate. Looking at one individual and selecting from her social network individuals with whom she has high dispersion generates a list of possible partners for that individual; about 60% of the time, the person at the top of this list is indeed the correct partner. Moreover, couples without this particular social structure are more likely to split in the near future.
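
The two metrics are easy to express in code. The sketch below (simplified with respect to the normalized, recursive variant used in the actual study; names and types are illustrative) counts common friends for embeddedness and, for dispersion, counts how many pairs of those common friends are not themselves connected:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified embeddedness and dispersion between two users u and v.
public class SocialMetrics {

  // Embeddedness: how many friends u and v have in common.
  static int embeddedness(Map<String, Set<String>> friends, String u, String v) {
    Set<String> common = new HashSet<>(friends.get(u));
    common.retainAll(friends.get(v));
    return common.size();
  }

  // Dispersion: how many pairs of common friends are not friends with each other.
  static int dispersion(Map<String, Set<String>> friends, String u, String v) {
    Set<String> common = new HashSet<>(friends.get(u));
    common.retainAll(friends.get(v));
    List<String> c = new ArrayList<>(common);
    int disconnectedPairs = 0;
    for (int i = 0; i < c.size(); i++) {
      for (int j = i + 1; j < c.size(); j++) {
        if (!friends.get(c.get(i)).contains(c.get(j))) {
          disconnectedPairs++;
        }
      }
    }
    return disconnectedPairs;
  }
}

Running such a computation for every pair of connected users on a graph with more than a billion vertices is exactly the kind of workload Giraph was built for.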

Another example of how looking at data as a graph can bring added value is the Web. As you see in the following chapters, one of the reasons for Google’s initial success was that it looked at the Web as a graph. Web pages point to other pages through hyperlinks. Early on, Google’s competitors were crawling the Web by jumping from page to page and following the links on each page, but the graph structure was not used to provide search results. Results were provided solely based on the presence of query terms on web pages, which led to the spamming of popular keywords on unrelated pages. Google, on the other hand, decided to look at the structure of the graph—which pages linked to which other pages—to rank pages based on a notion of popularity. According to Google’s PageRank algorithm, the popularity of a page depends on the number of links pointing to it and how popular those pages are. This means the entire structure of the graph is taken into account to define the ranking of vertices. This approach was more sophisticated, but the results were impressive, and you know how the story continued.
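
In its basic formulation, the popularity (PageRank) of a page p can be written as PR(p) = (1 - d)/N + d * Σ PR(q)/L(q), where the sum runs over the pages q that link to p, L(q) is the number of outgoing links of q, N is the total number of pages, and d is a damping factor, conventionally around 0.85. The ranks are computed iteratively, starting from a uniform value, until they converge; a vertex-centric sketch of this computation appears later in this chapter.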

A final example relates to the temporal aspect of networks. A couple of years ago, researchers used a network representing sex buyers and escorts to study the spread of diseases. This dataset was a graph in which connections between individuals represented sexual intercourse; it also contained a temporal aspect, because connections had a timestamp. By studying the structure of the graph and how it developed over time, researchers were able to simulate epidemics and study how they developed: if some individuals were affected by a sexually transmitted infection, how many people might eventually be infected through these connections, and how would the structure of the graph affect the outcome? Interestingly, similar models have been applied to online social networks to study how information spreads through networks like Twitter and Facebook—a process that by no coincidence is often called viral.

Image Important  The Web, online social networks, the relationships among users who rate movies in a movie database, and biological networks are only a few examples of graphs. Whether in the context of business or science, viewing data as connected adds value to it. Analyzing these connections can help you extract intelligence in ways that are not possible otherwise.

Unfortunately, looking at data as graphs often requires computationally expensive algorithms, because the larger, interleaved nature of the data must be taken into account. Moreover, graphs are becoming larger and larger. For example, the Facebook social graph contains more than 1 billion users, each with around 200 friendship relationships. The Web, as indexed by Google, contains more than 50 trillion pages. These graphs are huge. If you want to extract information out of the interleaved connections between entities in your data with a simple but powerful programming paradigm, read on: Giraph is probably the right tool for your use case.

Why Giraph?

Some of the things mentioned in the previous sections are not exactly new. Big Data is a movement that has grown during the last decade from a problem faced by a few (Google, Yahoo!, Facebook, and so on) to a complete paradigm shift. Initially, Big Data was characterized by the design of new solutions that overcame the limitations of traditional ones when managing large volumes of data. The following three technical challenges come with managing data at a large scale:

  • Data is dynamic, and solutions must support high rates of updates and retrievals. Think of a web application with millions of users clicking links and requesting pages every second.
  • Data is large, and its management must be distributed across a number of machines. Think of petabytes of data that are too large to be stored and processed by a single machine.
  • The computing environment can be dynamic as well, and solutions have to manage data reliably across a number of machines that can and will fail at any moment. Think of a cluster of thousands of commodity machines, possibly distributed across different data centers.

Big Data solutions must face these challenges to support applications at large scale.

Managing Big Graphs requires solving these challenges and more. Graphs add more problem-specific challenges that Giraph was designed to handle:

  • Graph algorithms are often expressed as iterative computations, and a graph-processing system that supports such algorithms needs to perform iterations efficiently.
  • Graph algorithms tend to be explorative, which introduces random access to your data that is difficult to predict.
  • Usually, a small computation is executed multiple times on each graph entity, and such operations should be simple to express and parallelize.

This book looks at how Giraph attacks these specific challenges technically. For now, consider that Giraph was designed with graphs and graph algorithms in mind, and as such it can execute graph algorithms up to 100 times faster than other Big Data frameworks like MapReduce.

Definition  Giraph is a framework designed to execute iterative graph algorithms that can be easily parallelized across hundreds of commodity machines. Giraph is fault-tolerant and easy to program, and it can process graphs of massive scale.

Although it is important to execute graph algorithms efficiently and reliably, it is just as important that the programming paradigm be simple and not require time-consuming and error-prone operations. Giraph offers what is called a vertex-centric programming model. This model requires you to put yourself in the shoes of a vertex that can exchange messages with the other vertices in a sequence of iterations. None of the programming complexity of a large-scale parallel and distributed system is exposed to the user. The programmer concentrates on the problem-specific aspects of the algorithm, and Giraph executes the algorithm in parallel across the available machines transparently. In practice, this means extending a class and implementing a single function. Often, Giraph algorithms are fewer than 50 lines and do not require any concurrent primitives such as locks, semaphores, or atomic operations.

Important  Giraph provides a vertex-centric programming model that requires the developer to think like a vertex that can exchange messages with other vertices. The programming model hides the complexity of programming a parallel and distributed system. Giraph executes your code transparently in parallel across the available machines.
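
To give you a first taste of what this looks like in code, here is a simplified sketch of PageRank written against the BasicComputation API of Giraph 1.1; exact class names and details vary between versions and are covered properly in the following chapters. The sketch assumes vertex values are initialized (for example, to 1/N) by the input format, and it ignores dangling vertices with no outgoing edges for brevity.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// A simplified vertex-centric PageRank: each vertex stores its current rank.
public class SimplePageRank extends BasicComputation<
    LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                      Iterable<DoubleWritable> messages) {
    if (getSuperstep() > 0) {
      // Sum the rank contributions received from the vertices linking to this one.
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.setValue(new DoubleWritable(0.15 / getTotalNumVertices() + 0.85 * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Spread this vertex's rank evenly across its outgoing edges.
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}

Notice what is missing: there are no threads, locks, or message queues to manage. You reason about one vertex at a time, and Giraph takes care of partitioning the graph, delivering messages, and running the iterations across the cluster.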

Giraph was designed to compute graph analytics and social-network analysis. As you see in the next chapter, these are often executed as offline computations. This means Giraph applications usually sit in the back office, are run periodically over large datasets, and take minutes or hours. Giraph is designed for heavy-lifting tasks, not for quick, interactive queries.

Let’s take a quick look at an application architecture (in the broadest sense) and where Giraph would fit. Chapter 2 looks at this in more depth. Figure 1-3 shows such an architecture. As you can see, the architecture is divided into the three typical macro components: the front end, the back end, and the back office. The front end is where the application’s client-related components are running and includes the mobile application, the code that runs in the desktop browser, and so on. The back end is where the application servers, the databases, and all the (distributed) infrastructure run, in cooperation with the front end, to give the user a unified experience when they interact with the application. In the case of an application like Facebook, these two components are responsible for providing the data and the logic behind social-networking features like surfing your newsfeed, looking at your friends’ activity, and so on. In the back end are your pictures, your comments and likes, and the social graph. In the back office resides the logic that is executed periodically to compute and materialize offline the content that is used online by the application. Giraph lives in the back office and is executed to compute application logic like friend recommendations, ranking of activity items, and so forth. It is also used on demand by the data-science team to run analytics on user-activity data collected by the back end. For example, Giraph is used to execute analyses like the one mentioned earlier, which checks whether romantic partnerships can be inferred from the social graph through the social-dispersion metric.


Figure 1-3. An application’s architecture and where Giraph fits in

To summarize, Giraph is used to run expensive computations that are executed asynchronously with respect to the interaction between the user and the application, and for this reason it is said to live in the back office. The following sections explore how Giraph integrates with technology that resides in this component of the application architecture and how it differs from other tools that are positioned both in the back office and in the back end.

Important  Giraph resides in the back office. It is used to compute offline graph analytics that are run periodically on data collected from interactions between the front end and the back end, along with all the additional data that may be available in the back office.

GOOGLE PREGEL AND APACHE GIRAPH

Many Apache projects under the Hadoop umbrella are heavily inspired by Google technologies. Hadoop originally consisted of the Hadoop Distributed File System and the MapReduce framework. These two systems were inspired by the Google File System (GFS) and Google MapReduce, described in articles published in 2003 and 2004.4 In 2010, Google published an article about a large-scale graph-processing system called Pregel.5 Apache Giraph is heavily inspired by Pregel.

Giraph was initially developed at Yahoo! and was incubated at the Apache Foundation during the summer of 2011. In 2012, Giraph was promoted to an Apache Top-Level Project. Giraph is an (open source) loose implementation of Google Pregel, and it is released under the Apache License. In 2013, version 1.0 was released, a milestone that reflected both the stability of the project and the number of features added since its initial release.

Giraph enlists contributors from Facebook, Twitter, and LinkedIn, and it is currently used in production at some of these companies and others. Giraph shares with Pregel its computational model and its programming paradigm, but it extends the Pregel API and architecture by introducing a master compute function, out-of-core capabilities, and so on, and removing the single point of failure (SPoF) represented by the master. You see all these functionalities in detail throughout the book.

Giraph and the Hadoop Ecosystem

Hadoop is a popular platform for the management of Big Data. It has an active community and a high adoption rate, with yearly growth of around 60%,6 and a number of companies have made supporting enterprise Hadoop deployments their core business. The global Hadoop market is predicted to grow to billions of dollars in the next five years, and Hadoop skills are considered among the hottest assets in the ICT job market.7

Hadoop started as an implementation of the Google distributed filesystem and the MapReduce framework within the Apache Nutch project. Over the years, it has turned into an independent Apache project. Nowadays, it is much more: it has evolved into a full-blown ecosystem. Under its umbrella are projects beyond the Hadoop Distributed File System and the MapReduce framework. All these systems were developed to tackle different challenges related to managing large amounts of data, from storage to processing and more. Giraph is a relative newcomer to the Hadoop ecosystem.

Important  Although Giraph is part of the Hadoop ecosystem and runs on Hadoop, it does not require you to be an expert in Hadoop. You just need to be able to start a Hadoop machine (or cluster).

Here is a selection of projects related to Giraph. Do not be worried if you do not know (or do not use) some of these tools. The aim of this list is to show you that Giraph can cooperate with many popular projects of the Hadoop ecosystem:

  • MapReduce: A programming model and a system for the processing of large data sets. It is based on scanning files in parallel across multiple machines, modeling data as key-value pairs. Giraph can run on Hadoop as a MapReduce job.
  • Hadoop Distributed File System (HDFS): A distributed filesystem to store data across multiple machines. It provides high throughput and fault tolerance. Giraph can read data from and write data to HDFS.
  • ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Giraph uses ZooKeeper to coordinate its computations and to provide reliability and fault tolerance.
  • HBase: The Hadoop database. It is a scalable and reliable data store, supporting low-latency random reads and high write throughput. It has a flexible column-based data model. Giraph can read data from HBase.
  • Cassandra: A distributed database that focuses on reliability and scalability. Cassandra is fully decentralized and has no SPoF. Giraph can read data from Cassandra.
  • Hive: A data warehouse for Hadoop. It lets you express ad hoc queries and analytics for large data sets with a high-level programming language similar to SQL. Giraph can read data from a Hive table.
  • HCatalog: A table- and storage-management service for data created using Apache Hadoop. It operates with Hive, Pig, and MapReduce. Giraph can read data stored through HCatalog.
  • Gora: Middleware that provides an in-memory data model and persistence for Big Data. Gora supports persisting to column stores, key-value stores, document stores, and RDBMSs. Through Gora, Giraph can read and write graph data from and to any data store supported by Gora.
  • Hama: A pure bulk synchronous parallel (BSP) computing framework on top of HDFS for massive scientific computations such as matrix, graph, and network algorithms. Like Hama, Giraph was designed for iterative algorithms; unlike Hama, Giraph does not require additional software to be installed on a Hadoop cluster.
  • Mahout: A library of scalable machine-learning algorithms on top of Hadoop. It contains algorithms for clustering, classification, recommendations, and mining of frequent items. Giraph can run machine-learning algorithms designed specifically for graphs.
  • Nutch: Software for web search applications. It consists of a web crawler and facilities for ranking and indexing web pages. Nutch can use Giraph to compute the rankings of web pages.

As you can see, these projects are very different, from databases to processing tools. But they share a common characteristic: they are designed for a large scale. Giraph fits in the list by providing a programming model and a system for processing large graphs. You can think of Giraph as a MapReduce that is specific to graph algorithms. Giraph uses existing Hadoop installations by running as a MapReduce job or, where a YARN-based Hadoop installation is available, as an Apache Hadoop YARN (Yet Another Resource Negotiator) application. Giraph can read data from HDFS, HBase, HCatalog, and Hive, and it uses ZooKeeper to coordinate computation. Collaboration also happens in the reverse direction: as mentioned, Nutch can use Giraph to compute the rankings of web pages. You see more about how to integrate Giraph with the rest of the ecosystem throughout this book. Figure 1-4 shows the high-level architecture of Giraph and how it uses different Hadoop projects to work.


Figure 1-4. Architecture of Giraph and how it integrates with other Hadoop projects

Important  If you already have a Hadoop infrastructure in place or plan to deploy more in the future, Giraph will play well with most of it.

Giraph and Other Graph-Processing Tools

It is important to understand what distinguishes Giraph from other tools on the graph-processing scene in order to make the right decision about when to use it. Here’s a simple heuristic. If your computation requires touching a small portion of the graph (for example, a few hundred vertices) to return results within milliseconds, maybe because you need to support an interactive application, then you should look at a graph database like Neo4j, OrientDB, or the TinkerPop tools. On the other hand, if you expect your computations to run for minutes or hours because they require exploring the whole graph, Giraph is the right tool for the job. If your computations are aggregating values extracted from scanning text files or computing SQL-like joins between large tables, then you should look at tools like MapReduce, Pig, and Hive.

In particular, we want to stress the difference between a tool like Giraph and a graph database. You can use a proportion to understand the relationship between Giraph and a graph database like Neo4j: Giraph is to graph databases as MapReduce is to a NoSQL database like HBase or MongoDB (or a relational database like MySQL). In the same way MapReduce is usually used to run expensive analytics on data represented through tuples, Giraph can run graph-mining algorithms on data represented through a graph. In both cases, MapReduce and Giraph run analytics on the data stored in the databases. In both cases, the databases are used to serve low-latency queries: for example, to support an interactive application. What differs is the data model used to represent data. This analogy works well for presentation purposes, but don’t let it mislead you. Giraph can also process data stored in a NoSQL database, and MapReduce can do the same with graph databases.

Tip  You can think of the relationship between Giraph and a graph database as the following proportion: Giraph : graph database = MapReduce : NoSQL database.

This is a very simple yet effective heuristic. But let’s try to make things even clearer. The emergence of Big Data has created a set of new tools. Hadoop is an ecosystem that hosts some of them, but more are available. The landscape of these tools is usually divided into two groups: databases for storing data and serving queries with low latency, and tools for long-lasting computations such as analytics. These two groups of tools have different use cases and requirements, which lead to different design decisions. Databases focus on serving thousands of queries per second that require milliseconds (or a few seconds) to return results. Think of a database backing the content of a web page after a click. Tools for analytics focus on serving a few computation requests per minute, each requiring minutes or hours to return results. Think of computing communities in a large social network. Giraph positions itself in this second group, because it was designed to run long-lasting computations that require analyzing the entire graph with multiple passes. If you want to learn more about databases for graphs, look at systems like Neo4j, OrientDB, and the TinkerPop community.

Tip  If you are interested in learning more about graph databases like Neo4j and OrientDB, check out Practical Neo4j, by Greg Jordan (Apress, 2014).

All these tools specialize in various data models: key-value pairs, columns, documents, or graphs. Figure 1-5 provides a visual representation of some of them. Of these, graphs are probably the most peculiar. Compared to simple key-value pairs and columns, graphs have a more unstructured and interleaved nature. Moreover, graph algorithms tend to be explorative, where portions of the graph are visited to answer a specific query (compared to a bunch of specific key-value pairs, columns, or documents being retrieved). For these reasons, graph-processing systems require design decisions that take into account the specific characteristics of graphs and their algorithms. Running graph algorithms with general-purpose systems is possible—for example, on systems based on tuples, such as relational databases or MapReduce—but these solutions tend to perform worse.


Figure 1-5. Comparing other data models to a graph

Giraph was designed for graphs from the beginning. And it is not the only such tool; other systems similar to Giraph are GraphLab and GraphX. Of these, Giraph is the only one that runs transparently on a Hadoop cluster and has a large open source community. GraphX is perhaps the most similar to Giraph: although it offers a simplified API for graph processing, it requires you to implement your algorithm in Scala with an API different from (and arguably less simple than) the one provided by Giraph, it depends on Spark, and it is generally slower than Giraph. GraphLab (now known as Dato Core or GraphLab Create) used to be an open source solution, but it is now proprietary. Figure 1-6 gives you an idea of how Giraph compares to other data-management systems.


Figure 1-6. Positioning Giraph with respect to other data-management tools

To summarize, Giraph is a good tool for these tasks:

  • Analyzing a large set of connected data across a cluster of commodity machines (running Hadoop)
  • Running an algorithm that is compute-intensive and that processes your graph iteratively, perhaps in multiple passes
  • Periodically running an algorithm offline in the back office

And these are the tasks Giraph is not good for:

  • Running aggregations on a set of unconnected or unrelated pieces of data
  • Working with a small graph that can easily fit on a single machine
  • Running a computation that is expected to touch a small portion of the graph—for example, to support an interactive workload like a web application

Summary

Big Data and data analytics are opening new ways of collecting traces of user activity and understanding user behavior. Mastering techniques to analyze this data allows for better products and services. Looking at aggregations of many small chunks of data provides insights into the data, but examining the way these chunks are connected takes you a step further toward understanding your data.

In this chapter, you learned the following:

  • A graph is a neat and flexible representation of entities and their relationships. Graphs can be used to represent social networks, the Internet, the Web, and the relationships between items such as products and customers.
  • Processing large graphs introduces specific challenges that are not tackled successfully by traditional tools for data analytics. Graph analytics requires tools specifically designed for the task.
  • Giraph is a framework for analyzing large data sets represented through graphs. It has been designed to run graph computations reliably across a number of commodity machines.
  • Unlike graph databases, Giraph is a batch-processing system. It was designed to run expensive computations that analyze the entire graph for analytics, not to serve queries expected to return results within milliseconds.
  • Giraph is part of the Hadoop ecosystem. Giraph jobs can run on existing Hadoop clusters without installing additional software. Giraph plays well with its teammates, and it can read and write data from a number of data stores that are part of the Hadoop ecosystem.

Now that you have been introduced to the general problem of processing large volumes of connected data and to the ecosystems of Big Data and graph processing, you are ready to look at how to model data as graphs, how graphs can fit in real-world use cases, and how Giraph can help you solve the related graph-analytical problems.

________________________

1Charles Duhigg, “How Companies Learn Your Secrets,” New York Times, Feb. 16, 2012, www.nytimes.com/2012/02/19/magazine/shopping-habits.html.

2Mark Weiser, “The Computer for the Twenty-First Century,” Scientific American, 1991.

3A zettabyte is 1 trillion gigabytes. Source: John Gantz and David Reinsel, IDC, “Extracting Value from Chaos,” June 2011, www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf.

4Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” ACM SIGOPS Operating Systems Review, Vol. 37, No. 5, ACM, 2003; Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM 51.1 (2008): 107–113.

5Grzegorz Malewicz, Matthew H. Austern, et al., “Pregel: A System for Large-Scale Graph Processing,” Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, 2010.

6Wall Street Journal, CIO Report, “Hadoop There It Is: Big Data Tech Gaining Traction.” April 12, 2014. For WSJ subscribers, http://blogs.wsj.com/cio/2014/04/12/hadoop-there-it-is-big-data-tech-gaining-traction/.

7“Job Trends,” Indeed, www.indeed.com/jobtrends.
