CHAPTER 2


Modeling Graph Processing Use Cases

This chapter covers

  • Modeling your data as a graph
  • Differences between offline and online computations
  • Fitting Giraph in an application architecture
  • Use case for Giraph at a web-search company
  • Use case for Giraph at an e-commerce site
  • Use case for Giraph at an online social networking site

Giraph is more than a basic library for graph processing: it provides a programming model and a framework to solve graph problems through offline computations at large scale. To use Giraph, you need to understand how to model data with graphs and how to fit Giraph into a system architecture so it can process graph data as part of an application. For this reason, this chapter is dedicated to three topics. First, it introduces graphs and shows how you can use them to model data in a variety of domains. Then you learn the difference between offline and online computations, which helps you identify the types of processing for which Giraph is a good solution. The remainder of the chapter focuses on fitting Giraph into a system architecture; to achieve this goal, it presents three use cases based on real-world scenarios.

Graphs Are Everywhere

A graph is a neat, flexible structure to represent entities and their relationships. You can represent different things through a graph: a computer network, a social network, the Internet, interactions between proteins, a transportation network—in general terms, data. What do these examples have in common? They are all composed of entities connected by some kind of relationship. A computer network is composed of connected devices, a social network is a network of people connected by social relationships (friends, family members, co-workers, and so on), cities are connected by roads, and neurons are connected by synapses. The World Wide Web is probably one of the clearest examples: pages, images, and other resources are connected through the href attribute of links.

Definition  A graph is a structure to represent entities that are connected through relationships.

A graph can be used to represent these connected structures. It consists of vertices (or nodes) and the edges (or arcs, or links) that connect them. The two vertices connected by an edge are usually called the edge endpoints. The vertices connected to a specific vertex, through the respective edges, are called the vertex neighbors. An edge that connects a vertex to itself is called a loop. In diagrams, vertices are usually represented by circles and edges are represented by lines. Figure 2-1 shows these fundamental concepts; in its most basic form, there is nothing more to a graph.


Figure 2-1. Basic concepts of a graph

This section guides you through several examples of how to model different data using a graph. Each example introduces a number of graph concepts. Let’s start with a network of computers.

Modeling a Computer Network with a Simple, Undirected Graph

A computer network comprises a collection of computing devices such as desktops, servers, workstations, routers, mobile phones, and so on, connected by means of a communication medium. Figure 2-2 shows an example of a computer network. Ignore the technology or protocol that enables the devices to communicate (such as Ethernet, WiFi, or a cellular network), and focus on the fact that they are connected and can exchange data. This is the fundamental aspect that makes a computer network a network and a graph.


Figure 2-2. A computer network

It is natural to model a computer network with a graph by representing computers as vertices and by connecting two vertices through an edge if the two computers they stand for are connected. Clearly, there is not one single correct way to model a computer network through a graph; it is up to the modeler to decide, for example, whether hubs and switches should be represented in the graph, or which layer of the TCP/IP stack should be considered. To identify each device in a graph, you assign each vertex a unique identifier (ID); in the case of a computer network, you can use the device hostname or an IP address. Each data domain has a natural data element that can be used as an ID for vertices, such as a passport number for a social network, a URL for a web page, and so on.

Figure 2-3 shows a graph representing the computer network from Figure 2-2. In this example, the IP address of each computer is used as the ID of the vertex, and two computers are connected by an edge if they can communicate directly at the network layer (hence ignoring switches, hubs, access points, and so on).


Figure 2-3. A graph representing a computer network
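
To make the modeling concrete, here is a minimal sketch (plain Java, not Giraph code) of how an undirected graph like the one in Figure 2-3 could be held in memory as an adjacency list keyed by vertex ID. The class name and the IP addresses are illustrative assumptions.

```java
import java.util.*;

// A minimal undirected graph keyed by vertex ID (here, IP addresses).
public class SimpleGraph {
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    // Adding an undirected edge registers each endpoint as the other's neighbor.
    public void addEdge(String a, String b) {
        adjacency.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    public Set<String> neighbors(String id) {
        return adjacency.getOrDefault(id, Collections.emptySet());
    }

    public static void main(String[] args) {
        SimpleGraph network = new SimpleGraph();
        // Hypothetical devices that can communicate directly at the network layer.
        network.addEdge("192.168.0.1", "192.168.0.2");
        network.addEdge("192.168.0.1", "192.168.0.3");
        System.out.println(network.neighbors("192.168.0.1"));
    }
}
```

Giraph exposes a similar vertex-centric view of a graph, as you see starting in chapter 3.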

Modeling a Social Network and Relationships

Let’s move to another example: a social network. A social network consists of a number of individuals connected by some kind of relationship—friends, co-workers, and so on. You can easily represent a social network using the graph concepts introduced so far, with vertices representing individuals and edges connecting two vertices if the two individuals they stand for are socially tied. But how can you specify the particular type of relationship between two individuals? You can use a label. A label identifies the type of relationship, and it can be attached to an edge. Figure 2-4 shows a graph representing a social network of five individuals. The individual’s first name is the ID, and a label is attached to each edge to qualify the type of relationship.


Figure 2-4. A graph representing a social network

Until now, you have considered only symmetric relationships, such as friendship. However, edges can also have a direction. The direction defines the source vertex and the destination vertex of an edge, and an edge with direction is called a directed edge. All edges that have a specific vertex as their source are called its outgoing edges, and those that have that vertex as their destination are called its incoming edges. One of the differences between the Twitter and Facebook social graphs comes down to this aspect. On Facebook, a friendship relationship is symmetric (it requires confirmation by both parties), so it is better represented by an undirected edge. Twitter, on the other hand, lets you specify a follower relationship that does not have to be reciprocated; hence a directed edge is more appropriate. Again, whether a relationship should be modeled with direction depends on the relationship itself, and the decision is in the hands of the modeler. In diagrams, directed edges are usually represented by arrows. A graph with directed edges is called a directed graph, and a graph with undirected edges is called an undirected graph. Figure 2-5 shows the same network of five individuals, this time modeled as a directed graph.


Figure 2-5. A directed graph representing a social network

Modeling Semantic Graphs with Multigraphs

All the graphs so far allow only one edge between a specific pair of vertices in the undirected case (only one friend edge between Mark and Sarah) and two in the directed case (one edge labeled father from John to Paul and one labeled son from Paul to John). Multigraphs, on the other hand, allow multiple edges between the same pair of vertices, in both the directed and undirected cases, and these edges can also carry labels.

A good example of a labeled directed multigraph is a Resource Description Framework (RDF) graph. According to the RDF data model, information can be described through a series of triples, each composed of a subject, a predicate, and an object. Ignoring the specifics such as syntax notations, serialization, and so on, RDF can be used to represent knowledge about different domains. Imagine that you want to represent the fact that Rome is the capital of Italy. (“Rome”, “is capital of”, “Italy”) is a valid RDF triple to represent such information, where “Rome” is the subject, “is capital of” is the predicate, and “Italy” is the object of the triple. (“Resource Description Framework”, “has abbreviation”, “RDF”) is another example of a triple. If you think about it, each of these triples is nothing more than a labeled directed edge, where the subject is the source vertex, the predicate is the label, and the object is the destination vertex. Having such a general and flexible way of describing concepts unleashes your ability to represent pretty much anything that can be expressed through a triple. RDF graphs are often referred to as semantic networks or graphs (also making RDF one of the core components of the Semantic Web), because they are frequently used to describe the semantics of things through their relationships.

DBpedia is an example of such a semantic graph. DBpedia is an effort to represent the structured information in Wikipedia—for example, in the info boxes—in the form of a graph. Figure 2-6 lists a number of (simplified) triples from DBpedia. Constructing a graph from this table is straightforward: each triple can be represented by an outgoing edge, leaving from the vertex representing the subject and ending in the vertex representing the object. Subject and object labels can be used as vertex IDs, and predicate labels can be used as edge labels. Figure 2-7 shows such a graph.


Figure 2-6. A set of triples from DBpedia


Figure 2-7. A multigraph constructed from a set of triples from DBpedia

Notice how all the triples that have Barack Obama as a subject result in an outgoing edge for the vertex representing Barack Obama. Also, notice how Barack Obama has two outgoing edges (tenantOf and residence) toward White House. The latter is an example of the multigraph property mentioned before.
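
As a rough illustration of this mapping (not DBpedia's actual tooling), the sketch below turns each triple into a labeled, directed edge; keeping a list of outgoing edges per subject naturally allows multiple edges between the same pair of vertices, as in the Barack Obama example. The class and method names are hypothetical, and a Java record is used for brevity.

```java
import java.util.*;

// A labeled, directed multigraph built from (subject, predicate, object) triples.
public class TripleGraph {
    // One record per edge: the label is the predicate, the target is the object.
    record Edge(String label, String target) {}

    private final Map<String, List<Edge>> outgoing = new HashMap<>();

    public void addTriple(String subject, String predicate, String object) {
        // A List (rather than a Set) allows multiple edges between the same two vertices.
        outgoing.computeIfAbsent(subject, k -> new ArrayList<>())
                .add(new Edge(predicate, object));
    }

    public static void main(String[] args) {
        TripleGraph g = new TripleGraph();
        g.addTriple("Barack Obama", "tenantOf", "White House");
        g.addTriple("Barack Obama", "residence", "White House");
        System.out.println(g.outgoing.get("Barack Obama"));
    }
}
```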

Modeling Street Maps with Graphs and Weights

Street maps are good examples of graphs as well. Perhaps the first person who ever drew a map was also the first person who drew a graph. Modeling a street map with a graph is intuitive: cities, towns, and villages are modeled with vertices, and the roads and streets that connect them are modeled with edges. In general, any point where a different road can be taken is modeled with a vertex. Note that although streets cross, edges do not (they do when you draw them in two-dimensional space, but they do not conceptually). Hence, crossings must be explicitly modeled through vertices. (This will be clearer when you look at paths in chapter 4.) In a graph representing a street map, an edge can carry the road type (a highway or a local road) or the road number (or a combination of both) as its label, and direction can be used to model one-way streets. How can you model the length of a road? Clearly, the distance between two cities does not depend solely on the number of roads—and, hence, edges—that need to be followed, but also on the length of those roads.

You can use weights for this purpose. Weights are numerical properties of edges that can be used to represent quantitative properties of relationships. For example, they can represent the distance between two towns, the ranking of a movie by a user, the similarity between two users based on some profile data, or the strength of a social tie. Algorithms exploit weights to compute shortest paths, recommendations, and different metrics of the graph. You learn more about weights when you look at paths in chapter 4. Figure 2-8 shows a highway map of Italy and a portion of it modeled with a graph.


Figure 2-8. A transportation network modeled through a graph with weights
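
Weights fit into the same kind of representation by attaching a number to each edge. The sketch below uses hypothetical city names and distances; a shortest-path algorithm such as those in chapter 4 would sum these weights along a route.

```java
import java.util.*;

// An undirected, weighted graph: each neighbor entry carries the road length in kilometers.
public class RoadMap {
    private final Map<String, Map<String, Double>> roads = new HashMap<>();

    public void addRoad(String a, String b, double km) {
        roads.computeIfAbsent(a, k -> new HashMap<>()).put(b, km);
        roads.computeIfAbsent(b, k -> new HashMap<>()).put(a, km);
    }

    public static void main(String[] args) {
        RoadMap italy = new RoadMap();
        // Hypothetical distances, for illustration only.
        italy.addRoad("Rome", "Florence", 275.0);
        italy.addRoad("Florence", "Bologna", 110.0);
        System.out.println(italy.roads.get("Florence"));
    }
}
```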

Graphs are flexible structures that are natural to think about, and how you model them strongly depends on their purpose. Software engineers use these structures in one form or another on a daily basis. Think of an ER diagram, a UML diagram, or a dependency tree; these are all graphs. Most of us naturally draw graphs when discussing a design or an architecture during a brainstorming session. That is why graphs are also said to be whiteboard friendly. Most data structures are some kind of graph, from trees to linked lists. But data, data representations, and data structures are meaningful only when coupled with the algorithms that operate on them. The following sections and chapter 4 look at what you can do with graphs. Table 2-1 summarizes the concepts presented in this section.

Table 2-1. Core Concepts of Graphs

Name                        Description

Vertex                      An entity in a graph
Identifier (ID)             The unique identifier of a vertex
Edge                        A relationship between two entities
Edge label                  A qualitative property of an edge (for example, a type)
Edge direction              Specifies the source and destination vertices of an edge
Edge weight                 A quantitative (numerical) property of a relationship
Edge endpoints              The two vertices connected by an edge
Loop                        An edge connecting a vertex to itself
Neighbors / Neighborhood    The vertices connected to a specific vertex
Directed graph              A graph comprising directed edges
Undirected graph            A graph comprising undirected edges
Multigraph                  A graph allowing multiple edges between the same pair of vertices

Comparing Online and Offline Computations

Chapter 1 discussed Giraph with respect to graph databases and other tools for graph processing. You have seen Giraph positioned as a tool for graph analytics, and the differences between offline graph analytics and online interactive queries. This section digs deeper into these concepts. To better understand the differences between the two workloads, you explore an analogy with search in a filesystem with the Unix commands find and locate. After explaining the filesystem scenario, this section maps the concepts to Giraph and graph analytics.

One of the actions you probably perform often when using a computer is searching for files. For example, you may look for all the files whose names start with a particular prefix, or those created after a certain date. This is not a search based on the content of the files, which is performed differently. If you are a Unix user who uses the command line to perform searches, you probably know the find command. find traverses a filesystem tree starting from a particular directory and returns all the file names that match a given pattern. To perform the pattern matching, find must visit each file and directory under the starting root and apply the matching function. This can be a costly operation, in particular because it may generate extensive I/O. Moreover, if you perform two or more consecutive searches, find must start from scratch each time.

To avoid redoing all the work for each search, the locate command relies on its own database. It can be used like find to search for files matching a particular pattern. The database stores all the metadata about the files and directories in the filesystem in a format that makes searches efficient. As files are created, deleted, and updated, locate.updatedb is periodically run in the background to scan the filesystem and update the locate database. There is clearly a sweet spot that depends on how often the expensive indexing procedure runs in the background, the rate of file changes, and how often searches are performed. You do not want to re-index the database more often than the files are updated or searched, but you also do not want the database to contain stale metadata. For this reason, if you work with few files in your home directory and modify them often, it makes more sense to rely on the find command and search only in the home directory. Figure 2-9 shows the differences between the two commands.


Figure 2-9. Searches in a filesystem with find and locate

The point is that for certain operations, you do not want to wait a long time before you get results. Such applications typically interact with the user or with other applications that need to make decisions in a short amount of time. As in the previous example, traversing an entire filesystem is an expensive operation because it requires processing large amounts of data, and hence it calls for periodic background computations aimed at making the interactive ones faster. These computations produce indices and other data structures that allow the interactive operations to perform quickly. In the example, the locate command performs the interactive lookup in the database to return the list of matching files, while its companion command, locate.updatedb, runs periodically and updates the database in the background. The side effect of this approach is that the application may not always have access to the latest version of the data, because the data may have changed after the database was updated. In cases where fresh data is required, it is still necessary to compute the results at query time; but to return results quickly, the query must be a fast operation that touches a small portion of the data.

In graph terms, a query to be performed quickly might ask for the names of the friends of a particular user in a social network. It could also be something more complex that requires exploring a small portion of the graph beyond the neighborhood of a vertex. Think of a dataset such as IMDb. You could ask for the average age of actresses who starred in a movie set in France with Brad Pitt. Because the number of movies and actresses connected to Brad Pitt is quite small compared to the whole database of movies, this query should be answered quickly by a (graph) database. This is similar to using the find command in a specific small directory of the filesystem. However, if you want to ask for the degree of separation between Brad Pitt and any other actor in the movie industry—supposing two actors are connected if they appeared in the same movie—this cannot be computed quickly with a database the size of IMDb. It requires exploring the entire graph and reaching every other actor, just like exploring the entire filesystem starting from the / directory. If you want to query the degree of separation between any two actors frequently, and you want the results quickly, you have to compute all the results in advance and store them in a database. Once they are computed, they can be looked up at query time. This process is similar to the functioning of the locate command and its background command, locate.updatedb.
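
To see why the degree-of-separation question is an offline workload, consider a sketch of the breadth-first search it requires. The in-memory adjacency map here is an assumption for illustration; a graph the size of IMDb (let alone the Web) would not fit comfortably on a single machine, which is exactly the case Giraph targets.

```java
import java.util.*;

// Breadth-first search: the number of "hops" from a source actor to every other actor.
public class DegreesOfSeparation {
    public static Map<String, Integer> fromSource(Map<String, List<String>> graph, String source) {
        Map<String, Integer> distance = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        distance.put(source, 0);
        queue.add(source);
        while (!queue.isEmpty()) {
            String actor = queue.poll();
            for (String costar : graph.getOrDefault(actor, List.of())) {
                if (!distance.containsKey(costar)) {
                    // Every vertex and edge is touched once: the whole graph is explored.
                    distance.put(costar, distance.get(actor) + 1);
                    queue.add(costar);
                }
            }
        }
        return distance;
    }
}
```

Answering the question for every pair of actors means running this search from every source vertex, which is why such results are computed offline in advance and only looked up online.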

Following this analogy, the ecosystem of data processing is divided into online and offline systems: online systems are designed to compute queries that are expected to conclude within seconds or milliseconds, and offline systems are designed to compute analyses that are expected to end, due to their size, in minutes, hours, or even days (see Figure 2-10). These systems have roles similar to the example using find and locate. Online systems are generally used to build interactive applications. Typical examples of these systems are databases like MySQL, Neo4J, and so on. They back applications that serve web applications, enterprise resource planning (ERP) systems, or any kind of software or service that can receive many short requests per second. Offline systems are generally used to compute analytics and batch computations. Batch computations are jobs that process batches of data without the need for human intervention. As mentioned earlier, Giraph is of this kind.


Figure 2-10. Difference between online and offline workloads

Now that you know the difference between online and offline systems, you are ready to see the architecture of an application that includes both and how each type of system fits into this architecture. By looking at the architecture, you can learn how to position Giraph in the back office.

Fitting Giraph in an Application

As you saw in the previous section, online and offline systems can work together, and both are often part of an application. This section presents a stereotypical architecture that is refined with more practical examples in the following sections. By the end of this section, you will know how to position Giraph in your own architecture. Figure 2-11 shows a general architecture for an application composed of a number of different systems; let’s zoom in on the components that are part of it.


Figure 2-11. System architecture for a stereotypical application

The architecture is divided into three main components:

  • Front end: The part of the system through which clients interact with the application. Depending on the client, the front end can insert data either directly into the database or indirectly by interacting with the application server, which inserts data into the database as a result. The application server inserts data into and reads data from the database to serve content to the front end; this is why a double arrow connects the database and the application server. Clients here are not only humans but also sensors, other applications, and so on.
  • Back end: The part of the system where the application servers and database reside. The data and logic of the application and the online systems live here. The application servers compute the replies for client requests, depending on the type of request, the application logic, and the related data stored in the database. The back end can have multiple types of application servers and databases, depending on the number of applications running, the different architectures, and so on. The back end typically includes a plethora of different systems and technologies.
  • Back office: The part of the system that is less interactive. It is also the component where human intervention is more involved. For example, in the back office, the teams responsible for data entry insert data into the database, and the data teams compute analytics to study the collected data and get insights about customer behavior. The offline systems reside here as well. Data collected from application logs in the form of raw logs, or stored in a structured format in the database, is crunched to compute analytics or to materialize new information to update the collections in the database. The materialized information is precisely the kind of data that is too expensive to compute online, and hence those results are computed periodically and cached in the databases in the back end.

Let’s take a moment to look at an example and better understand how these three components of the architecture—and the systems that compose them—interact. The example uses a simple application for the sake of presentation; it is by no means the “best” possible architecture for this application, so bear with us. The definitions and borders in these kinds of architectures are also a bit fuzzy, so concentrate on the bigger picture.

Imagine a user interacting with a news site through her mobile phone. When the browser loads the homepage for the first time, it receives an HTML page, including a JavaScript script, from the web server. The browser, the HTML page, and the script compose the front end. As the user scrolls the page and clicks news items, the script in the front end issues requests to the application server. Depending on the requests resulting from the user actions, the application server retrieves data from the database and different content stores: the content of the article and related multimedia. It then sends the data back to the browser, which updates the user interface to display the news items. The application servers, the database and data stores, and possibly caches and message brokers compose the back end and run on the servers of the news site. Figure 2-12 shows the interaction between the front end and the back end.


Figure 2-12. Low-latency interaction between front end and back end

When the browser requests the news items the client has clicked, the application server fetches and sends, along with the news items, recommendations for similar news articles. This feature helps the user find similar content and surf the news site. Finding similar news articles is too computationally expensive to compute online at each client request, so recommendations are computed periodically offline (perhaps at night) and stored in a dedicated data store. The recommender system that computes the “similar articles” feature is in the back office, along with operations related to the providers of the news content, such as journalists and editors. To find similar articles, the system looks at the application server logs, using the “those who read this article also read those articles” paradigm (processing the clicks in the logs). Hence, the back office is connected to the back end to retrieve user-behavior data to compute the recommendations, and to store the recommendations to be served to the user. Figure 2-13 shows the interaction between the back end and the back office. Note that interaction between the front end and the back end, and interaction between the back end and the back office, happen independently and at different time scales.


Figure 2-13. Periodic interaction between the back office and the back end

Giraph is positioned in the back office. It can be part of a number of pipelines, assuming the application is involved with graph data. A pipeline is nothing more than a sequence of jobs, where the output of the previous job acts as the input to the next one. Different pipelines can be computed in parallel or sequentially, depending on the workflows and the requests of the different teams. In fact, there are multiple pipelines of batch jobs executed in the back office, some to support certain application features, and others, for example, to support analytical queries from data scientists. A lot depends on the kind of results demanded at any given moment and, of course, on the applications. However, a general pattern—a common set of steps—can be identified. The following sections look at these steps in the context of real-world scenarios.

Figure 2-14 shows this general pipeline for applications that involve graph data. The pipeline consists of the following steps.

  1. Input: Data is collected from different sources. It is stored in different formats, sometimes following structured schemes, sometimes in simple formats such as raw text. This step of the pipeline can be implemented by storing data in a database or log files in a filesystem. In large architectures, it may require the deployment of a distributed system to collect logs streaming in real time from the application servers. In the “similar news articles” application, the inputs can be the articles themselves (for example, if articles are recommended because they have similar content), click logs coming from the application servers (if clicks are used to find similar articles), or additional metadata provided by the content provider (article categories, tags, and so on).
  2. Construction: Data is collected in various formats and schemes, and it has not yet been processed to represent a graph. This data needs to be filtered, transformed, and aggregated until a graph is extracted. This step depends both on the type of data to be processed and on the algorithms to be computed in the following steps. Different graph algorithms often require different graph representations and different data types. In the news site example, articles can be represented by vertices, and two articles can be connected with an edge any time a user clicks from one article to the other. The edge weight can represent the (normalized) number of times this has happened, and a threshold can be applied to edge weights to avoid the graph becoming too dense. (A small sketch of this construction step follows Figure 2-14.)
  3. Computations: Once the graph is constructed, it can be processed with Giraph. In this phase, one or multiple jobs are executed. The type and number of graph computations strongly depend on the mining application. Usually, graph algorithms are layered and built on top of each other. For example, some ranking algorithms use centrality measures, and those depend on the computation of shortest paths. Often the graph computed by one job is the input of the following one. For the news articles, two computations could be performed on the constructed graph. To compute the most popular articles, you could run PageRank on the clicks graph and use the rankings to sort the recommendations (such as most popular first). To compute which articles are related to one another, you could run a community-detection algorithm on the clicks graph. Articles falling in the same community could be considered related and hence be recommended. You see both algorithms in chapter 4.
  4. Fusion: The results of the computations performed in the previous step are transformed and joined, or fused, with other data, possibly coming from the database or from other pipelines. Keep in mind that graph computations are often part of a bigger application and hence may be solving only a sub-problem. Moreover, this step depends on the type of analytics being performed and on the type of materialization necessary in the databases. Fusion may actually require multiple steps, because the results of the computation may be reused for different reports and databases. For example, the rankings of the articles and the categories computed through the community-detection algorithm could be joined with the article metadata to prepare the injection into the recommendations data store.
  5. Output: Once the data has been fused and packaged into a report or a materialized view, it is sent to the teams or injected into the respective databases. In the case of a report, the output can be a summarized view of the results; in the case of a materialized view, the output can be a data table to be injected into a database. In the latter case, the output phase can produce significant load on the databases where the data is injected. For this reason, the injected data should be in a format that does not require further expensive manipulation by the database system. In the news article example, data can be injected into the recommendations data store following the schema of the store.


Figure 2-14. Data-processing pipeline in the back office
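
As promised in step 2, here is a rough sketch of how the construction step might turn click-log pairs into a weighted article graph. The log format (pairs of article IDs), the threshold, and the class name are assumptions; in a real back office this step would typically run as a MapReduce or similar batch job rather than as in-memory code.

```java
import java.util.*;

// Builds a weighted "articles clicked together" graph from (fromArticle, toArticle) click pairs.
public class ClickGraphBuilder {
    public static Map<String, Map<String, Long>> build(List<String[]> clickPairs, long threshold) {
        Map<String, Map<String, Long>> edges = new HashMap<>();
        for (String[] pair : clickPairs) {
            // pair[0] = article the user was reading, pair[1] = article clicked next.
            edges.computeIfAbsent(pair[0], k -> new HashMap<>())
                 .merge(pair[1], 1L, Long::sum);
        }
        // Drop rare transitions so the graph does not become too dense.
        edges.values().forEach(targets -> targets.values().removeIf(count -> count < threshold));
        return edges;
    }
}
```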

Now that you have learned how Giraph fits in a system architecture, the following sections look at specialized architectures and pipelines for various real-world scenarios.

Giraph at a Web-Search Company

What is the core product of a company working in the web-search business? Its search engine: in the most general terms, a system that provides results for web searches, searching the Web for specific content as specified by the user through a query. However, when a user searches for pages through a query, the search engine does not surf the Web looking for pages that match that query. The Web is huge, and it would take days or weeks for such a search to return meaningful results. Instead, the search engine maintains an index of the content of (a portion of) the Web, which is used to answer queries in a reasonable time (milliseconds). The problem is very similar to the one solved by the find and locate commands; in fact, a search engine works analogously to locate. A search engine consists of a program that periodically updates a “database” (the index) of the content of the Web, and a program that searches the database for content matching a query, which is used by the search page (like the famous Google “white page” at www.google.com). The core of a search engine is composed of two main components: the crawler (also called a spider) and the index. The crawler is a program that constantly surfs the Web, jumping from page to page and following the links in the HTML of each page it visits. When the crawler reaches a new page or an updated version of a page it visited in the past, it sends the page to the indexer, which is responsible for keeping the index up to date with page content. Without getting into the details of how an index works, you can think of it as a map that links words to lists of URLs of pages that contain those words. Given a specific word, the map stores the URLs and the number of occurrences of that word in those pages. Figure 2-15 shows these components.


Figure 2-15. The components of a web search engine: a crawler, an index, and a search page

To create the index, the indexer goes through each page passed to it by the crawler, divides the text in the page body into words, counts them, and updates the index accordingly. How does a search in the index work—for example, for a query expressed as a set of keywords? Suppose you submit the query Hello Giraph. The index is searched for each keyword and returns the URLs http://www.foo.com, http://www.bar.com, and http://www.baz.com, because those pages contain both or only one of the keywords (see Figure 2-16). A page where the keywords appear more frequently than they do on another page is considered more relevant. The order in which search results are presented is also known as ranking. The search page returns the results in this order because they are sorted by relevance (three occurrences for http://www.foo.com, two for http://www.bar.com, and one for http://www.baz.com). Keep in mind that the raw number of occurrences of the keywords on a page is a simplistic metric of relevance. More sophisticated metrics consider, for example, document length and other parameters (look up TF-IDF to learn more). Indexers also use other techniques such as stemming, stop-word filtering, anchor text, and so on, but this example keeps things simple for the sake of clarity.


Figure 2-16. An index generated from three web pages
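
A toy version of such an index is easy to sketch in plain Java. Real indexers add stemming, stop-word filtering, TF-IDF-style scoring, and far more compact data structures, so treat this only as an illustration of the word-to-URLs map described above; all names here are hypothetical.

```java
import java.util.*;

// A toy inverted index: word -> (URL -> number of occurrences of the word on that page).
public class ToyIndex {
    private final Map<String, Map<String, Integer>> index = new HashMap<>();

    public void addPage(String url, String body) {
        for (String word : body.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(word, k -> new HashMap<>()).merge(url, 1, Integer::sum);
        }
    }

    // Scores each URL by the total occurrences of the query keywords, most relevant first.
    public List<String> search(String query) {
        Map<String, Integer> scores = new HashMap<>();
        for (String word : query.toLowerCase().split("\\W+")) {
            index.getOrDefault(word, Map.of())
                 .forEach((url, count) -> scores.merge(url, count, Integer::sum));
        }
        List<String> result = new ArrayList<>(scores.keySet());
        result.sort((a, b) -> Integer.compare(scores.get(b), scores.get(a)));
        return result;
    }
}
```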

This is more or less how a very basic search engine worked before Google and its PageRank algorithm. You might have noticed that the structure of the Web is used by the crawler only to discover pages to visit; it is not used for anything else, such as computing rankings. In such a search engine, rankings are computed based only on the relevance of each web page. This is why search engine optimization (SEO) techniques before Google were based on polluting web pages with long lists of keywords hidden from the human user: SEO would “pump up” a page’s rank for many different search queries. One of the things that made Google so successful in the beginning was its approach to ranking web pages. The PageRank algorithm uses the structure of the Web to rank pages. The intuition behind PageRank is very simple: the more pages link to a page, the higher that page’s rank should be. Moreover, the higher the rank of the pages that link to a page, the higher the rank of that page. It is a kind of meritocratic method that ranks pages according to their popularity: if many popular pages link to a specific page, then that page must be popular and provide good content. All the necessary information lies in the link structure of the web pages. Gaming this scheme, although possible, is much more difficult, because it requires creating pages with high popularity and adding links from those pages to the page whose rank you want to raise.

Chapter 4 gets into the details of how PageRank works and how to implement it with Giraph. For now, note that it is exactly the kind of graph algorithm that fits perfectly with an iterative computation in Giraph. The index can be built for an extremely large number of pages with a system like MapReduce, and the page rankings can be computed with Giraph. Both systems fit in the pipeline executed in the back office, along with the crawler. The only difference from the search engine architecture described so far is the addition of a Giraph job to compute rankings. Once the rankings are computed, they can be used, together with the relevance of the pages, to decide the order in which pages are presented to the user as search results. Today, search engines like Google’s consider many more signals to compute page rankings, for example social media, click statistics, and user profiles. Still, the PageRank algorithm shows how to look at data as a graph and at a solution as a graph algorithm.
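
To give a feel for what such a computation looks like, here is a condensed sketch in the spirit of the PageRank example that ships with Giraph. The programming model and the BasicComputation API are covered in chapter 3, and the full PageRank treatment in chapter 4; the class name, the iteration limit, and the 0.85 damping factor here are the usual conventions, not code from this book.

```java
import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Each vertex is a page; its value is its current rank; messages carry rank contributions.
public class PageRankSketch extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() >= 1) {
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      // Standard PageRank update with damping factor 0.85.
      vertex.setValue(new DoubleWritable(0.15 / getTotalNumVertices() + 0.85 * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Spread this page's rank evenly over its outgoing links.
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}
```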

Now, let’s look at the pipeline in Figure 2-14 for the case of computing PageRank for a search engine with Giraph:

  • Input: Input comes directly from the crawler, so it depends on the way the crawler stores the pages it has fetched so far. If you want to update the current PageRank instead of computing it from scratch, the current values for the graph can be used as input as well.
  • Construction: The crawler can store the link graph separately from the pages as it parses them, or it may need to go through the pages and extract the links from the HTML to build the graph. Invalid URLs must be filtered out, and the remaining ones normalized and de-duplicated. Often, URLs are transformed, with the domain converted to reverse-domain notation.
  • Computations: Once the graph is transformed, it can be loaded into Giraph and the computation can start. During this phase, the PageRank is computed for the loaded graph. Other computations can also be performed on the graph, such as computing related pages through other specific graph algorithms.
  • Fusion: The index was previously built through a MapReduce job. You may want to add the PageRank directly to the entries for each URL in the index. This would let you fetch URLs at query time, along with relevance and PageRank values, to compute the order in which to present the results.
  • Output: The updated index data with the PageRank values is injected into the index store so it can be used directly by the search page.

This is just an example of a processing pipeline, and it depends on the features supported by the search engine. The page indexing, which we assumed is performed through MapReduce, can be considered part of this pipeline, executed right before the PageRank computation, or a pipeline of its own. Now that you have seen how Giraph can help a web-search company, let’s move to another scenario.

Giraph at an E-Commerce Company

Users generally do two types of things on an e-commerce site: search for something specific they want, which probably brought them there in the first place; or click around and stumble onto something that catches their attention and that they might decide to buy. Think about one of your browsing sessions on Amazon. To a certain extent, the infrastructure to support the first type of usage is similar to that of a web-search company. To support search, product data is indexed so that it can be found through queries. But because the data is already stored in the company’s databases, it does not need to be discovered through a crawler. The moment the data is inserted into the site, it is processed and added to the index directly. The second type of usage is more similar to surfing the Web. From link to link, the user explores the available products on the site, looking for something they like. This means the links on a product page must be carefully selected to stimulate a user click and increase the likelihood of the user finding something interesting. But what makes a link a good link for a product page? What links would be effective on the page for The Dark Side of the Moon? Other records by Pink Floyd, probably, or something that fits the taste of a fan of that record. And what is the taste of such a fan? It is complex to delineate, and it depends on different factors.

Recommender systems are designed to recommend items to users (and sometimes users to users), such as a product that might be interesting to them. In the current scenario, an effective recommender system generates links for a given product so a user can easily find something interesting without needing to search for it explicitly. Although this task can be performed by humans, such as expert clerks who place similar records close to each other on the shelves of a store, it can be effective to let a computer program perform it: with millions of products, having humans organize recommendations does not scale, and a data-driven approach should produce less biased results.

What does all this have to do with Giraph? It turns out that e-commerce site data can be modeled through graphs. An e-commerce site has two types of entities: products and customers (or users). You can model both of them with a vertex. Technically, they can be represented by two types of vertices, one for each type of entity (for example, visualizing users with circles and items with squares). Edges connect vertices when customers buy products. Because edges always connect two vertices of different types, this type of graph is also known as a bipartite graph. Figure 2-17 shows an example. Sites can also support product ratings—for example, using stars. Ratings can be easily modeled with edge weights, where the weight is set to the rating given to a product by a customer. This graph data can be analyzed following the popular “people who bought this item also bought these other items” pattern. By performing this analysis, you can automatically discover the profiles of buyers of certain products, using customers’ purchase history. You can also answer the question of which items might be interesting to a customer who bought The Dark Side of the Moon. This profile data is often referred to as a latent profile, because it is not directly observed but rather is inferred from user behavior. This approach is also known as collaborative filtering, because users collaborate (implicitly) in building filters that organize the item-recommendation database.


Figure 2-17. E-commerce site data represented by a graph with users, items, and ratings
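
The “people who bought this item also bought” pattern can be sketched directly on this bipartite graph: starting from an item, follow edges to the customers who bought it and then to the other items in their purchase histories (two hops). The sketch below assumes a simple in-memory map of purchases per customer; class and method names are hypothetical.

```java
import java.util.*;

// "People who bought this item also bought": counts co-purchases of other items
// by the customers who bought the given item.
public class AlsoBought {
    public static Map<String, Long> coPurchases(
            Map<String, Set<String>> purchasesByCustomer, String item) {
        Map<String, Long> counts = new HashMap<>();
        for (Set<String> basket : purchasesByCustomer.values()) {
            if (basket.contains(item)) {
                for (String other : basket) {
                    if (!other.equals(item)) {
                        counts.merge(other, 1L, Long::sum);
                    }
                }
            }
        }
        return counts; // In graph terms: the two-hop neighbors of the item vertex.
    }
}
```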

Collaborative filtering is an effective technique to build recommender systems. However, it is not the only one. Content-based recommenders follow a different approach. Instead of looking (only) at purchase history, you can recommend products based on qualitative evidence. For example, imagine having metadata about music albums such as genre, year of release, place of recording, mood, style, tags, and so on. Based on this metadata, you could define a similarity metric to measure how similar two albums are—for example, whether they belong to the same genre and were recorded in the same period and country. Once the similarity between items has been established, you can build a graph in which similar products are connected by an edge and the weight of the edge encodes how similar those items are. Figure 2-18 shows an example of such a graph. With this information, if a customer buys The Dark Side of the Moon, the system can recommend other albums that have similar metadata (other psychedelic rock albums from the 1970s), maybe also considering other products the customer purchased in the past. This approach is similar to having the aforementioned expert clerk organizing the shelves of the shop for you, but in a more personalized and scalable way. Depending on the type of products, content-based similarity can be computed even without metadata. For years, scientists have been researching methods to recognize similarities between books by analyzing their text, songs by analyzing their audio signal, and images by analyzing shapes and colors. However, this goes beyond the scope of large-scale graph processing.


Figure 2-18. E-commerce site data represented by a graph from metadata

Both approaches have pros and cons. The advantage of collaborative filtering is that it does not require any type of additional data other than the purchase history. Moreover, it can compute recommendations across different genres of products. For example, it can recommend books based on purchase data about movies and music albums (maybe people who buy albums of metal rock music also buy black t-shirts). On the other hand, content-based recommendations do not suffer from the so-called “bootstrap problem,” because items can be recommended from day one without the need for historical data. Often, the two approaches are combined, with content-based recommendations used in the early days and the system gradually migrated to collaborative filtering. This book is not dedicated to building recommender systems; other books cover that topic. The point is that it is possible (and often desirable) to model e-commerce data with graphs and recommendation algorithms as algorithms that explore them. These graphs can get very large, so you can use Giraph to compute graph algorithms to build a recommender system for an e-commerce site. Chapter 4 presents an algorithm that serves this purpose; you can use it as a building block for your particular data model and use-case scenario.

A recommender is only an example of how Giraph can be used to analyze data from an e-commerce site. Analyzing customer behavior can give you general information about the effectiveness of marketing campaigns or site organization, among other things.

In general, the processing pipeline for an e-commerce site may look like the following:

  • Input: The input comes from historical purchase data, from raw web server logs (such as for the click graph), from a structured database in the case of content-based recommendations, and so on.
  • Construction: The graph can be constructed by using products and customers as vertices and connecting them with purchase data or with click information. Usually, data is normalized to create weights—for example, the number of clicks on each page normalized by the total number of clicks. The similarity between items can also be computed according to different criteria—for example, through non-graph models via MapReduce jobs or Apache Mahout.
  • Computations: Once the graph is transformed, it can be loaded into Giraph and the algorithms can be computed. In the case of recommendation algorithms, the latent profile for each user is computed. These latent profiles can be used to compute predictions of ratings and recommendations in another step of the pipeline, perhaps through MapReduce.
  • Fusion: The latent profiles and recommendations can be fused with structured data to produce new product pages with updated links to similar products. They can also be loaded into the database to personalize the recommendations on each user’s home page. If recommendations are computed online based on latent profiles computed offline, the profile data is prepared to be loaded into the online system.
  • Output: Recommendations are sent to the recommender data store, and web pages are updated with new links.

You have now seen how Giraph can fit into an architecture to support the workload of an e-commerce site, in particular with an eye to building a recommender system. The next section focuses on a final scenario: an online social networking company.

Giraph at an Online Social Networking Company

One of the reasons for the success of social networking sites is that they allow us to connect with people and share information about what we do and what we like. Why we connect with people depends on the site. For example, on Facebook, users tend to connect with people they already know in real life, whereas on Twitter they connect with people they are interested in, and on Flickr they connect with photographers whose work they like. Regardless of the reason, the connections created with other individuals generate an interesting social fabric. Representing these networks as graphs and processing them through graph algorithms allows you to study the underlying social dynamics reflected in the graph.

Facebook published an image in December 2010.1 Although the image resembles a satellite picture of the world at night (and that is what makes it so striking!), it is not. The picture presents a portion of the Facebook social graph. To draw this image, Paul Butler analyzed the friendship relationships of the users of Facebook and visualized the data according to the following criteria:

  • For each pair of cities, Paul counted the number of friends between the two cities and connected the two cities with an edge.
  • For each edge, he computed a weight as a function of the distance between the two cities and the number of friends between them.
  • He placed the edges according to the geospatial information of the cities they connected. In other words, he placed each vertex (a city) according to its position on the map, but without visualizing it, hence visualizing only edges.
  • For the color of each edge, he used a shade between blue and white, with the highest weights mapped to white.
  • Long edges, like those connecting cities in different continents, were drawn with arcs around the world, to minimize overlapping and increase readability.

This example clearly shows that individuals on Facebook are connected by geographical relationships, because people connect on Facebook mostly with people they hang out with in real life (or used to). The description of the criteria used to analyze the data and compute the edge weights for the visualization is in practice a graph algorithm.

One of the most interesting properties of social networks is their division into communities. In social network analysis (SNA), a community is a group of individuals who are more tightly connected with each other than they are with individuals outside the community. Intuitively, you can think of a community as a cluster. The graph representing a social network has particular properties due to this division into communities, and graph algorithms can take advantage of these properties to detect communities. Figure 2-19 shows a portion of one of the authors’ communities, extracted from his LinkedIn account and visualized through the LinkedIn Maps tool.2 Here, each vertex represents a user on LinkedIn, and an edge connects two vertices if the two users they stand for are connected on LinkedIn. A community-detection algorithm was run on this graph, and the vertices were colored according to the community the algorithm assigned them to. Note that the algorithm did not use the information contained in each user’s profile, such as past jobs, to detect communities. Instead, it used only the information contained in the graph about how users are connected. The algorithm successfully detected communities: each color is indeed mapped to one of the present or past communities the author is a member of, such as his previous and current co-workers, the Apache Giraph community, and so on. Chapter 4 presents an algorithm to detect communities with Giraph. Interestingly, the graph-layout algorithm used to draw the graph is also a graph algorithm. It computes the position of the vertices in two-dimensional space such that connected vertices appear close together and edge crossings are minimized (for readability). The layout algorithm uses only the connections in the graph, and the fact that vertices with the same color (same community) appear together is another sign of the strong community structure in the graph.


Figure 2-19. Communities in the social network of a LinkedIn user

The definition of community is not strict and not unique. Many real-world communities overlap, because each of us belongs to multiple communities at the same time. For example, on Facebook you may be connected to your school friends but also to your co-workers, your football teammates, and so on. Moreover, communities are often organized in hierarchies. You may be a member of your school community but also of the inner communities of your particular class, the school band, the basketball team you were a member of, and so on. This means a community-detection algorithm has to be tuned to the particular use case.

Why would a social networking site be interested in detecting communities? First, to study its users. By studying the communities of a social network, it is possible to understand how the users are organized and connected. This lets you target specific group behaviors and analyze the effectiveness of specific features. A site can also support features such as “people you may know.” For example, Figure 2-20 presents a community of friends. By using the friendship relationships, the site can help users connect with old friends. But communities are not useful only for social networking sites. Think of the recommender system for the e-commerce site in the previous section. The algorithm presented in this book to compute recommendations generates profiles of customers and items that allow you to predict ratings. It basically creates a function that, given a customer and any item, predicts the rating the customer would give to that item based on past ratings. What is still required is the “matchmaking”—finding the items that are predicted to receive a high rating. The naive solution to this problem is to match the profile of a customer with all unrated items and keep the ones with the highest predicted rating. However, this solution would mean matching every customer with every item, which would require a massive amount of computation and would be infeasible. Another solution is to look at a customer’s communities and match their profile only with items in those communities. Customer-item graphs, such as those presented in the previous section, tend to be organized into communities, usually delineating genres or tastes. It is very likely that people who like rock music will buy similar albums. This community structure tends to be even stronger if customers are allowed to connect on the site by creating an explicit social network. Using a community-detection algorithm on the customer-item graph is, in effect, a clustering computation.


Figure 2-20. Recommendation of a friend based on a community structure
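
In its simplest form, a “people you may know” suggestion can be sketched as counting mutual friends, that is, two-hop neighbors in the friendship graph; community detection is a more robust, large-scale refinement of the same idea. The in-memory friendship map and the names below are illustrative assumptions.

```java
import java.util.*;

// Suggests non-friends of a user, ranked by the number of mutual friends (two-hop neighbors).
public class PeopleYouMayKnow {
    public static Map<String, Long> suggestions(Map<String, Set<String>> friends, String user) {
        Map<String, Long> mutualCounts = new HashMap<>();
        for (String friend : friends.getOrDefault(user, Set.of())) {
            for (String candidate : friends.getOrDefault(friend, Set.of())) {
                boolean alreadyFriend = friends.get(user).contains(candidate);
                if (!candidate.equals(user) && !alreadyFriend) {
                    // Each occurrence of a candidate corresponds to one mutual friend.
                    mutualCounts.merge(candidate, 1L, Long::sum);
                }
            }
        }
        return mutualCounts;
    }
}
```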

Rankings can also be applied to social networks. Think of Twitter. On social media sites, users are usually ranked by their influence. Intuitively, a user with high influence is a user whose actions and behaviors reach deeply into the graph, effectively increasing the likelihood of influencing other users. Users with high influence are called influencers. On Twitter, influencers can be identified by looking at how their tweets spread across the network of followers, and by looking at actions such as retweets and mentions. You may want to give priority to influencers in the list of recommended individuals to follow and give their tweets a higher probability of being shown on a user’s timeline. Usually, algorithms to compute influencers use the same paradigm as PageRank (one of these algorithms is called TweetRank), but they also consider temporal aspects, such as how quickly tweets spread.

Giraph fits in the back office of a social networking site in the pipeline that analyzes the social network. The different steps could be as follows:

  • Input: The input comes from the different data stores where the social relationships and profiles (gender, age, country, and so on) are stored, and from raw web server logs (such as for the click graph). Which data is used depends on the type of application.
  • Construction: Selecting a portion of the graph, filtered by certain types of relationships and profile characteristics, is often called a projection of the graph. For example, the social graph of male professionals living in the United States is a projection of the LinkedIn graph. The projection is constructed by filtering out unwanted data and normalizing edge weights. Often, different projections must be computed for different algorithms; other times, different projections have to be merged in later steps of the pipeline.
  • Computations: The graph is used to compute communities, rankings, influencers, and so on. The different computations are executed and pipelined according to the semantics of the analysis; for example, rankings can be used to compute influencers. At this point, for summarization and reports, data can be aggregated according to criteria such as membership in a community or profile data—for example, the average number of connections for male professionals in the US. Aggregations can be computed in Giraph, through tools like Hive, or directly with MapReduce.
  • Fusion: The results of the computations are used to build the materialization data to be injected into the stores. Communities and rankings are used to build friend recommendations, and influencer rankings are used to recommend data items, depending on the sources.
  • Output: Materialized data items are inserted into the databases or sent to the analytics teams.

Again, these are just some examples of features of an online social networking application. Through the examples provided so far, you have learned how to position Giraph in your architecture and how to integrate it with your existing data-processing pipeline.

Summary

Graphs are everywhere, and you can use them to describe many things in many different domains. By looking at data and problems through graphs, you can gain a better picture of your applications and products. Giraph helps you process this data at scale, as part of your application architecture.

In this chapter, you learned the following:

  • Graphs are very simple structures, and they can be used to describe a number of different things.
  • Depending on the type of data, you can use labels, directed edges, weights, and so on to represent specific features of your data.
  • Queries that are computationally expensive but need to return results within milliseconds can be served with a combination of online and offline computations.
  • Giraph is an offline graph-processing engine that sits in the back office of an application architecture.
  • You can use Giraph to compute rankings, recommendations, communities, and so on in cooperation with the other systems in your architecture.

Now that you have seen how to model your data with graphs and how to fit Giraph into your application, you are ready to look at how to program Giraph. The next chapter discusses the programming model and the API.

________________________

1Paul Butler, “Visualizing Friendships,” Facebook, www.facebook.com/note.php?note_id=469716398919.

2http://inmaps.linkedinlabs.com.
