Property graphs

Before we move on to introduce GraphX as a graph processing engine, let's look at an extension of graphs as we have seen them before. We have already considered labeled graphs as a convenient way to name vertices and edges. In general, the graph data we will consider in the applications will have more information attached to both vertices and edges, and we need a way to model this additional information within our graph. To this end, we can utilize the notion of property graphs.

From the basic definition of a graph as a pair of vertices and edges, it is not directly possible to attach additional information to the two structures. Historically, one way to circumvent this is to blow up the graph and create more vertices corresponding to properties, connected to the original vertices by new edges that encode the relationship to the new vertices. For instance, in our previous examples of friend graphs, if we also want to encode the home addresses in our graph, each vertex representing a person must be connected to a vertex representing their address with the edge between them lives at. It does not take a lot of imagination to realize that this approach creates a lot of complexity, especially if the vertex properties interrelate. Representing properties in a graph by subject-predicate-object triples has been formalized in the so-called Resource Description Framework (RDF), and the result of this is called an RDF-model. RDFs are a subject on their own and allow for a little more flexibility than we presented. In any case, it is good to be familiar with the concept and understand its limitations.

In a property graph, in contrast, we can augment both vertices and edges with essentially arbitrary additional structure. As with anything, gaining flexibility in this generality usually comes as a trade-off. In our case, basic graphs as implemented in many graph databases allow for the powerful optimization of queries, while with property graphs, we should be careful when it comes to performance. We will touch this topic in more detail in the next section, when we show how Spark GraphX implements property graphs.

Throughout the rest of the chapter, we'll use the following convention for property graphs. The additional data attached to vertices is called vertex data and the one for edges is called edge data. To give an example of a little more involved vertex and edge data, see the following diagram for an extension of our idea of extending a friend graph. This example also displays what we mean by a triplet, that is, an edge with its adjacent vertices and all their properties:

Figure 9: A property graph showing friends augmented by address data, connected by more than one relation. Property data is encoded in JSON format.

Note that in the preceding example, we kept it simple on purpose, but in a more realistic scenario, we would have the need for nested data structures--for example, to answer how much money is owed and when it is due.

An interesting special case of a property graph in our context is that of a weighted graph, in which edges, vertices, or both have weights, for example, integers or floating point numbers attached to them. A prototypical example for this is a graph consisting of a set of cities as vertices ,with the edges connecting them carrying the distance between locations. A few classical questions arise in this scenario. One example would be to find the shortest path between two given cities. A related issue is the traveling salesman problem, in which a hypothetical salesman is asked to visit every city using the shortest route possible.

As a closing remark for this section, it is important to know that in literature, there is a widely used synonymous notion for vertices, namely nodes. We will not use this term here, since in the context of Spark, it might easily be confused with compute nodes on which workers execute tasks. Instead, we will stick to vertices throughout the chapter. Also, whenever we speak of a graph, we generally assume that it is a finite graph, that is, the number of vertices and edges is finite, which, in practice, hardly counts as restriction.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset