Understanding motifs

To make sense of the complex relationships between city airports and the flights connecting them, we can use motifs to find patterns of airports (that is, vertices) connected by flights (that is, edges). The result is a DataFrame whose column names are given by the motif keys. Note that motif finding is one of the new graph algorithms supported as part of GraphFrames.

For example, let's determine the delays that are due to San Francisco International Airport (SFO):

# Generate motifs
# The outer parentheses let the chained calls span several lines
motifs = (
    tripGraphPrime
    .find("(a)-[ab]->(b); (b)-[bc]->(c)")
    .filter("(b.id = 'SFO') and (ab.delay > 500 or bc.delay > 500) and bc.tripid > ab.tripid and bc.tripid < ab.tripid + 10000")
)

# Display motifs
display(motifs)

In the preceding query, (x) denotes a vertex (that is, an airport) and [xy] denotes an edge (that is, a flight between airports). To determine the delays that are due to SFO, the pattern is read as follows:

  • The vertex (b) represents the airport in the middle (that is, SFO)
  • The vertex (a) represents the origin airport (within the dataset)
  • The vertex (c) represents the destination airport (within the dataset)
  • The edge [ab] represents the flight between (a) (that is, origin) and (b) (that is, SFO)
  • The edge [bc] represents the flight between (b) (that is, SFO) and (c) (that is, destination)
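
To make the pattern concrete, the following is a minimal plain-Python sketch of what "(a)-[ab]->(b); (b)-[bc]->(c)" matches: every pair of edges in which the destination of the first edge is the source of the second. It is not how GraphFrames implements find(); the function name find_two_hop and the sample edges are hypothetical and exist only for illustration. As a simplification, each vertex is represented by its ID alone rather than by a full vertex record.

```python
# Illustrative only: a plain-Python stand-in for the two-hop motif,
# not the GraphFrames implementation. The sample edges are made up.
def find_two_hop(edges):
    """Return one dict per match of (a)-[ab]->(b); (b)-[bc]->(c)."""
    matches = []
    for ab in edges:
        for bc in edges:
            # The shared vertex (b) ties the two edges together.
            if ab["dst"] == bc["src"]:
                matches.append({
                    "a": ab["src"], "ab": ab,
                    "b": ab["dst"], "bc": bc,
                    "c": bc["dst"],
                })
    return matches


sample_edges = [
    {"src": "SEA", "dst": "SFO", "delay": 10, "tripid": 1010700},
    {"src": "SFO", "dst": "JFK", "delay": 600, "tripid": 1010900},
    {"src": "JFK", "dst": "BOS", "delay": 0, "tripid": 1011800},
]

two_hop = find_two_hop(sample_edges)
for m in two_hop:
    print(m["a"], "->", m["b"], "->", m["c"])
```

Each match is keyed by the names used in the pattern (a, ab, b, bc, c), which mirrors how the motif keys become the column names of the resulting DataFrame.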

Within the filter statement, we apply some rudimentary constraints (note that this is an oversimplified representation of flight paths):

  • b.id = 'SFO' denotes that the middle vertex (b) is limited to just SFO airport
  • (ab.delay > 500 or bc.delay > 500) limits the results to pairs in which at least one of the two flights was delayed by more than 500 minutes
  • (bc.tripid > ab.tripid and bc.tripid < ab.tripid + 10000) denotes that the [ab] flight must come before the [bc] flight and that the two are at most about one day apart. The tripid was derived from the date and time, which is why the constraint can be expressed as simple integer arithmetic
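
The three constraints can also be read as a single predicate over one matched pair of flights. The sketch below is plain Python rather than GraphFrames code, and it assumes that tripid is the departure date and time packed into an integer as MMDDHHMM, so that adding 10000 moves the day forward by one. The function name keep and the sample flights are hypothetical.

```python
# Illustrative only: the filter conditions as a plain-Python predicate.
# Assumption: tripid encodes MMDDHHMM as an integer, so +10000 is +1 day.
def keep(b_id, ab, bc):
    through_sfo = b_id == "SFO"
    long_delay = ab["delay"] > 500 or bc["delay"] > 500
    ordered = bc["tripid"] > ab["tripid"]
    close_in_time = bc["tripid"] < ab["tripid"] + 10000
    return through_sfo and long_delay and ordered and close_in_time


# Hypothetical flights: one departs for SFO on Jan 1 at 07:00,
# one leaves SFO on Jan 1 at 09:00, one leaves SFO on Jan 3 at 09:00.
inbound = {"delay": -5, "tripid": 1010700}
outbound = {"delay": 536, "tripid": 1010900}
much_later = {"delay": 536, "tripid": 1030900}

print(keep("SFO", inbound, outbound))    # True
print(keep("SFO", outbound, inbound))    # False: wrong order
print(keep("SFO", inbound, much_later))  # False: more than a day apart
print(keep("JFK", inbound, outbound))    # False: middle airport is not SFO
```

Comparing tripid values as integers only approximates a time window, and it would not behave correctly across a month or year boundary, which is part of why the text calls these constraints rudimentary.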

The output of this query is noted in the following figure:

[Figure: output of the motif query]

The following is a simplified, abridged subset of the results of this query, where the columns are the respective motif keys:

a              | ab                         | b                    | bc                          | c
Houston (IAH)  | IAH -> SFO (-4) [1011126]  | San Francisco (SFO)  | SFO -> JFK (536) [1021507]  | New York (JFK)
Tucson (TUS)   | TUS -> SFO (-5) [1011126]  | San Francisco (SFO)  | SFO -> JFK (536) [1021507]  | New York (JFK)

Referring to the TUS -> SFO -> JFK flights, you will notice that while the flight from Tucson to San Francisco departed 5 minutes early, the flight from San Francisco to New York JFK was delayed by 536 minutes.

By using motif finding, you can easily search for structural patterns in your graph; by using GraphFrames, you are using the power and speed of DataFrames to distribute and perform your query.
