There's more...

Before we read the data into our GraphFrame, let's create one more DataFrame:

import pyspark.sql.functions as f
import pyspark.sql.types as t

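@f.udf
def toDate(weirdDate):
    # weirdDate is a string in MMddHHmm format; the year is not stored,
    # so we hardcode 2014
    year = '2014-'
    month = weirdDate[0:2] + '-'
    day = weirdDate[2:4] + ' '
    hour = weirdDate[4:6] + ':'
    minute = weirdDate[6:8] + ':00'

    # Result has the form 'yyyy-MM-dd HH:mm:ss', which casts to a timestamp
    return year + month + day + hour + minute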

deptsDelays = deptsDelays.withColumn('normalDate', toDate(deptsDelays.date))
deptsDelays.createOrReplaceTempView("deptsDelays")

# Get key attributes of a flight
deptsDelays_GEO = spark.sql("""
    select cast(f.date as int) as tripid
         , cast(f.normalDate as timestamp) as `localdate`
         , cast(f.delay as int)
         , cast(f.distance as int)
         , f.origin as src
         , f.destination as dst
         , o.city as city_src
         , d.city as city_dst
         , o.state as state_src
         , d.state as state_dst
    from deptsDelays as f
    join airports as o
      on o.iata = f.origin
    join airports as d
      on d.iata = f.destination
""")

# Create Temp View
deptsDelays_GEO.createOrReplaceTempView("deptsDelays_GEO")

# Cache and Count
deptsDelays_GEO.cache()
deptsDelays_GEO.count()

The preceding code snippet performs several preparation steps to create the deptsDelays_GEO DataFrame:

  • It creates a tripid column that allows us to uniquely identify each trip. Note that this is a bit of a hack: we simply cast the date column to an int, which works because each trip has a unique date in this dataset.
  • It converts the date column into a real timestamp. The date column isn't a traditional date per se, as it is a string in the format MMddHHmm (month, day, hour, minute) with no year. Therefore, we first apply a UDF (the toDate(...) function) to rearrange it into a standard format, and then cast the result to an actual timestamp.
  • It casts the delay and distance columns to integer values instead of strings.
  • It renames f.origin to src and f.destination to dst. In the following sections, we will be using the airport codes (the iata column) as our vertices. To create the edges for our graph, we need to specify the IATA codes for the source (originating airport) and the destination (destination airport). The join statements and the renaming prepare the data for the GraphFrame, which explicitly looks for the src and dst columns when defining edges.
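
To make the date handling concrete, here is a minimal plain-Python sketch of the same string manipulation that toDate(...) performs, runnable without a Spark session. The function name to_timestamp_string and the sample value '01011245' are hypothetical and used only for illustration; they are not taken from the dataset.

```python
def to_timestamp_string(weird_date, year=2014):
    """Rearrange an MMddHHmm string into 'yyyy-MM-dd HH:mm:00'.

    Hypothetical helper mirroring the toDate UDF; the year is assumed
    to be 2014 because the raw value does not carry one.
    """
    if len(weird_date) != 8 or not weird_date.isdigit():
        raise ValueError("expected an 8-digit MMddHHmm string")

    month, day = weird_date[0:2], weird_date[2:4]
    hour, minute = weird_date[4:6], weird_date[6:8]
    return f"{year}-{month}-{day} {hour}:{minute}:00"


sample = "01011245"  # illustrative value: January 1, 12:45

print(to_timestamp_string(sample))  # 2014-01-01 12:45:00

# Converting the same string to an integer drops the leading zero, so the
# slicing above only works while the value is still a string.
print(int(sample))  # 1011245
```

The second print shows why the order of operations matters: the UDF relies on fixed character positions, so it has to run on the original string rather than on an integer version of the column.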