How to do it...

To prepare the data for our graph, we first clean it up and keep only the airport codes that appear in the available flight data. That is, we exclude any airport that does not appear in the departuredelays.csv dataset. The upcoming recipe does the following:

  1. Sets the paths to the files you downloaded
  2. Creates the apts and deptsDelays DataFrames by reading the CSV files with headers and an inferred schema
  3. Creates the iata DataFrame, which contains only the airport codes (the IATA column) that appear in the deptsDelays DataFrame
  4. Joins the iata and apts DataFrames to create the airports DataFrame

We filter the data to create the airports DataFrame because, when we build our GraphFrame in the following recipes, we want the graph to contain only vertices that have edges:

# Set File Paths
delays_fp = "/data/departuredelays.csv"
apts_fp = "/data/airport-codes-na.txt"

# Obtain airports dataset (the file is tab-delimited)
apts = spark.read.csv(apts_fp, header='true', inferSchema='true', sep='\t')
apts.createOrReplaceTempView("apts")

# Obtain departure Delays data
deptsDelays = spark.read.csv(delays_fp, header='true', inferSchema='true')
deptsDelays.createOrReplaceTempView("deptsDelays")
deptsDelays.cache()

# Available IATA codes from the departuredelays sample dataset
iata = spark.sql("""
    select distinct iata
    from (
        select distinct origin as iata
        from deptsDelays
        union all
        select distinct destination as iata
        from deptsDelays
    ) as a
""")
iata.createOrReplaceTempView("iata")

# Only include airports with at least one trip in the departure delays dataset
airports = spark.sql("""
    select f.IATA
         , f.City
         , f.State
         , f.Country
    from apts as f
    join iata as t
      on t.IATA = f.IATA
""")
airports.createOrReplaceTempView("airports")
airports.cache()
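
To make the filtering logic concrete without a Spark session, here is a minimal sketch in plain Python. The rows and the helper names (codes_with_flights, filter_airports) are hypothetical stand-ins for illustration, not part of the recipe or of any Spark API. It mirrors the two queries above: first collect the distinct codes that appear as an origin or a destination, then keep only the airports whose code is in that set.

```python
# Toy stand-ins for the two datasets (hypothetical rows, not from the real files).
airports_all = [
    {"IATA": "SEA", "City": "Seattle", "State": "WA", "Country": "USA"},
    {"IATA": "SFO", "City": "San Francisco", "State": "CA", "Country": "USA"},
    {"IATA": "JFK", "City": "New York", "State": "NY", "Country": "USA"},
    {"IATA": "YVR", "City": "Vancouver", "State": "BC", "Country": "Canada"},
]
flights = [
    {"origin": "SEA", "destination": "SFO"},
    {"origin": "SFO", "destination": "JFK"},
    {"origin": "SEA", "destination": "JFK"},
]


def codes_with_flights(flights):
    """Distinct codes that appear as an origin or a destination (the iata query)."""
    origins = {f["origin"] for f in flights}
    destinations = {f["destination"] for f in flights}
    return origins | destinations


def filter_airports(airports, codes):
    """Keep only airports whose code has at least one flight (the inner join)."""
    return [a for a in airports if a["IATA"] in codes]


iata_codes = codes_with_flights(flights)
airports_kept = filter_airports(airports_all, iata_codes)

# YVR has no flights in the toy data, so it is dropped.
print(sorted(a["IATA"] for a in airports_kept))  # ['JFK', 'SEA', 'SFO']
```

In this toy data, YVR never appears as an origin or a destination, so it would be a vertex without edges; the filter removes it, which is the same effect the join has on the real airports data.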