How to do it...

To prepare the data for our graph, we first clean it up and keep only the airport codes that appear in the available flight data. That is, we exclude any airport that does not appear in the departuredelays.csv dataset. The code in this recipe does the following:

Sets the paths to the files you downloaded
Creates the apts and deptsDelays DataFrames by reading the CSV files with headers and an inferred schema
Creates the iata DataFrame, which contains only the airport codes (as origin or destination) that appear in the deptsDelays DataFrame
Joins the iata and apts DataFrames to create the airports DataFrame

We filter the data to create the airports DataFrame so that, when we create our GraphFrame in the following recipes, every vertex in the graph has at least one edge:

# Set File Paths
delays_fp = "/data/departuredelays.csv"
apts_fp = "/data/airport-codes-na.txt"

# Obtain airports dataset
apts = spark.read.csv(apts_fp, header='true', inferSchema='true', sep='\t')
apts.createOrReplaceTempView("apts")

# Obtain departure Delays data
deptsDelays = spark.read.csv(delays_fp, header='true', inferSchema='true')
deptsDelays.createOrReplaceTempView("deptsDelays")
deptsDelays.cache()

# Available IATA codes from the departuredelays sample dataset
iata = spark.sql("""
    select distinct iata 
    from (
        select distinct origin as iata 
        from deptsDelays 
        
        union all 
        select distinct destination as iata 
        from deptsDelays
    ) as a
""")
iata.createOrReplaceTempView("iata")


# Only include airports with at least one trip in the departure delays dataset
airports = spark.sql("""
    select f.IATA
        , f.City
        , f.State
        , f.Country 
    from apts as f 
    join iata as t 
        on t.IATA = f.IATA
""")
airports.createOrReplaceTempView("airports")
airports.cache()
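
To make the filtering logic easier to follow, here is a minimal plain-Python sketch of the same idea. It does not use Spark, and the sample rows and the names flights, airport_rows, iata_codes, and airports_kept are made up for illustration; they are not part of the recipe's datasets:

```python
# Hypothetical sample rows standing in for the two files (not real recipe data)
flights = [
    ("SEA", "SFO"),  # (origin, destination)
    ("SFO", "JFK"),
    ("SEA", "JFK"),
]
airport_rows = [
    {"IATA": "SEA", "City": "Seattle", "State": "WA", "Country": "USA"},
    {"IATA": "SFO", "City": "San Francisco", "State": "CA", "Country": "USA"},
    {"IATA": "JFK", "City": "New York", "State": "NY", "Country": "USA"},
    {"IATA": "ABE", "City": "Allentown", "State": "PA", "Country": "USA"},
]

# Same idea as the iata query: every code seen as an origin or a destination
iata_codes = {origin for origin, _ in flights} | {dest for _, dest in flights}

# Same idea as the inner join: keep only airports whose code appears in a flight
airports_kept = [row for row in airport_rows if row["IATA"] in iata_codes]

print(sorted(row["IATA"] for row in airports_kept))
```

In this sketch, ABE is dropped because no sample flight starts or ends there, which mirrors how the join removes airports that would otherwise become vertices without edges.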
