Getting ready

This recipe will be reading a tab-delimited (or comma-delimited) file, so please ensure that you have a text (or CSV) file available. For your convenience, you can download the airport-codes-na.txt and departuredelays.csv files from http://bit.ly/2nroHbh. Ensure your local Spark cluster can access this file (~/data/flights/airport-codes-na.txt).

If you are running Databricks, the same file is already included in the /databricks-datasets folder; the command to load it is:

myRDD = sc.textFile('/databricks-datasets/flights/airport-codes-na.txt').map(lambda line: line.split("\t"))

Many of the transformations in the next section use the airports and flights RDDs; let's set them up with the following code snippets:

# Set up the RDD: airports
airports = (
    sc
    .textFile('~/data/flights/airport-codes-na.txt')
    .map(lambda element: element.split("\t"))
)

airports.take(5)

# Output
Out[11]:
[[u'City', u'State', u'Country', u'IATA'],
 [u'Abbotsford', u'BC', u'Canada', u'YXX'],
 [u'Aberdeen', u'SD', u'USA', u'ABR'],
 [u'Abilene', u'TX', u'USA', u'ABI'],
 [u'Akron', u'OH', u'USA', u'CAK']]
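Note that the first element returned is the header row. Many recipes filter it out before transforming the data; here is a minimal sketch of that filter on plain Python lists (the sample rows are copied from the output above) so it runs without a Spark cluster. On the RDD itself, the equivalent is airports.filter(lambda row: row != header).

```python
# Sample rows copied from the airports.take(5) output above.
sample = [
    ['City', 'State', 'Country', 'IATA'],
    ['Abbotsford', 'BC', 'Canada', 'YXX'],
    ['Aberdeen', 'SD', 'USA', 'ABR'],
]

header = sample[0]

# Drop the header row, keeping only the data rows.
rows = [row for row in sample if row != header]

print(rows[0])  # ['Abbotsford', 'BC', 'Canada', 'YXX']
```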


# Set up the RDD: flights
flights = (
    sc
    .textFile('~/data/flights/departuredelays.csv', minPartitions=8)
    .map(lambda line: line.split(","))
)

flights.take(5)

# Output
[[u'date', u'delay', u'distance', u'origin', u'destination'],
 [u'01011245', u'6', u'602', u'ABE', u'ATL'],
 [u'01020600', u'-8', u'369', u'ABE', u'DTW'],
 [u'01021245', u'-2', u'602', u'ABE', u'ATL'],
 [u'01020605', u'-4', u'602', u'ABE', u'ATL']]
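All fields come back as strings; if a later step needs numeric delay and distance values, they must be cast explicitly. The sketch below (plain Python, so it runs standalone; parse_flight is an illustrative name, not part of the recipe) shows such a cast — applied to the RDD, it would run via flights.map(parse_flight) after the header row is removed.

```python
# Illustrative helper (not part of the recipe): cast delay and
# distance to int, leaving the other fields as strings.
def parse_flight(row):
    date, delay, distance, origin, destination = row
    return (date, int(delay), int(distance), origin, destination)

# Sample rows copied from the flights.take(5) output above (header skipped).
sample = [
    ['01011245', '6', '602', 'ABE', 'ATL'],
    ['01020600', '-8', '369', 'ABE', 'DTW'],
]

parsed = [parse_flight(row) for row in sample]
print(parsed[0])  # ('01011245', 6, 602, 'ABE', 'ATL')
```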