.map(...) method

To make sense of the tab-delimited data with an RDD, we will use the .map(...) function to transform the data from a list of strings to a list of lists:

myRDD = (
    sc
    # Read the tab-delimited airport codes file as an RDD of strings
    .textFile('~/data/flights/airport-codes-na.txt')
    # Split each line on the tab character, yielding a list of fields per row
    .map(lambda element: element.split("\t"))
)

The key components of this map transformation are:

  • lambda: An anonymous function (that is, a function defined without a name) composed of a single expression
  • split: We're using Python's built-in string split method (not the split function in pyspark.sql.functions, which operates on DataFrame columns) to split each string on a literal delimiter; in this case, our delimiter is the tab character (that is, \t); see the sketch after this list
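To see what the lambda passed to .map(...) does in isolation, here is a minimal sketch in plain Python (no Spark required); the sample line is hypothetical but mirrors the file's format:

# A hypothetical tab-delimited line, as it would appear in the input file
line = 'Abbotsford\tBC\tCanada\tYXX'

# The same anonymous function we pass to .map(...)
split_fields = lambda element: element.split('\t')

split_fields(line)  # returns ['Abbotsford', 'BC', 'Canada', 'YXX']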

Putting the sc.textFile() and map() functions together allows us to read the text file and split each line on the tab delimiter, producing an RDD that is a parallelized collection of lists:
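The input cell itself isn't shown in this excerpt; an action along the lines of the following take(5) call, which returns the first five elements of the RDD, would produce output of this shape:

myRDD.take(5)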

Out[22]:  [[u'City', u'State', u'Country', u'IATA'], [u'Abbotsford', u'BC', u'Canada', u'YXX'], [u'Aberdeen', u'SD', u'USA', u'ABR'], [u'Abilene', u'TX', u'USA', u'ABI'], [u'Akron', u'OH', u'USA', u'CAK']]