.textFile(...) method

To read the file, we are using SparkContext's textFile() method via this command:

(
    sc
    .textFile(
        '~/data/flights/airport-codes-na.txt',
        minPartitions=4,
        use_unicode=True
    )
)

Only the first parameter is required: the location of the text file (here, ~/data/flights/airport-codes-na.txt). There are also two optional parameters:

  • minPartitions: Indicates the minimum number of partitions that make up the RDD. The Spark engine can often determine a good number of partitions based on the file size, but for performance reasons you may want to override it, hence the ability to specify a minimum.
  • use_unicode: Set this parameter to True (the default) if you are processing Unicode data; set it to False to read the rows as raw bytes.
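To make the minPartitions idea concrete, here is a rough, Spark-free sketch of rows being spread across a requested number of partitions. This is only an illustration in plain Python; Spark's actual partitioning logic is more sophisticated (it also considers file block sizes), and the sample rows are taken from the output shown later in this section:

```python
# Sample rows, as they would appear after reading the airport codes file.
lines = [
    'City\tState\tCountry\tIATA',
    'Abbotsford\tBC\tCanada\tYXX',
    'Aberdeen\tSD\tUSA\tABR',
    'Abilene\tTX\tUSA\tABI',
    'Akron\tOH\tUSA\tCAK',
]

min_partitions = 4

# Round-robin the rows into min_partitions buckets, a crude stand-in for
# how a minimum partition count splits work across the cluster.
partitions = [lines[i::min_partitions] for i in range(min_partitions)]

# Every row lands in exactly one partition.
assert sum(len(p) for p in partitions) == len(lines)
```

More partitions mean more parallel tasks, which is why you might raise the minimum for a small file that Spark would otherwise read as one or two partitions.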

Note that if you were to execute this statement without a subsequent map() function, the resulting RDD would not parse the tab delimiter; it would simply be a list of strings, one per line:

myRDD = sc.textFile('~/data/flights/airport-codes-na.txt')
myRDD.take(5)

# Out[35]: [u'City State Country IATA', u'Abbotsford BC Canada YXX', u'Aberdeen SD USA ABR', u'Abilene TX USA ABI', u'Akron OH USA CAK']