.zipWithIndex() transformation

The zipWithIndex() transformation pairs (or zips) each element of the RDD with its index, producing (element, index) tuples. This is handy when you want to remove the header row (the first row) of a file.

Look at the following code snippet:

# View each row within RDD + the index 
# i.e. output is in form ([row], idx)
ac = airports.map(lambda c: (c[0], c[3]))
ac.zipWithIndex().take(5)

This produces the following result:

# Output
[((u'City', u'IATA'), 0),
((u'Abbotsford', u'YXX'), 1),
((u'Aberdeen', u'ABR'), 2),
((u'Abilene', u'ABI'), 3),
((u'Akron', u'CAK'), 4)]
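
As a rough local analogy (not how Spark implements it), the same (row, idx) pairing can be reproduced on a plain Python list with enumerate(). The list name rows is a hypothetical stand-in for the RDD, holding a few values copied from the output above:

```python
# Plain-Python analogy of zipWithIndex(); no Spark required.
# `rows` is a hypothetical stand-in for the RDD, using values from the output above.
rows = [
    ("City", "IATA"),
    ("Abbotsford", "YXX"),
    ("Aberdeen", "ABR"),
]

# enumerate() yields (idx, row); swap the order to match zipWithIndex's (row, idx).
indexed = [(row, idx) for idx, row in enumerate(rows)]

print(indexed)
```

Note that the index comes second in each pair, which is the reverse of what enumerate() returns.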

To remove the header from your data, you can use the following code:

# Using zipWithIndex to skip header row
# - filter out row 0
# - extract only row info
(
ac
.zipWithIndex()
.filter(lambda pair: pair[1] > 0)
.map(lambda pair: pair[0])
.take(5)
)

The preceding code skips the header, as shown in the following output:

# Output
[(u'Abbotsford', u'YXX'),
(u'Aberdeen', u'ABR'),
(u'Abilene', u'ABI'),
(u'Akron', u'CAK'),
(u'Alamosa', u'ALS')]
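
The filter-then-map pattern can also be tried without a cluster. The following is a minimal sketch on a plain Python list, assuming a hypothetical indexed list that stands in for the result of ac.zipWithIndex(). Each lambda takes a single pair argument and indexes into it, a form that runs on both Python 2 and Python 3 (tuple unpacking in lambda parameters is Python 2 only):

```python
# Plain-Python sketch of the filter-then-map header skip; no Spark required.
# `indexed` is a hypothetical stand-in for the result of ac.zipWithIndex().
indexed = [
    (("City", "IATA"), 0),
    (("Abbotsford", "YXX"), 1),
    (("Aberdeen", "ABR"), 2),
]

# Keep pairs whose index is greater than 0, i.e. everything after the header.
without_header = filter(lambda pair: pair[1] > 0, indexed)

# Drop the index again so only the original row remains.
rows = list(map(lambda pair: pair[0], without_header))

print(rows)
```

The same two lambdas can be passed to the RDD's filter() and map() methods unchanged.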