How to do it...

A duplicate is a record that appears in your dataset more than once as an exact copy of another row. Spark DataFrames provide a convenience transformation to remove duplicated rows, .dropDuplicates():

  1. Check whether any rows are duplicated, as follows: 
dirty_data.count(), dirty_data.distinct().count()
  2. If any rows are duplicated, remove them (an end-to-end sketch follows these steps):
full_removed = dirty_data.dropDuplicates()
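
The snippet below pulls both steps together. It is a minimal sketch: the SparkSession setup and the small sample DataFrame standing in for dirty_data are assumptions added for illustration, not part of the recipe's dataset.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup_example").getOrCreate()

# Hypothetical data: the second (2, "Bob") row is an exact copy of the first.
dirty_data = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (2, "Bob")],
    ["id", "name"],
)

# Step 1: compare the total row count with the distinct row count.
total, distinct = dirty_data.count(), dirty_data.distinct().count()
print(total, distinct)  # 3 2 -> counts differ, so duplicates are present

# Step 2: drop the exact duplicates.
full_removed = dirty_data.dropDuplicates()
print(full_removed.count())  # 2

If the two counts in step 1 are equal, there are no exact duplicates and step 2 is a no-op.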