A duplicate is a record that appears more than once in your dataset as an exact copy of another row. Spark DataFrames provide a convenience transformation to remove duplicated rows, .dropDuplicates():
- Check whether any rows are duplicated by comparing the total row count with the distinct row count:
dirty_data.count(), dirty_data.distinct().count()
- If the counts differ, some rows are duplicates; remove them:
full_removed = dirty_data.dropDuplicates()