How it works...

As you probably know by now, the .count() method returns the number of rows in a DataFrame. The second command counts only the distinct rows. Running these two commands on our dirty_data DataFrame produces (22, 21) as the result. The difference of one tells us that two records in our dataset are exact copies of each other. Let's see which ones:

(
    dirty_data
    .groupby(dirty_data.columns)
    .count()
    .filter('count > 1')
    .show()
)

Let's unpack what's happening here. First, we use the .groupby(...) method to define what columns to use for the aggregation; in this example, we essentially use all of them as we want to find all the distinct combinations of all the columns in our dataset. Next, we count how many times such a combination of values occurs using the .count() method; the method adds the count column to our dataset. Using the .filter(...) method, we select all the rows that occur in our dataset more than once and print them to the screen using the .show() action.

This produces the following result:

So, the row with Id equal to 16 is the duplicated one. Let's drop it using the .dropDuplicates(...) method. Running the full_removed.count() command on the result then confirms that we now have 21 records.
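
To make the semantics of these steps concrete without a Spark session, here is a minimal plain-Python sketch of the same logic. It is a model of what the DataFrame methods compute, not PySpark code: the toy rows and the Name and Score columns are made up for illustration, and only the Id column and the full_removed name come from the recipe. In PySpark itself, the last step would presumably be a call along the lines of full_removed = dirty_data.dropDuplicates().

```python
from collections import Counter

# Hypothetical toy rows standing in for dirty_data; each tuple is (Id, Name, Score).
rows = [
    (14, "a", 2.4),
    (15, "b", 1.8),
    (16, "c", 3.1),
    (16, "c", 3.1),  # exact copy of the previous row
]

# Counterpart of .count() and .distinct().count()
total, distinct = len(rows), len(set(rows))

# Counterpart of .groupby(<all columns>).count().filter('count > 1'):
# count each full combination of values and keep those seen more than once
counts = Counter(rows)
duplicated = [row for row, n in counts.items() if n > 1]

# Counterpart of .dropDuplicates(): keep a single copy of every row.
# dict.fromkeys keeps the first occurrence; Spark makes no ordering promise.
full_removed = list(dict.fromkeys(rows))

print((total, distinct))   # (4, 3)
print(duplicated)          # [(16, 'c', 3.1)]
print(len(full_removed))   # 3
```

Just as in the recipe, the total and distinct counts differ by one, the grouping step singles out the row with Id equal to 16, and the row count after dropping duplicates matches the distinct count.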
