Only IDs differ

If you collect data over time, you might record the same data more than once under different IDs. Let's check whether our DataFrame has any such records. The following snippet will help you do this:

(
    full_removed
    .groupby([col for col in full_removed.columns if col != 'Id'])
    .count()
    .filter('count > 1')
    .show()
)

Just like before, we group by every column except 'Id', count the records in each group, and then keep only the groups with 'count > 1' and show them on the screen. After running the preceding code, here's what we get:

As you can see, four records have different IDs but describe the same cars: the BMW 440i Coupe and the Hyundai G80 AWD.
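To see the grouping logic on its own, without a Spark session, here is a minimal plain-Python sketch of the same idea. The rows and the column names other than 'Id' are invented for illustration and are not the dataset used in this chapter.

```python
from collections import Counter

# Hypothetical sample rows (not the chapter's dataset): (Id, Manufacturer, Model)
columns = ["Id", "Manufacturer", "Model"]
rows = [
    (1, "BMW", "440i Coupe"),
    (2, "Hyundai", "G80 AWD"),
    (3, "BMW", "440i Coupe"),
    (4, "Toyota", "Camry"),
]

# Positions of every column except 'Id'
key_idx = [i for i, col in enumerate(columns) if col != "Id"]

# Group by the non-Id values and count the rows in each group
counts = Counter(tuple(row[i] for i in key_idx) for row in rows)

# Keep only the groups that occur more than once
repeated = {key: n for key, n in counts.items() if n > 1}
print(repeated)
```

Only the BMW group survives the filter here, because it is the only combination of non-Id values that appears on more than one row.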

We could also check the counts, like before:

no_ids = (
    full_removed
    .select([col for col in full_removed.columns if col != 'Id'])
)

no_ids.count(), no_ids.distinct().count()

First, we select every column except 'Id', and then count the total number of rows and the number of distinct rows. After running the previous snippet, you should see (21, 19): two rows are redundant copies, which matches the two duplicated cars (four records in total) that we saw earlier.
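The relationship between the two counts can be checked with a small plain-Python sketch. The rows below are invented for illustration (they are not the chapter's data), and a list of tuples stands in for the DataFrame.

```python
# Hypothetical sample rows (not the chapter's dataset): (Id, Manufacturer, Model)
rows = [
    (1, "BMW", "440i Coupe"),
    (2, "Hyundai", "G80 AWD"),
    (3, "BMW", "440i Coupe"),
    (4, "Toyota", "Camry"),
    (5, "Hyundai", "G80 AWD"),
]

# Drop the Id (first field) from every row
no_ids = [row[1:] for row in rows]

# Total number of rows versus number of distinct rows
total, distinct = len(no_ids), len(set(no_ids))
print((total, distinct))

# Each duplicated pair contributes exactly one redundant row
redundant = total - distinct
print(redundant)
```

With two cars each recorded twice, the difference between the two counts is 2, even though four rows take part in the duplication.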

The .dropDuplicates(...) method can handle such situations easily. All we need to do is pass the subset parameter a list of the columns we want it to consider when searching for duplicates. Here's how:

id_removed = full_removed.dropDuplicates(
    subset=[col for col in full_removed.columns if col != 'Id']
)

Once again, we select every column except 'Id' to define which columns determine the duplicates. If we now count the total number of rows in the id_removed DataFrame, we should get 19:

And that's precisely what we got!
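For intuition about what deduplicating on a subset of columns does, here is a rough plain-Python sketch. The rows are invented for illustration, and the sketch keeps the first row it meets for each key; that choice is an assumption of this example, and in Spark you should not rely on which of the duplicate rows (and therefore which ID) survives.

```python
# Hypothetical sample rows (not the chapter's dataset): (Id, Manufacturer, Model)
columns = ["Id", "Manufacturer", "Model"]
rows = [
    (1, "BMW", "440i Coupe"),
    (2, "Hyundai", "G80 AWD"),
    (3, "BMW", "440i Coupe"),
    (4, "Toyota", "Camry"),
    (5, "Hyundai", "G80 AWD"),
]

# Columns used to decide whether two rows are duplicates
subset = [col for col in columns if col != "Id"]
subset_idx = [columns.index(col) for col in subset]

seen = set()
deduplicated = []
for row in rows:
    # Compare rows only on the subset columns, ignoring the Id
    key = tuple(row[i] for i in subset_idx)
    if key not in seen:
        seen.add(key)
        deduplicated.append(row)  # the full row, Id included, is kept

print(len(deduplicated))
```

Note that the surviving rows still carry their 'Id' values; the subset only controls how duplicates are detected, not which columns are returned.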
