ID collisions

You might also assume that if two records share the same ID, they are duplicates. While that can be true, such records would already have been removed when we dropped duplicates based on all the columns. At this point, therefore, any repeated IDs are more likely to be collisions: distinct records that happen to share the same identifier.

Duplicated IDs can arise for many reasons: an instrumentation error, an ID field too small to hold all the distinct values, or, if the IDs are derived from a hash of the record's elements, collisions caused by the choice of hash function. These are just a few of the ways you can end up with duplicated IDs even though the records themselves are not duplicates.

Let's check whether this is true for our dataset:

import pyspark.sql.functions as fn

id_removed.agg(
    fn.count('Id').alias('CountOfIDs')
    , fn.countDistinct('Id').alias('CountOfDistinctIDs')
).show()

In this example, instead of first subsetting the records and counting them, and then counting the distinct records in a second step, we use the .agg(...) method to do both at once. To this end, we first import the pyspark.sql.functions module and alias it as fn.
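
For comparison, the two-step approach just mentioned would look something like the following sketch (it assumes the 'Id' column contains no nulls, so that plain row counts agree with the aggregation above):

# Two separate actions instead of a single .agg(...) call
print('CountOfIDs:        ', id_removed.count())
print('CountOfDistinctIDs:', id_removed.select('Id').distinct().count())

The .agg(...) version runs as a single Spark job, whereas this version triggers two.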

For a list of all the functions available in pyspark.sql.functions, please refer to https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions.

The two functions we use here allow us to do the counting in one go: fn.count(...) counts all the records with a non-null value in the specified column, while fn.countDistinct(...) returns the number of distinct values in that column. The .alias(...) method lets us specify a friendly name for each resulting column. Here's what we get after counting:

OK, so we have two records with the same ID. Again, let's check which IDs are duplicated:

(
    id_removed
    .groupby('Id')
    .count()
    .filter('count > 1')
    .show()
)

As before, we group by the values in the 'Id' column, count the records in each group, and then show only the groups with a count greater than 1. Here's what we get:

Well, it looks like we have two records with 'Id == 3'. Let's check whether they're the same:
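
One simple way to pull up both rows is to filter on that ID, for example:

(
    id_removed
    .filter('Id == 3')
    .show()
)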

These are definitely not the same records, but they share the same ID. In this situation, we can create a new ID that is guaranteed to be unique (we have already made sure there are no other duplicates in our dataset). PySpark's pyspark.sql.functions module offers the .monotonically_increasing_id() function, which generates a unique stream of IDs.

An ID generated by .monotonically_increasing_id() is guaranteed to be unique as long as your data lives in fewer than one billion partitions, with fewer than eight billion records in each. Those are pretty big numbers.
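
That guarantee comes from how the value is constructed: the partition index is packed into the upper 31 bits and the row number within the partition into the lower 33 bits. The snippet below is illustrative only and decodes a made-up value (sample_id is hypothetical, not taken from our dataset):

# Decompose a monotonically_increasing_id()-style value:
# upper 31 bits hold the partition index, lower 33 bits the row number.
sample_id = (1 << 33) + 7                       # hypothetically partition 1, row 7
partition_index = sample_id >> 33
row_in_partition = sample_id & ((1 << 33) - 1)
print(partition_index, row_in_partition)        # prints: 1 7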

Here's a snippet that will create and replace our ID column with a unique one:

new_id = (
    id_removed
    .select(
        [fn.monotonically_increasing_id().alias('Id')] +
        [col for col in id_removed.columns if col != 'Id'])
)

new_id.show()

We are creating the ID column first and then selecting all the other columns except the original 'Id' column. Here's what the new IDs look like:

The numbers are definitely unique. We are now ready to handle the other problems in our dataset.
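
Before we do, it's worth rerunning the aggregation from earlier in this section, this time on new_id, to confirm that the counts now match:

new_id.agg(
    fn.count('Id').alias('CountOfIDs')
    , fn.countDistinct('Id').alias('CountOfDistinctIDs')
).show()

If the two numbers are equal, every record now has its own ID.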
