The .fillna(...) transformation

The .fillna(...) transformation fills in the missing values in a DataFrame. You can either specify a single value, in which case every missing value is filled with it, or pass a dictionary in which each key is a column name and each value is used to fill the missing values in that column. No direct equivalent exists in the SQL world, although applying COALESCE(...) to each column achieves a similar result.

Look at the following code:

missing_df = sc.parallelize([
    (None, 36.3, 24.2)
    , (1.6, 32.1, 27.9)
    , (3.2, 38.7, 24.7)
    , (2.8, None, 23.9)
    , (3.9, 34.1, 27.9)
    , (9.2, None, None)
]).toDF(['A', 'B', 'C'])

missing_df.fillna(21.4).show()

In the output, every missing value in the A, B, and C columns is replaced with 21.4.

We could also pass a dictionary, as the 21.4 value does not really fit the A column. In the following code, we first calculate the average of each column and then use those averages as the fill values:

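import pyspark.sql.functions as f

# Aggregate to a single row of column means, then turn that row into a dict
miss_dict = (
    missing_df
    .agg(
        f.mean('A').alias('A')
        , f.mean('B').alias('B')
        , f.mean('C').alias('C')
    )
).toPandas().to_dict('records')[0]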

missing_df.fillna(miss_dict).show()

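The .toPandas() method is an action (which we will cover in the next recipe) that returns a pandas DataFrame. The .to_dict(...) method of the pandas DataFrame converts it into Python dictionaries; with the 'records' argument, it returns a list with one dictionary per row, in which each key is a column name and each value is that row's value for the column. Since the aggregated DataFrame has a single row, taking element [0] of the list yields the dictionary of column means.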
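To make the shape of the .to_dict('records') result concrete, here is a minimal pandas-only sketch. The means_pdf frame is a hypothetical stand-in for what .toPandas() would return, with the values typed in by hand rather than computed by Spark.

```python
import pandas as pd

# Hypothetical one-row frame standing in for the result of .toPandas()
means_pdf = pd.DataFrame({'A': [4.14], 'B': [35.3], 'C': [25.72]})

# 'records' yields a list with one dictionary per row
records = means_pdf.to_dict('records')

# The frame has a single row, so the first element is the dict we want
miss_dict = records[0]

print(records)
print(miss_dict)
```

The [0] index is needed because records is a list, even when the frame has only one row.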

In the resulting DataFrame, each missing value is replaced with the mean of its column.
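As a sanity check on the fill values, the same mean imputation can be sketched in plain Python without Spark. The helper names column_means and fill_missing are hypothetical and exist only for this illustration. The sketch skips None when averaging, in the same way that Spark's mean ignores nulls.

```python
# Plain-Python sketch of mean imputation on the same data; no Spark required.
rows = [
    (None, 36.3, 24.2),
    (1.6, 32.1, 27.9),
    (3.2, 38.7, 24.7),
    (2.8, None, 23.9),
    (3.9, 34.1, 27.9),
    (9.2, None, None),
]
columns = ['A', 'B', 'C']


def column_means(rows, columns):
    """Return the mean of each column, skipping None values."""
    means = {}
    for i, name in enumerate(columns):
        present = [row[i] for row in rows if row[i] is not None]
        means[name] = sum(present) / len(present)
    return means


def fill_missing(rows, columns, fill_values):
    """Replace None in each column with that column's entry in fill_values."""
    return [
        tuple(
            fill_values[name] if value is None else value
            for name, value in zip(columns, row)
        )
        for row in rows
    ]


means = column_means(rows, columns)
filled = fill_missing(rows, columns, means)

print({name: round(value, 2) for name, value in means.items()})
for row in filled:
    print(row)
```

The means come out at about 4.14 for A, 35.3 for B, and 25.72 for C, so the last row, which is missing both B and C, receives two different fill values.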
