How it works...

As pointed out earlier, each element of the RDD inside the DataFrame is a Row(...) object. You can check this by running these two statements:

sample_data_df.rdd.take(1)

And:

sample_data.take(1)

The first one produces a single-item list whose element is a Row(...) object.

The second also produces a single-item list, but the item is a plain tuple.

The sample_data RDD is the first RDD we created in the previous recipe.
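Conceptually, a Row(...) object behaves much like Python's namedtuple: an ordered tuple whose fields can also be accessed by name, whereas a plain RDD element is just a positional tuple. Here is a minimal pure-Python sketch of that distinction; the Model and RAM fields are hypothetical stand-ins for the columns in our data:

```python
from collections import namedtuple

# A Row(...) behaves much like a named tuple: ordered values
# that can also be accessed by field name.
Row = namedtuple('Row', ['Model', 'RAM'])

named = Row(Model='MacBook Pro', RAM='16GB')   # analogous to a DataFrame row
plain = ('MacBook Pro', '16GB')                # analogous to a raw RDD element

print(named.Model)   # access by field name
print(plain[0])      # positional access only
```

Both hold the same values; the Row-like object simply carries the schema (the field names) along with them.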

With that in mind, let's now turn our attention to the code. 

First, we load the necessary modules: to work with the Row(...) objects, we need pyspark.sql, and since we will use the .round(...) function later, we also need the pyspark.sql.functions submodule.

Next, we extract .rdd from sample_data_df. Using the .map(...) transformation, we first add the HDD_size column to the schema.

Since we are working with RDDs, we want to retain all the other columns. Thus, we first convert the row (which is a Row(...) object) into a dictionary using the .asDict() method, so that we can later unpack it using **.

In Python, a single * in front of an iterable, when used in a function call, unpacks the iterable and passes each element as a separate positional argument. A double ** in front of a dictionary unpacks it into keyword arguments: each key becomes a parameter name, and the corresponding value becomes the value passed for that parameter.
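A quick pure-Python illustration of both forms of unpacking (the describe function and its parameters are hypothetical):

```python
def describe(Model, RAM):
    return '{0} with {1}'.format(Model, RAM)

args = ('MacBook Pro', '16GB')
kwargs = {'Model': 'MacBook Pro', 'RAM': '16GB'}

print(describe(*args))      # * unpacks the tuple into positional arguments
print(describe(**kwargs))   # ** unpacks the dict into keyword arguments
```

Both calls are equivalent; with **, the dictionary keys must match the function's parameter names.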

The second argument follows a simple convention: we pass the name of the column we want to create (HDD_size) and set it to the desired value. In our first example, we split the .HDD column on whitespace and extract the first token, since that token is the HDD size.
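The mechanics of that step can be sketched in plain Python; the '500GB HDD' value below is hypothetical, standing in for whatever the .HDD column actually holds:

```python
# A row converted to a dictionary with .asDict(); the values are hypothetical.
row_dict = {'Model': 'MacBook Pro', 'HDD': '500GB HDD'}

# Split the HDD string, take the first token as the size, and rebuild
# the row with the new key by unpacking the original dict with **.
new_row = dict(**row_dict, HDD_size=row_dict['HDD'].split(' ')[0])

print(new_row['HDD_size'])
```

Inside the actual .map(...) transformation the same pattern reads Row(**row.asDict(), HDD_size=row.HDD.split(' ')[0]), producing a new Row(...) with all the original columns plus the new one.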

We repeat this step twice more: first, to create the HDD_type column, and second, to create the Volume column.

Next, we use the .toDF(...) method to convert our RDD back to a DataFrame. Note that you can still use the .toDF(...) method to convert a regular RDD (that is, one where each element is not a Row(...) object) to a DataFrame, but you will need to pass a list of column names to the .toDF(...) method, or you will end up with unnamed columns.

Finally, we .select(...) the columns so we can .round(...) the newly created Volume column. The .alias(...) method produces a different name for the resulting column.

The resulting DataFrame looks as follows:

Unsurprisingly, the desktop iMac would require the biggest box.
