As pointed out earlier, each element of the RDD inside the DataFrame is a Row(...) object. You can verify this by running these two statements:
sample_data_df.rdd.take(1)
And:
sample_data.take(1)
The first produces a single-item list whose element is a Row(...):
The second also produces a single-item list, but its item is a plain tuple:
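You do not need a running Spark session to see the difference: a Row(...) behaves much like Python's namedtuple, a tuple subclass whose fields can also be accessed by name. A minimal sketch, using hypothetical Model and HDD columns:

```python
from collections import namedtuple

# A Row(...) object is essentially a tuple with named fields;
# the Model and HDD columns here are hypothetical examples.
RowLike = namedtuple("RowLike", ["Model", "HDD"])

row = RowLike(Model="MacBook Pro", HDD="512GB SSD")
plain = ("MacBook Pro", "512GB SSD")

print(row.Model)       # attribute-style access by column name
print(row == plain)    # True: it still compares equal to a plain tuple
print(plain[0])        # a plain tuple supports only positional access
```

This is why extracting .rdd from a DataFrame gives you Row(...) elements that carry the column names, while a hand-built RDD of tuples does not.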
With that in mind, let's now turn our attention to the code.
First, we load the necessary modules: to work with Row(...) objects we need pyspark.sql, and since we will use the .round(...) function later, we also need the pyspark.sql.functions submodule.
Next, we extract .rdd from sample_data_df. Using the .map(...) transformation, we first add the HDD_size column to the schema.
Since we are working with RDDs, we want to retain all the other columns. Thus, we first convert the row (a Row(...) object) into a dictionary using the .asDict() method so that we can unpack it later with **.
The second argument follows a simple convention: we pass the name of the column we want to create (HDD_size) and set it to the desired value. In our first example, we split the value of the .HDD column on spaces and extract the first element, since that is the drive's size.
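The mechanics of that .map(...) step can be shown without Spark. Assuming, hypothetically, that the .HDD column holds strings like '512GB SSD', the dictionary returned by .asDict() is unpacked with ** and the new key is appended alongside the existing ones:

```python
# Hypothetical result of row.asDict() for one record.
row_dict = {"Model": "MacBook Pro", "HDD": "512GB SSD"}

# Row(**row.asDict(), HDD_size=...) boils down to this dict merge:
# every original column is retained, and HDD_size is added.
new_row = {**row_dict, "HDD_size": row_dict["HDD"].split(" ")[0]}

print(new_row)
```

The ** unpacking is what lets us add a column without listing all the existing ones by hand.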
We repeat this step twice more: first, to create the HDD_type column, and second, to create the Volume column.
Next, we use the .toDF(...) method to convert our RDD back to a DataFrame. Note that you can still use .toDF(...) to convert a regular RDD (one whose elements are not Row(...) objects) to a DataFrame, but you will need to pass a list of column names to .toDF(...) or you will end up with unnamed columns.
Finally, we .select(...) the columns so we can .round(...) the newly created Volume column. The .alias(...) method gives the resulting column a different name.
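To recap the whole flow without a Spark cluster, here is the same sequence of steps applied to one plain Python dictionary. The sample values, the '512GB SSD' format of the HDD string, and the assumption that Volume is computed from Height, Width, and Depth columns are all hypothetical; in the actual code each step is a .map(...) over the RDD, and the rounding is done with pyspark.sql.functions.round(...):

```python
# One hypothetical record, as .asDict() would return it.
row = {"Model": "iMac", "HDD": "512GB SSD",
       "Height": 20.3, "Width": 25.6, "Depth": 8.0}

# Add HDD_size: the first token of the HDD string.
row = {**row, "HDD_size": row["HDD"].split(" ")[0]}

# Add HDD_type: the second token of the HDD string.
row = {**row, "HDD_type": row["HDD"].split(" ")[1]}

# Add Volume (hypothetically, the product of the box dimensions).
row = {**row, "Volume": row["Height"] * row["Width"] * row["Depth"]}

# What .round(...).alias(...) amounts to per row: round the value
# and expose it under the chosen column name.
row["Volume"] = round(row["Volume"], 2)

print(row["HDD_size"], row["HDD_type"], row["Volume"])
```

Each dict merge here corresponds to one chained .map(...) call in the Spark pipeline.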
The resulting DataFrame looks as follows:
Unsurprisingly, the desktop iMac would require the biggest box.