How it works...

First, we load the SQL module of PySpark. 
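
A minimal setup sketch (the spark and sc variable names here are illustrative; the recipe may already have a session running):

    # Create the SparkSession and, through it, the SparkContext.
    # Row is used later to build typed records.
    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName('DataFrames_sample').getOrCreate()
    sc = spark.sparkContext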

Next, we read the DataFrames_sample.csv file using the .textFile(...) method of SparkContext.
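
A sketch of this step, assuming the file sits in the current working directory and using sample_data as an illustrative variable name:

    # Each element of the RDD is one raw text line of the CSV,
    # including the header line.
    sample_data = sc.textFile('DataFrames_sample.csv')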

Review the previous chapter if you do not yet know how to read data into RDDs.

The resulting RDD contains the raw text lines of the file, so it still includes the row with the column names. To get rid of it, we first extract the header using the .first() method and then use the .filter(...) transformation to remove any row that is equal to it.
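
Continuing the sketch above, the two steps might look as follows:

    # Grab the header line, then keep only the rows that differ from it.
    header = sample_data.first()
    sample_data = sample_data.filter(lambda row: row != header)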

Next, we split each row on commas and create a Row(...) object for each observation. Note that we convert all of the fields to their proper datatypes: for example, the Id column should be an integer, the Model name is a string, and W (width) is a float.
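
A sketch of this transformation; the column list is an assumption, as the text only names Id (integer), Model (string), and W (float), and the actual file may contain more columns:

    # Split each line on commas, then build a Row with typed fields.
    # NOTE: only three columns are shown here; adjust to the real header.
    sample_data = (
        sample_data
        .map(lambda line: line.split(','))
        .map(lambda cols: Row(Id=int(cols[0]), Model=cols[1], W=float(cols[2])))
    )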

Finally, we simply call the .createDataFrame(...) method of SparkSession to convert our RDD of Row(...) objects into a DataFrame, which we can then inspect to confirm the schema and contents.
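
The conversion and a quick inspection might look like this (df is an illustrative name):

    # Build the DataFrame from the RDD of Row objects and inspect it.
    df = spark.createDataFrame(sample_data)
    df.printSchema()  # verify the inferred column types
    df.show()         # display the first rows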
