First, we load the SQL module of PySpark.
Next, we read the DataFrames_sample.csv file using the .textFile(...) method of SparkContext.
The resulting RDD looks as follows:
As you can see, the RDD still contains the row with the column names. To get rid of it, we first extract the header using the .first() method and then use the .filter(...) transformation to remove any row that equals the header.
Next, we split each row on commas and create a Row(...) object for each observation. Note that we convert each field to its proper datatype: for example, the Id column should be an integer, the Model name a string, and W (width) a float.
Finally, we simply call the .createDataFrame(...) method of SparkSession to convert our RDD of Row(...) objects into a DataFrame. Here's the final result: