How it works...

First, we load the SQL module of PySpark. 
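
A minimal setup sketch (the spark and sc variable names here are illustrative; the recipe may already have a session running):

    # Create the SparkSession and, through it, the SparkContext.
    # Row is used later to build typed records.
    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName('DataFrames_sample').getOrCreate()
    sc = spark.sparkContext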

Next, we read the DataFrames_sample.csv file using the .textFile(...) method of SparkContext.
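
A sketch of this step, assuming the file sits in the current working directory and using sample_data as an illustrative variable name:

    # Each element of the RDD is one raw text line of the CSV,
    # including the header line.
    sample_data = sc.textFile('DataFrames_sample.csv')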

Review the previous chapter if you do not yet know how to read data into RDDs.

The resulting RDD contains the raw text lines of the file, so it still includes the row with the column names. To get rid of it, we first extract the header using the .first() method and then use the .filter(...) transformation to remove any row that is equal to it.
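
Continuing the sketch above, the two steps might look as follows:

    # Grab the header line, then keep only the rows that differ from it.
    header = sample_data.first()
    sample_data = sample_data.filter(lambda row: row != header)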

Next, we split each row on commas and create a Row(...) object for each observation. Note that we convert all of the fields to their proper datatypes: for example, the Id column should be an integer, the Model name is a string, and W (width) is a float.
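
A sketch of this transformation; the column list is an assumption, as the text only names Id (integer), Model (string), and W (float), and the actual file may contain more columns:

    # Split each line on commas, then build a Row with typed fields.
    # NOTE: only three columns are shown here; adjust to the real header.
    sample_data = (
        sample_data
        .map(lambda line: line.split(','))
        .map(lambda cols: Row(Id=int(cols[0]), Model=cols[1], W=float(cols[2])))
    )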

Finally, we simply call the .createDataFrame(...) method of SparkSession to convert our RDD of Row(...) objects into a DataFrame, which we can then inspect to confirm the schema and contents.
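
The conversion and a quick inspection might look like this (df is an illustrative name):

    # Build the DataFrame from the RDD of Row objects and inspect it.
    df = spark.createDataFrame(sample_data)
    df.printSchema()  # verify the inferred column types
    df.show()         # display the first rows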
