How it works...

If you have read the previous chapter, you probably already know how to create RDDs. In this example, we simply call the sc.parallelize(...) method. 

Our sample dataset contains just a handful of records describing relatively recent Apple computers. However, as with all RDDs, it is hard to figure out what each element of a tuple stands for, since RDDs are schema-less structures.
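As a minimal sketch, the RDD might be created along these lines (the records and column meanings shown here are illustrative, not the recipe's exact data):

    # each tuple: (Id, Model, Year, RAM in GB, HDD in GB, screen size in inches) -- illustrative
    sample_data = sc.parallelize([
        (1, 'MacBook Pro', 2015, 8, 256, 13.3),
        (2, 'MacBook', 2016, 8, 256, 12.0),
        (3, 'MacBook Air', 2016, 8, 128, 13.3),
        (4, 'iMac', 2017, 64, 1024, 27.0)
    ])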

Therefore, when using the .createDataFrame(...) method of SparkSession, we pass a list of column names as the second argument; the first argument is the RDD we wish to transform into a DataFrame.
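A sketch of that call, assuming the six-element tuples and the illustrative column names introduced above:

    # the RDD is the first argument, the list of column names the second;
    # the datatypes of each column are inferred from the data
    sample_data_df = spark.createDataFrame(
        sample_data,
        ['Id', 'Model', 'Year', 'RAM', 'HDD', 'ScreenSize']
    )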

Now, if we peek inside the sample_data RDD using sample_data.take(1), we will retrieve the first record:
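With the illustrative records above, the result would look something like this:

    [(1, 'MacBook Pro', 2015, 8, 256, 13.3)]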

For comparison, we can run sample_data_df.take(1) on the DataFrame to get the following:
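Again assuming the illustrative data, the output would resemble:

    [Row(Id=1, Model='MacBook Pro', Year=2015, RAM=8, HDD=256, ScreenSize=13.3)]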

As you can now see, a DataFrame is a collection of Row(...) objects. A Row(...) object consists of named fields, unlike the plain tuples of an RDD.

If the preceding Row(...) object looks similar to a dictionary to you, you are not wrong. Any Row(...) object can be converted into a dictionary using the .asDict(...) method. For more information, check out http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Row.
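For example, converting the first (illustrative) row from above might look like this:

    row = sample_data_df.take(1)[0]
    row.asDict()
    # {'Id': 1, 'Model': 'MacBook Pro', 'Year': 2015,
    #  'RAM': 8, 'HDD': 256, 'ScreenSize': 13.3}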

If, however, we were to have a look at the data within the sample_data_df DataFrame, using the .show(...) method, we would see the following:
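With the illustrative data, sample_data_df.show() would print something along these lines:

    +---+-----------+----+---+----+----------+
    | Id|      Model|Year|RAM| HDD|ScreenSize|
    +---+-----------+----+---+----+----------+
    |  1|MacBook Pro|2015|  8| 256|      13.3|
    |  2|    MacBook|2016|  8| 256|      12.0|
    |  3|MacBook Air|2016|  8| 128|      13.3|
    |  4|       iMac|2017| 64|1024|      27.0|
    +---+-----------+----+---+----+----------+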

Since DataFrames have a schema, let's see the schema of our sample_data_df using the .printSchema() method:
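For the illustrative DataFrame above, the inferred schema would look roughly like this (Python integers are inferred as long and floats as double):

    root
     |-- Id: long (nullable = true)
     |-- Model: string (nullable = true)
     |-- Year: long (nullable = true)
     |-- RAM: long (nullable = true)
     |-- HDD: long (nullable = true)
     |-- ScreenSize: double (nullable = true)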

As you can see, the columns in our DataFrame have datatypes matching those of the original sample_data RDD.

Even though Python is not a strongly-typed language, DataFrames in PySpark are. Unlike RDDs, every element of a DataFrame column has a specified type (these are all listed in the pyspark.sql.types submodule) and all the data must conform to the specified schema. 
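If you prefer not to rely on type inference, you can pass an explicit schema to .createDataFrame(...). A sketch, assuming the illustrative columns used above:

    from pyspark.sql.types import (StructType, StructField,
                                   LongType, StringType, DoubleType)

    # explicit schema: every column gets a declared type and nullability
    schema = StructType([
        StructField('Id', LongType(), True),
        StructField('Model', StringType(), True),
        StructField('Year', LongType(), True),
        StructField('RAM', LongType(), True),
        StructField('HDD', LongType(), True),
        StructField('ScreenSize', DoubleType(), True)
    ])

    sample_data_typed_df = spark.createDataFrame(sample_data, schema)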