If you have read the previous chapter, you probably already know how to create RDDs. In this example, we simply call the sc.parallelize(...) method.
Our sample dataset contains just a handful of records describing relatively recent Apple computers. However, as with all RDDs, it is hard to figure out what each element of a tuple stands for, since RDDs are schema-less structures.
Therefore, when using the .createDataFrame(...) method of SparkSession, we pass a list of column names as the second argument; the first argument is the RDD we wish to transform into a DataFrame.
Now, if we peek inside the sample_data RDD using sample_data.take(1), we will retrieve the first record:
For comparison, we can run sample_data_df.take(1) on the DataFrame to retrieve the following:
As you can now see, a DataFrame is a collection of Row(...) objects. Unlike the bare tuples of an RDD, the fields of a Row(...) object are named.
If, however, we were to have a look at the data within the sample_data_df DataFrame, using the .show(...) method, we would see the following:
Since DataFrames have a schema, let's inspect the schema of our sample_data_df using the .printSchema() method:
As you can see, the columns in our DataFrame have datatypes that match those of the elements in the original sample_data RDD.