A Spark DataFrame is a distributed collection of rows organized under named columns. Less technically, it can be thought of as a table in a relational database, complete with column headers. A PySpark DataFrame is also similar to a pandas DataFrame in Python. However, it shares several characteristics with RDDs:
- Immutable: Just like an RDD, once a DataFrame is created, it can't be changed; transformations always produce a new DataFrame. We can also convert a DataFrame to an RDD and vice versa.
- Lazy evaluation: Like RDDs, DataFrames are evaluated lazily. In other words, transformations are not executed until an action is performed (see the sketch after this list).
- Distributed: Both RDDs and DataFrames are distributed in nature.
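To make these three properties concrete, here is a minimal sketch using the standard pyspark.sql API; the column names, sample rows, and application name are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

# Create a small DataFrame from an in-memory list of tuples
# (the data and column names are made up for illustration).
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Immutability: withColumn does not modify df; it returns a new DataFrame.
df2 = df.withColumn("id_plus_one", df["id"] + 1)

# Lazy evaluation: this filter only builds an execution plan; nothing runs yet.
filtered = df2.filter(df2["id"] > 1)

# An action such as show() (or count(), collect()) triggers actual execution.
filtered.show()

# Interoperability: convert a DataFrame to an RDD of Row objects and back.
rdd = df.rdd
df_again = spark.createDataFrame(rdd)
```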
Just like DataFrames in Java and Scala, PySpark DataFrames are designed for processing large collections of structured data; they can handle even petabytes of data. The tabular structure makes the schema of a DataFrame explicit, which in turn lets Spark optimize the execution plans of SQL queries. In addition, DataFrames support a wide range of data formats and sources.
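For example, the built-in reader handles CSV, JSON, Parquet, and many other sources out of the box. The following is a minimal sketch; the file paths are placeholders, while the reader calls themselves are standard API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The paths below are hypothetical; substitute your own data files.
people = spark.read.csv("people.csv", header=True, inferSchema=True)
events = spark.read.json("events.json")
logs = spark.read.parquet("logs.parquet")

# printSchema() prints the column names and types Spark inferred; this
# schema is what the SQL optimizer relies on when building execution plans.
people.printSchema()
```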
You can create RDDs, datasets, and DataFrames in a number of ways using PySpark. In the following subsections, we will show some examples of how to do this.