Recipe 3-1. Read a CSV file
Recipe 3-2. Read a JSON file
Recipe 3-3. Save a DataFrame as a CSV file
Recipe 3-4. Save a DataFrame as a JSON file
Recipe 3-5. Read ORC files
Recipe 3-6. Read a Parquet file
Recipe 3-7. Save a DataFrame as an ORC file
Recipe 3-8. Save a DataFrame as a Parquet file
Recipe 3-9. Read data from MySQL
Recipe 3-10. Read data from PostgreSQL
Recipe 3-11. Read data from Cassandra
Recipe 3-12. Read data from MongoDB
Recipe 3-13. Save a DataFrame to MySQL
Recipe 3-14. Save a DataFrame to PostgreSQL
Recipe 3-15. Save a DataFrame to MongoDB
Recipe 3-16. Read data from Apache Hive
Recipe 3-1. Read a CSV File
Problem
You want to read a CSV (comma-separated value) file.
Solution
The CSV file shown in Figure 3-1 is named swimmerData.csv. It is tabular, with four columns: id, Gender, Occupation, and swimTimeInSecond. The first three columns contain String data; the last column contains floating-point (double) values.
We can read a CSV file using the spark.read.csv() function, where spark is an object of the SparkSession class. This function takes many arguments, but we will discuss only the most important ones. The path argument specifies the location of the CSV file to be read. A PySpark SQL DataFrame has a tabular structure similar to an RDBMS table, so we also have to specify the schema of the DataFrame; we do this with the second argument of the csv() function, schema.
The name CSV is misleading, because these files can use any character as the data field separator, not just a comma. We specify the separator with the sep argument. The header argument tells the reader whether the file has a header line: if its value is None, there is no header in the file; if there is one, we must set header to True. The function can also infer the schema itself when the inferSchema argument is set to True.
How It Works
Let’s start by creating the schema of the DataFrame.
Step 3-1-1. Creating the Schema of the DataFrame
The schema of the four columns of the DataFrame can be observed by using the swimmerDfSchema variable.
Step 3-1-2. Reading a CSV File
We read the file and created the DataFrame.
Step 3-1-3. Reading a CSV File with the inferSchema Argument Set to True
PySpark SQL has inferred the schema correctly.
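Both read paths can be sketched as one helper; the function name and the header=True assumption are ours, and `spark` is the session object available in the PySpark shell:

```python
def read_swimmer_csv(spark, path, schema=None):
    """Read swimmerData.csv. With an explicit schema, the columns
    get the declared types; with schema=None, inferSchema=True asks
    PySpark SQL to work the types out from the data itself."""
    if schema is not None:
        return spark.read.csv(path, schema=schema, header=True)
    return spark.read.csv(path, header=True, inferSchema=True)

# In the PySpark shell (where `spark` already exists):
# swimmerDf = read_swimmer_csv(spark, "swimmerData.csv",
#                              schema=swimmerDfSchema)
# swimmerDf.printSchema()
```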
Recipe 3-2. Read a JSON File
Problem
You want to read a JSON (JavaScript Object Notation) file.
Solution
Imagine this data in tabular form: it has three columns, named iv1, iv2, and iv3, and each column holds decimal (floating-point) values.
The spark.read.json() function reads JSON files. Like csv(), the json() function takes many arguments. The first, path, gives the location of the file to be read; the second, schema, gives the schema of the DataFrame to be created.
How It Works
We have created the DataFrame successfully.
Everything is as expected: the datatype of each column is double (DoubleType()). So, when reading a JSON file, if we do not provide a schema, PySpark SQL infers it.
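The read can be sketched as a small helper; the file name corrData.json is our placeholder (the original listing is not shown), and `spark` is the shell's SparkSession:

```python
def read_json_file(spark, path):
    """Read a JSON file; with no schema argument, PySpark SQL
    infers the column types (here iv1, iv2, iv3 as double)."""
    return spark.read.json(path)

# In the PySpark shell:
# corrData = read_json_file(spark, "corrData.json")
# corrData.printSchema()
```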
Recipe 3-3. Save a DataFrame as a CSV File
Problem
You want to save the contents of a DataFrame to a CSV file.
Solution
Whenever we want to write or save a DataFrame to an external storage system, we can use the DataFrameWriter class and the methods defined on it. We access the DataFrameWriter through DataFrame.write. So, to save our DataFrame as a CSV file, we use the DataFrame.write.csv() function.
Similar to the spark.read.csv() function, the DataFrame.write.csv() function takes many arguments. Let's discuss three of them: path, sep, and header. The path argument defines the directory where the DataFrame will be written. We can specify the data field separator using the sep argument. If the value of the header argument is True, the header of the DataFrame is written as the first line of the CSV file.
We are going to write the corrData DataFrame into the csvFileDir directory. We created the corrData DataFrame in Recipe 3-2.
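A minimal sketch of that write, wrapped in a helper of our own naming:

```python
def save_as_csv(df, path="csvFileDir"):
    """Write a DataFrame into the given directory as CSV part
    files, with a header line and ',' as the field separator."""
    df.write.csv(path, sep=",", header=True)

# In the PySpark shell:
# save_as_csv(corrData)
```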
Note
You can read more about the DataFrameWriter class from the following link
https://spark.apache.org/docs/2.3.0/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter
How It Works
Now it is confirmed that the contents have been written to a CSV file.
Recipe 3-4. Save a DataFrame as a JSON File
Problem
You want to save the contents of a DataFrame as a JSON file.
Solution
To save a DataFrame as a JSON file, we use the json() function of the DataFrameWriter class. We are going to save the swimmerDf DataFrame as a JSON file.
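As a sketch (the output directory name jsonData is our placeholder, not the book's):

```python
def save_as_json(df, path="jsonData"):
    """Write the DataFrame as JSON part files; each row becomes
    one JSON object on its own line."""
    df.write.json(path)

# In the PySpark shell:
# save_as_json(swimmerDf)
```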
How It Works
The file called part-00000-51e76de8-127f-4549-84a8-7ea972632a4d-c000.json contains the swimmerDf DataFrame data.
PySpark SQL saved the DataFrame’s contents into the JSON file correctly.
Recipe 3-5. Read ORC Files
Problem
You want to read an ORC (Optimized Row Columnar) file.
Solution
The table data shown in Figure 3-2 is in an ORC file in the duplicateData directory. We have to read it.
We are going to read the ORC file using the spark.read.orc() function; remember that spark is an object of the SparkSession class.
Note
More about ORC files can be found from the following link
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
How It Works
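The read can be sketched as follows; ORC files carry their own schema, so none has to be supplied (the result variable name is our assumption):

```python
def read_orc_dir(spark, path="duplicateData"):
    """Read the ORC file(s) under the given directory; the schema
    is stored inside the ORC files themselves."""
    return spark.read.orc(path)

# In the PySpark shell:
# duplicateDataDf = read_orc_dir(spark)
# duplicateDataDf.show()
```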
Recipe 3-6. Read a Parquet File
Problem
You want to read a Parquet file.
Solution
Apache Parquet is an open source, columnar file format developed for the Apache Hadoop ecosystem. It provides efficient data encoding and compression.
We have a Parquet file in a directory called temperatureData. We can read the Parquet file using the spark.read.parquet() function of PySpark SQL.
Note
You can read more about Parquet files from the following links
How It Works
We can see that we have the DataFrame.
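A sketch of the read, mirroring the ORC case (the result variable name is our placeholder):

```python
def read_parquet_dir(spark, path="temperatureData"):
    """Read the Parquet file(s) under the given directory; like
    ORC, Parquet stores its own schema alongside the data."""
    return spark.read.parquet(path)

# In the PySpark shell:
# temperatureDf = read_parquet_dir(spark)
# temperatureDf.show()
```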
Recipe 3-7. Save a DataFrame as an ORC File
Problem
You want to save a DataFrame as an ORC file.
Solution
We are going to save the swimmerDf DataFrame to an ORC file. We are going to save it in a directory called orcData. We are going to use the orc() function of the DataFrameWriter class to save the DataFrame as an ORC file.
How It Works
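A minimal sketch of the write; the overwrite mode is our addition, used so the example can be re-run against an existing orcData directory:

```python
def save_as_orc(df, path="orcData"):
    """Write the DataFrame into the orcData directory as ORC
    part files, replacing the directory if it already exists."""
    df.write.orc(path, mode="overwrite")

# In the PySpark shell:
# save_as_orc(swimmerDf)
```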
Recipe 3-8. Save a DataFrame as a Parquet File
Problem
You want to save a DataFrame as a Parquet file.
Solution
We created the duplicateDataDf DataFrame in Recipe 3-5. We are going to use the parquet() function of the DataFrameWriter class.
How It Works
Data has been written to the file called part-00000-d44eb594-46da-495a-88e9-934ca1eff270-c000.snappy.parquet.
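The write can be sketched as follows; the output directory name is our placeholder, and snappy is Parquet's default compression, which explains the .snappy.parquet file names seen above:

```python
def save_as_parquet(df, path="parquetData"):
    """Write the DataFrame as Parquet part files using snappy
    compression (the default)."""
    df.write.parquet(path, compression="snappy")

# In the PySpark shell:
# save_as_parquet(duplicateDataDf)
```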
Recipe 3-9. Read Data from MySQL
Problem
You want to read a table of data from MySQL.
Solution
How It Works
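Reading from MySQL goes through Spark's generic JDBC source. The sketch below uses placeholder host, port, database, and credential values; the MySQL JDBC driver jar must be on the classpath (for example, by starting the shell with pyspark --driver-class-path pointing at the mysql-connector jar):

```python
def mysql_jdbc_url(host="localhost", port=3306, db="pysparksqldb"):
    """Build a MySQL JDBC URL; host, port, and database name
    here are placeholders for your own server."""
    return "jdbc:mysql://{0}:{1}/{2}".format(host, port, db)

def read_mysql_table(spark, table, user, password):
    """Read one MySQL table into a DataFrame via JDBC."""
    return (spark.read
                 .format("jdbc")
                 .option("url", mysql_jdbc_url())
                 .option("dbtable", table)
                 .option("user", user)
                 .option("password", password)
                 .load())
```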
Recipe 3-10. Read Data from PostgreSQL
Problem
You want to read a table of data from a PostgreSQL database.
Solution
Now in the PySpark shell, we can write commands to read data from the PostgreSQL database.
How It Works
Let’s read the data table from PostgreSQL and create a DataFrame.
We have read the firstverticaltable table and created the verticalDfOne DataFrame.
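The PostgreSQL read follows the same JDBC pattern; the database name and credentials below are placeholders, and the PostgreSQL JDBC driver jar must be on the classpath:

```python
def postgres_jdbc_url(host="localhost", port=5432, db="pysparkbookdb"):
    """Build a PostgreSQL JDBC URL from placeholder values."""
    return "jdbc:postgresql://{0}:{1}/{2}".format(host, port, db)

def read_postgres_table(spark, table, user, password):
    """Read a PostgreSQL table (e.g., firstverticaltable) via JDBC."""
    return (spark.read
                 .format("jdbc")
                 .option("url", postgres_jdbc_url())
                 .option("dbtable", table)
                 .option("user", user)
                 .option("password", password)
                 .load())

# In the PySpark shell:
# verticalDfOne = read_postgres_table(spark, "firstverticaltable",
#                                     "postgres", "password")
```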
Recipe 3-11. Read Data from Cassandra
Problem
You want to read a table of data from a Cassandra database.
Solution
Note
You can read more about connecting PySpark SQL to Cassandra on the following web page
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md
How It Works
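With the DataStax spark-cassandra-connector on the classpath (added via --packages when starting the shell, as described at the link above), the read can be sketched as follows; the keyspace and table names in the usage comment are hypothetical:

```python
def read_cassandra_table(spark, keyspace, table):
    """Read a Cassandra table through the
    spark-cassandra-connector data source."""
    return (spark.read
                 .format("org.apache.spark.sql.cassandra")
                 .options(keyspace=keyspace, table=table)
                 .load())

# In the PySpark shell:
# cassandraDf = read_cassandra_table(spark, "mykeyspace", "mytable")
```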
Recipe 3-12. Read Data from MongoDB
Problem
You want to read a collection of data from MongoDB.
Solution
We have printed the first five records from our restaurantSurvey collection. We have to read this collection data using PySpark SQL and create a DataFrame.
Note
You can read about reading data from MongoDB in the following links
https://docs.mongodb.com/spark-connector/master/python-api/
https://docs.mongodb.com/spark-connector/master/python/read-from-mongodb/
How It Works
We have read our data.
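The read can be sketched with the MongoDB Spark connector (added via --packages when launching the shell); the database name pysparksqlbook and host are our placeholders:

```python
def mongo_uri(db="pysparksqlbook", collection="restaurantSurvey",
              host="127.0.0.1"):
    """Build a MongoDB connection URI from placeholder values."""
    return "mongodb://{0}/{1}.{2}".format(host, db, collection)

def read_mongo_collection(spark, uri):
    """Read a MongoDB collection through the MongoDB Spark
    connector's DataFrame source."""
    return (spark.read
                 .format("com.mongodb.spark.sql.DefaultSource")
                 .option("uri", uri)
                 .load())

# In the PySpark shell:
# surveyDf = read_mongo_collection(spark, mongo_uri())
```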
Recipe 3-13. Save a DataFrame to MySQL
Problem
You want to save the contents of a DataFrame to MySQL.
Solution
In order to save a DataFrame to MySQL, we are going to use the DataFrameWriter class, which we access through DataFrame.write.
How It Works
Step 3-13-1. Creating a DataFrame
So the ourDf DataFrame is ready to be saved.
Step 3-13-2. Saving the DataFrame into a MySQL Database
We are successful.
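The two steps above can be sketched as follows; the table name, database, and credentials are placeholders, and the MySQL JDBC driver jar must be on the classpath:

```python
def save_to_mysql(df, table, user, password,
                  url="jdbc:mysql://localhost:3306/pysparksqldb"):
    """Append the DataFrame's rows to a MySQL table over JDBC."""
    (df.write
       .format("jdbc")
       .option("url", url)
       .option("dbtable", table)
       .option("user", user)
       .option("password", password)
       .mode("append")
       .save())

# In the PySpark shell:
# ourDf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
# save_to_mysql(ourDf, "ourtable", "root", "password")
```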
Recipe 3-14. Save a DataFrame to PostgreSQL
Problem
You want to save a DataFrame into PostgreSQL.
Solution
Saving DataFrame content to PostgreSQL requires a PostgreSQL JDBC connector.
How It Works
Step 3-14-1. Creating a DataFrame
Step 3-14-2. Saving a DataFrame into a PostgreSQL Database
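The save step can be sketched with DataFrameWriter's jdbc() function; URL, table, and credentials are placeholders, and the PostgreSQL JDBC driver jar must be on the classpath:

```python
def save_to_postgres(df, table, user, password,
                     url="jdbc:postgresql://localhost:5432/pysparkbookdb"):
    """Append the DataFrame to a PostgreSQL table over JDBC."""
    properties = {"user": user, "password": password,
                  "driver": "org.postgresql.Driver"}
    df.write.jdbc(url=url, table=table, mode="append",
                  properties=properties)
```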
Recipe 3-15. Save a DataFrame to MongoDB
Problem
You want to save DataFrame contents as a collection in MongoDB.
Solution
In order to save the DataFrame to MongoDB, we have to use the connector associated with it.
Note
You can read more about writing data to MongoDB using PySpark from the following link.
https://docs.mongodb.com/spark-connector/master/python/write-to-mongodb/
How It Works
Finally, we have our DataFrame content in MongoDB.
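The write can be sketched with the MongoDB Spark connector (added via --packages, per the link above); the database and collection names below are placeholders:

```python
def save_to_mongo(df, db="pysparksqlbook", collection="surveyResults",
                  host="127.0.0.1"):
    """Append the DataFrame as documents in a MongoDB collection
    through the MongoDB Spark connector."""
    uri = "mongodb://{0}/{1}.{2}".format(host, db, collection)
    (df.write
       .format("com.mongodb.spark.sql.DefaultSource")
       .option("uri", uri)
       .mode("append")
       .save())
```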
Recipe 3-16. Read Data from Apache Hive
Problem
You want to read a table of data from Apache Hive.
Solution
We have displayed some rows of the filamenttable table. We have to read this table data using PySpark SQL.
We can read the table from Hive using PySpark SQL and the spark.table() function.
How It Works
And finally we created a DataFrame from the table in Apache Hive.