Similar to Scala, we can perform a SQL query on the DataFrame once it is saved as TempView using the createOrReplaceTempView() method. Let's see an example of that. At first, let's save the fight DataFrame (that is, flightDF) as follows:
# First, register the flights SparkDataFrame as a table
createOrReplaceTempView(flightDF, "flight")
Now let's select destination and destinations of all the flights with their associated carrier information as follows:
destDF<- sql("SELECT dest, origin, carrier FROM flight")
showDF(destDF, numRows=10)
The output is as follows:
Now let's make the SQL a bit more complex, such as finding the destination's airport of all the flights that are at least 120 minutes delayed as follows:
selected_flight_SQL<- sql("SELECT dest, origin, arr_delay FROM flight WHERE arr_delay>= 120")
showDF(selected_flight_SQL, numRows = 10)
The preceding code segment queries and shows the name of the airports of all the flights that are delayed by at least 2 hours:
Now let's do a more complex query. Let's find the origins of all the flights to Iowa that are delayed by at least 2 hours. Finally, sort them by arrival delay and limit the count up to 20 as follows:
selected_flight_SQL_complex<- sql("SELECT origin, dest, arr_delay FROM flight WHERE dest='IAH' AND arr_delay>= 120 ORDER BY arr_delay DESC LIMIT 20")
showDF(selected_flight_SQL_complex, numRows=20)
The preceding code segment queries and shows the name of the airports of all the flights that are delayed by at least 2 hours to Iowa: