We have seen how to apply an operation to every row in a DataFrame to create a new column, and we have seen how to use filters to build new DataFrames with a subset of rows from the original DataFrame. The last set of operations on DataFrames is grouping operations, equivalent to the GROUP BY
statement in SQL. Let's calculate the average BMI for smokers and non-smokers. We must first tell Spark to group the DataFrame by a column (the isSmoker
column, in this case), and then apply an aggregation operation (averaging, in this case) to reduce each group:
scala> val smokingDF = readingsWithBmiDF.groupBy("isSmoker").agg(avg("BMI"))
smokingDF: org.apache.spark.sql.DataFrame = [isSmoker: boolean, AVG(BMI): double]
This has created a new DataFrame with two columns: the grouping column and the column over which we aggregated. Let's show this DataFrame:
scala> smokingDF.show
+--------+------------------+
|isSmoker|          AVG(BMI)|
+--------+------------------+
|    true|23.733355491389517|
|   false|23.095078245456424|
+--------+------------------+
Besides averaging, there are several operators for performing the aggregation across each group. We outline some of the more important ones in the table below, but, for a full list, consult the Aggregate functions section of http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$:
Operator | Notes
---|---
`avg(column)` | Group averages of the values in the specified column.
`count(column)` | Number of elements in each group in the specified column.
`countDistinct(column, ...)` | Number of distinct elements in each group. This can also accept multiple columns to return the count of unique elements across several columns.
`first(column)`, `last(column)` | First/last element in each group.
`max(column)`, `min(column)` | Largest/smallest element in each group.
`sum(column)` | Sum of the values in each group.
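Several of these aggregations can be combined in a single `agg` call, producing one output column per aggregation. A minimal sketch, assuming a local SparkSession and an illustrative dataset standing in for our readings:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session and made-up sample data, for illustration only.
val spark = SparkSession.builder.master("local[*]").appName("agg-sketch").getOrCreate()
import spark.implicits._

val readings = Seq(
  (true, 23.7), (true, 23.8), (false, 23.1)
).toDF("isSmoker", "BMI")

// One groupBy, several aggregations: each operator adds a column.
val summary = readings.groupBy("isSmoker").agg(
  count("BMI"),
  avg("BMI"),
  max("BMI"),
  min("BMI")
)
summary.show
```

The resulting DataFrame has the grouping column followed by one column per aggregation expression.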
Each aggregation operator takes either the name of a column, as a string, or an expression of type Column
. The latter allows aggregation of compound expressions. If we wanted the average height, in meters, of the smokers and non-smokers in our sample, we could use:
scala> readingsDF.groupBy("isSmoker").agg { avg($"heightCm"/100.0) }.show
+--------+-----------------------+
|isSmoker|AVG((heightCm / 100.0))|
+--------+-----------------------+
|    true|                  1.715|
|   false|     1.6949999999999998|
+--------+-----------------------+
We can also use compound expressions to define the column on which to group. For instance, to count the number of patients in each age
group, increasing by decade, we can use:
scala> readingsDF.groupBy(floor($"age"/10)).agg(count("*")).show
+-----------------+--------+
|FLOOR((age / 10))|count(1)|
+-----------------+--------+
|              4.0|       3|
|              2.0|       1|
+-----------------+--------+
We have used the shorthand "*" to count every row in each group, rather than the non-null values of a particular column.
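The distinction matters when a column contains nulls: `count("*")` counts all rows in a group, while `count` on a named column skips rows where that column is null. A hedged sketch, assuming a local SparkSession and illustrative column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session and made-up data, for illustration only.
val spark = SparkSession.builder.master("local[*]").appName("count-star").getOrCreate()
import spark.implicits._

// Group "A" contains one row with a null age.
val df = Seq(
  (Some(20), "A"), (None: Option[Int], "A"), (Some(35), "B")
).toDF("age", "label")

// count("*") counts every row in the group; count("age") skips null ages.
val grouped = df.groupBy("label").agg(count("*"), count("age"))
grouped.show
```

For group "A", `count("*")` reports both rows but `count("age")` reports only the row with a non-null age.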