count is the most basic aggregate function; it simply counts the number of rows for the specified column. countDistinct extends this by eliminating duplicates before counting.
The count API has several overloads, shown as follows. Which one you use depends on the specific use case:
def count(columnName: String): TypedColumn[Any, Long]
Aggregate function: returns the number of items in a group.
def count(e: Column): Column
Aggregate function: returns the number of items in a group.
def countDistinct(columnName: String, columnNames: String*): Column
Aggregate function: returns the number of distinct items in a group.
def countDistinct(expr: Column, exprs: Column*): Column
Aggregate function: returns the number of distinct items in a group.
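The difference between the two functions can be illustrated with a plain Scala collection, outside Spark (a minimal sketch; the sample state codes are made up):

```scala
// A small sample with duplicate state names, standing in for a DataFrame column.
val states = Seq("CA", "NY", "CA", "TX", "NY", "CA")

// count: number of items in the group, duplicates included.
val total = states.size            // 6

// countDistinct: number of unique items in the group.
val unique = states.distinct.size  // 3
```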
Let's look at examples of invoking count and countDistinct on the DataFrame to print the row counts. In each case, the two invocations shown are equivalent:
import org.apache.spark.sql.functions._
scala> statesPopulationDF.select(col("*")).agg(count("State")).show
scala> statesPopulationDF.select(count("State")).show
+------------+
|count(State)|
+------------+
|         350|
+------------+
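Note that count on a named column counts only non-null values, whereas count("*") counts every row. That distinction can be sketched in plain Scala, modeling a nullable column as a sequence of Option values (sample data is made up):

```scala
// Rows where the "State" value may be missing, modeled as Option.
val stateColumn: Seq[Option[String]] = Seq(Some("CA"), None, Some("NY"), Some("CA"))

// count("State"): only non-null values are counted.
val nonNullCount = stateColumn.flatten.size  // 3

// count("*"): every row is counted, nulls included.
val rowCount = stateColumn.size              // 4
```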
scala> statesPopulationDF.select(col("*")).agg(countDistinct("State")).show
scala> statesPopulationDF.select(countDistinct("State")).show
+---------------------+
|count(DISTINCT State)|
+---------------------+
|                   50|
+---------------------+
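As the signatures above show, countDistinct also accepts multiple columns, in which case it counts distinct combinations of their values. A plain-Scala sketch of that semantics, with each tuple standing in for a pair of column values (sample data is made up):

```scala
// Each tuple stands in for a (State, Year) pair drawn from two columns.
val pairs = Seq(("CA", 2010), ("CA", 2011), ("CA", 2010), ("NY", 2010))

// countDistinct("State", "Year"): distinct (State, Year) combinations.
val distinctPairs = pairs.distinct.size  // 3
```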