In this section, we will use the Dataset API in an immutable way. We will cover the following topics:
- Dataset immutability
- Creating two leaves from one root dataset
- Adding a new column by issuing a transformation
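The code below imports UserData from com.tomekl007. Its definition is not shown in this section, but a minimal sketch (an assumption on our part; the actual class may differ) is a simple case class with two string fields:
//hypothetical definition; the actual com.tomekl007.UserData may differ
case class UserData(userId: String, data: String)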
The test case for the Dataset is quite similar to the DataFrame one, but we need to call toDS() on our data so that it is type safe. The type of the Dataset is UserData, as shown in the following example:
import com.tomekl007.UserData
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class ImmutableDataSet extends FunSuite {
  val spark: SparkSession = SparkSession
    .builder().master("local[2]").getOrCreate()

  test("Should use immutable Dataset API") {
    import spark.implicits._
    //given
    val userData =
      spark.sparkContext.makeRDD(List(
        UserData("a", "1"),
        UserData("b", "2"),
        UserData("d", "200")
      )).toDS()
Now, we will filter userData using isin, as shown in the following example:
    //when
    val res = userData.filter(userData("userId").isin("a"))
It will return the result (res), which is a leaf with our one element, while userData, the parent root, still has three elements. Let's execute this program, as shown in the following example:
    assert(res.count() == 1)
    assert(userData.count() == 3)
  }
}
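As an aside, isin accepts a variable number of values, so we could keep several user IDs in a single filter. The following is a minimal sketch using the IDs from our test data:
//isin takes varargs, so several users can be kept in one filter
val multi = userData.filter(userData("userId").isin("a", "b"))
//multi is another leaf; userData is still untouched
assert(multi.count() == 2)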
We can see that our test passed, which means that the Dataset is also an immutable abstraction on top of the DataFrame and exhibits the same characteristics. userData has something very useful: it is strongly typed. If you use the show() method, Spark already knows the schema, for example that the userId field is a string, as shown in the following example:
userData.show()
The output will be as follows:
+------+----+
|userId|data|
+------+----+
|     a|   1|
|     b|   2|
|     d| 200|
+------+----+
In the preceding output, we have both the userId and data fields.
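Finally, to illustrate the third topic from the list at the beginning of this section, here is a minimal sketch of adding a new column by issuing a transformation. The column name dataLength is our own choice for illustration; withColumn returns a new leaf and leaves userData untouched:
import org.apache.spark.sql.functions.length

//withColumn is a transformation: it returns a new leaf
//and does not modify the parent userData
val withLength = userData.withColumn("dataLength", length(userData("data")))
withLength.show()
//userData still has only the userId and data columns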