Spark 2.x supports a different way of defining schemas for complex data types. First, let's look at a simple example.
The Encoders object must be imported before you can use it:
import org.apache.spark.sql.Encoders
Let's look at a simple example of defining a tuple as a data type to be used in the dataset APIs:
scala> Encoders.product[(Integer, String)].schema.printTreeString
root
|-- _1: integer (nullable = true)
|-- _2: string (nullable = true)
The preceding approach is cumbersome to use every time, so we can instead define a case class for our needs and then use it. Let's define a case class Record with two fields: an Integer and a String:
scala> case class Record(i: Integer, s: String)
defined class Record
Using Encoders, we can easily create a schema from the case class, allowing us to use the various APIs with ease:
scala> Encoders.product[Record].schema.printTreeString
root
|-- i: integer (nullable = true)
|-- s: string (nullable = true)
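With the Record case class in scope, the encoder is derived automatically when we build a Dataset from it. As a sketch (assuming a spark-shell session, where spark.implicits._ is imported automatically so that toDS() is available), the Dataset carries the same schema we just printed:
scala> val ds = Seq(Record(1, "a"), Record(2, "b")).toDS()
ds: org.apache.spark.sql.Dataset[Record] = [i: int, s: string]
scala> ds.schema.printTreeString
root
|-- i: integer (nullable = true)
|-- s: string (nullable = true)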
All the data types of Spark SQL are located in the package org.apache.spark.sql.types. You can access them by doing:
import org.apache.spark.sql.types._
You should use the DataTypes object in your code to create complex Spark SQL types such as arrays or maps, as follows:
scala> import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.DataTypes
scala> val arrayType = DataTypes.createArrayType(IntegerType)
arrayType: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,true)
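The same DataTypes factory also creates map types. As a sketch, continuing the same session (assuming org.apache.spark.sql.types._ has been imported so that StringType and DoubleType are in scope):
scala> val mapType = DataTypes.createMapType(StringType, DoubleType)
mapType: org.apache.spark.sql.types.MapType = MapType(StringType,DoubleType,true)
Note that the trailing true is the valueContainsNull flag, which defaults to true.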
The following are the data types supported in Spark SQL APIs:
| Data type | Value type in Scala | API to access or create a data type |
| --- | --- | --- |
| ByteType | Byte | ByteType |
| ShortType | Short | ShortType |
| IntegerType | Int | IntegerType |
| LongType | Long | LongType |
| FloatType | Float | FloatType |
| DoubleType | Double | DoubleType |
| DecimalType | java.math.BigDecimal | DecimalType |
| StringType | String | StringType |
| BinaryType | Array[Byte] | BinaryType |
| BooleanType | Boolean | BooleanType |
| TimestampType | java.sql.Timestamp | TimestampType |
| DateType | java.sql.Date | DateType |
| ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]) |
| MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]). Note: the default value of valueContainsNull is true. |
| StructType | org.apache.spark.sql.Row | StructType(fields). Note: fields is a Seq of StructFields; two fields with the same name are not allowed. |
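As a sketch of the StructType entry in the table, a schema can be built directly from a Seq of StructFields and inspected just like the encoder-derived schemas earlier (assuming org.apache.spark.sql.types._ has been imported; the field names id and name here are illustrative):
scala> val schema = StructType(Seq(
     |   StructField("id", IntegerType, nullable = false),
     |   StructField("name", StringType, nullable = true)))
scala> schema.printTreeString
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)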