Spark 2.x supports a different way of defining schemas for complex data types. First, let's look at a simple example.
The Encoders object must be imported before you can use it:
import org.apache.spark.sql.Encoders
Let's look at a simple example of defining a tuple as a data type to be used in the dataset APIs:
scala> Encoders.product[(Integer, String)].schema.printTreeString
root
|-- _1: integer (nullable = true)
|-- _2: string (nullable = true)
The preceding approach is cumbersome to use every time, so we can instead define a case class for our needs and then use it. Let's define a case class Record with two fields: an Integer and a String:
scala> case class Record(i: Integer, s: String)
defined class Record
Using Encoders, we can easily create a schema from the case class, allowing us to use the various APIs with ease:
scala> Encoders.product[Record].schema.printTreeString
root
|-- i: integer (nullable = true)
|-- s: string (nullable = true)
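With the Record case class in scope, the encoder is derived automatically when we build a Dataset from it. As a sketch (assuming a spark-shell session, where spark.implicits._ is imported automatically so that toDS() is available), the Dataset carries the same schema we just printed:
scala> val ds = Seq(Record(1, "a"), Record(2, "b")).toDS()
ds: org.apache.spark.sql.Dataset[Record] = [i: int, s: string]
scala> ds.schema.printTreeString
root
|-- i: integer (nullable = true)
|-- s: string (nullable = true)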
All the data types of Spark SQL are located in the package org.apache.spark.sql.types. You can access them by doing:
import org.apache.spark.sql.types._
You should use the DataTypes object in your code to create complex Spark SQL types such as arrays or maps, as follows:
scala> import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.DataTypes
scala> val arrayType = DataTypes.createArrayType(IntegerType)
arrayType: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,true)
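The same DataTypes factory also creates map types. As a sketch, continuing the same session (assuming org.apache.spark.sql.types._ has been imported so that StringType and DoubleType are in scope):
scala> val mapType = DataTypes.createMapType(StringType, DoubleType)
mapType: org.apache.spark.sql.types.MapType = MapType(StringType,DoubleType,true)
Note that the trailing true is the valueContainsNull flag, which defaults to true.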
The following are the data types supported in Spark SQL APIs:
| Data type | Value type in Scala | API to access or create a data type |
| --- | --- | --- |
| ByteType | Byte | ByteType |
| ShortType | Short | ShortType |
| IntegerType | Int | IntegerType |
| LongType | Long | LongType |
| FloatType | Float | FloatType |
| DoubleType | Double | DoubleType |
| DecimalType | java.math.BigDecimal | DecimalType |
| StringType | String | StringType |
| BinaryType | Array[Byte] | BinaryType |
| BooleanType | Boolean | BooleanType |
| TimestampType | java.sql.Timestamp | TimestampType |
| DateType | java.sql.Date | DateType |
| ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]) |
| MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]). Note: the default value of valueContainsNull is true. |
| StructType | org.apache.spark.sql.Row | StructType(fields). Note: fields is a Seq of StructFields; two fields with the same name are not allowed. |
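As a sketch of the StructType entry in the table, a schema can be built directly from a Seq of StructFields and inspected just like the encoder-derived schemas earlier (assuming org.apache.spark.sql.types._ has been imported; the field names id and name here are illustrative):
scala> val schema = StructType(Seq(
     |   StructField("id", IntegerType, nullable = false),
     |   StructField("name", StringType, nullable = true)))
scala> schema.printTreeString
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)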