A schema is described using StructType, which is a collection of StructField objects.
StructType and StructField belong to the org.apache.spark.sql.types package.
Data types such as IntegerType and StringType also belong to this package.
By importing these classes, we can define an explicit schema.
First, import the necessary classes:
scala> import org.apache.spark.sql.types.{StructType, IntegerType, StringType}
import org.apache.spark.sql.types.{StructType, IntegerType, StringType}
Define a schema with two columns/fields, an integer followed by a string:
scala> val schema = new StructType().add("i", IntegerType).add("s", StringType)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(s,StringType,true))
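Note the true at the end of each StructField in the output: fields created with add are nullable by default. For explicit control over nullability, the same schema can be built directly from StructField objects. A minimal sketch (the name strictSchema is ours):
scala> import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructField

scala> val strictSchema = StructType(Seq(
     |   StructField("i", IntegerType, nullable = false), // i must not be null
     |   StructField("s", StringType, nullable = true)))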
It's easy to print the newly created schema:
scala> schema.printTreeString
root
|-- i: integer (nullable = true)
|-- s: string (nullable = true)
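A schema like this is usually attached to data rather than just printed. As a sketch, assuming the spark session provided by the shell, it can be applied to a handful of rows (the sample values are ours):
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val rows = spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))

scala> val df = spark.createDataFrame(rows, schema)

scala> df.printSchema // prints the same tree as printTreeString above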
The schema can also be printed as JSON, using the prettyJson method:
scala> schema.prettyJson
res85: String =
{
"type" : "struct",
"fields" : [ {
"name" : "i",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "s",
"type" : "string",
"nullable" : true,
"metadata" : { }
} ]
}
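The JSON form is round-trippable: DataType.fromJson parses it back into an equivalent StructType. A minimal sketch:
scala> import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.types.DataType

scala> val restored = DataType.fromJson(schema.json).asInstanceOf[StructType]

scala> restored == schema // true: the parsed schema equals the original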
All of the Spark SQL data types are located in the org.apache.spark.sql.types package. You can bring them all into scope with:
import org.apache.spark.sql.types._
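With everything in scope, richer schemas can be declared the same way. A hypothetical sketch of a nested schema (all field names are ours):
// A non-nullable long id, an array of string tags, and a nested address struct.
val nested = new StructType()
  .add("id", LongType, nullable = false)
  .add("tags", ArrayType(StringType))
  .add("address", new StructType().add("city", StringType).add("zip", StringType))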