Dataset from string and typed class

As mentioned earlier, a Dataset is a typed, immutable collection of objects that is mapped to a relational schema. The Dataset abstraction introduces a new concept in Spark called an encoder, which handles entity conversion, for example, the conversion between JVM objects and the corresponding tabular representation. You will find this API quite similar to RDD transformations such as map, mapToPair, flatMap, or filter.
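
Spark ships with built-in encoders for common types in the org.apache.spark.sql.Encoders factory class. As a minimal sketch, these are the two encoders used later in this section (the Encoder type comes from org.apache.spark.sql.Encoder, and the SMSSpamTokenizedBean class is introduced at the end of the section):

// Encoder for plain strings, used when reading the raw text file 
Encoder<String> stringEncoder = org.apache.spark.sql.Encoders.STRING(); 

// Encoder derived from a JavaBean, used for the typed Dataset 
Encoder<SMSSpamTokenizedBean> beanEncoder = org.apache.spark.sql.Encoders.bean(SMSSpamTokenizedBean.class); 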

We will show the spam filter example using the Dataset API in the following section. It reads the text file and returns a Dataset in a tabular format. We then perform a map transformation, just as with RDDs, to produce the (label, tokens) columns, supplying an additional encoder parameter. Here, we have used the bean encoder with the SMSSpamTokenizedBean class.

In this sub-section, we will show how to create a Dataset from a string and the typed class SMSSpamTokenizedBean. Let's create the Spark session first, as follows:

static SparkSession spark = SparkSession.builder() 
      .appName("DatasetDemo") 
      .master("local[*]") 
      .config("spark.sql.warehouse.dir", "E:/Exp/") 
      .getOrCreate(); 

Now create a new Dataset of type String from the SMS spam filtering dataset, that is, a Dataset<String>, and show the result as follows:

Dataset<String> ds = spark.read().text("input/SMSSpamCollection.txt").as(org.apache.spark.sql.Encoders.STRING()); 
ds.show(); 

Here is the output of the preceding code:

Figure 16: A snapshot of the spam filtering dataset using Dataset

Now let's create a second Dataset from the typed class SMSSpamTokenizedBean by mapping over the Dataset of strings we created immediately before, as follows:

// Requires org.apache.spark.api.java.function.MapFunction and java.util.ArrayList 
Dataset<SMSSpamTokenizedBean> dsSMSSpam = ds.map( 
    new MapFunction<String, SMSSpamTokenizedBean>() { 
      @Override 
      public SMSSpamTokenizedBean call(String value) throws Exception { 
        // Each line is tab-separated: the label ("spam" or "ham"), then the message 
        String[] split = value.split("\t"); 
        double label; 
        if (split[0].equalsIgnoreCase("spam")) 
          label = 1.0; 
        else 
          label = 0.0; 
        // Collect the message tokens, skipping the label field itself 
        ArrayList<String> tokens = new ArrayList<>(); 
        for (int i = 1; i < split.length; i++) 
          tokens.add(split[i].trim()); 
        return new SMSSpamTokenizedBean(label, tokens.toString()); 
      } 
    }, org.apache.spark.sql.Encoders.bean(SMSSpamTokenizedBean.class)); 
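
Since MapFunction is a functional interface, the same mapping can also be written more compactly with a Java 8 lambda. The following is a minimal sketch under the same assumptions (dsSMSSpam2 is just an illustrative variable name; the cast to MapFunction disambiguates the overloaded map method):

Dataset<SMSSpamTokenizedBean> dsSMSSpam2 = ds.map( 
    (MapFunction<String, SMSSpamTokenizedBean>) value -> { 
      String[] split = value.split("\t"); 
      double label = split[0].equalsIgnoreCase("spam") ? 1.0 : 0.0; 
      // Pass the message part (everything after the label) as the tokens string 
      return new SMSSpamTokenizedBean(label, split.length > 1 ? split[1].trim() : ""); 
    }, org.apache.spark.sql.Encoders.bean(SMSSpamTokenizedBean.class)); 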

Now let's print the Dataset along with its schema as follows:

dsSMSSpam.show(); 
dsSMSSpam.printSchema(); 
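
Because the bean exposes a Double property labelDouble and a String property tokens, the printed schema should look something like the following (the exact nullability flags may vary by Spark version):

root 
 |-- labelDouble: double (nullable = true) 
 |-- tokens: string (nullable = true) 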

The output is as follows:

Figure 17: Showing the tokens and labels, with the schema printed underneath

Now, if you would like to convert this typed Dataset into a Dataset of type Row, you can use the toDF() method; and to create a temporary view out of the new Dataset<Row>, you can use the createOrReplaceTempView() method, as follows:

Dataset<Row> df = dsSMSSpam.toDF(); 
df.createOrReplaceTempView("SMSSpamCollection");      
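
With the temporary view registered, the same data can also be queried with plain SQL. A minimal sketch (the query itself is illustrative; labelDouble is the column name derived from the bean):

// Count how many messages fall under each label (1.0 = spam, 0.0 = ham) 
spark.sql("SELECT labelDouble, COUNT(*) AS cnt FROM SMSSpamCollection GROUP BY labelDouble").show(); 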

Similarly, you might want to view the same Dataset by calling the show() method as follows:

df.show(); 

The output is as follows:

Figure 18: Corresponding labels and tokens; labels are converted into double values

Now let's explore the typed class SMSSpamTokenizedBean. The class works as a Java tokenized bean class for labeling the texts. More technically, the class takes the input, then sets and gets the label; it also sets and gets the tokens used for spam filtering. Including the getter and setter methods, here is the class:

public class SMSSpamTokenizedBean implements Serializable { 
  private Double labelDouble; 
  private String tokens; 

  // A public no-argument constructor is required by the bean encoder 
  // so that Spark can instantiate the class during deserialization 
  public SMSSpamTokenizedBean() { 
  } 

  public SMSSpamTokenizedBean(Double labelDouble, String tokens) { 
    this.labelDouble = labelDouble; 
    this.tokens = tokens; 
  } 

  public Double getLabelDouble() { 
    return labelDouble; 
  } 

  public void setLabelDouble(Double labelDouble) { 
    this.labelDouble = labelDouble; 
  } 

  public String getTokens() { 
    return tokens; 
  } 

  public void setTokens(String tokens) { 
    this.tokens = tokens; 
  } 
} 

Comparison between RDD, DataFrame and Dataset

There are several objectives behind introducing the Dataset as a new data structure in Spark. Although the RDD API is very flexible, it is sometimes harder to optimize the processing. On the other hand, the DataFrame API is much easier to optimize, but it lacks some of the nice features of RDDs. So, the goal of Datasets is to allow users to easily express transformations on objects while also providing the performance and robustness advantages of the Spark SQL execution engine, as sketched in the comparison below.
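
To make this trade-off concrete, here is a minimal sketch of the same filtering step expressed first against an RDD and then against a Dataset; the jsc JavaSparkContext variable and the imports (JavaRDD, FilterFunction) are assumptions not shown in this section:

// RDD style: flexible, but the lambda is opaque to Spark's optimizer 
JavaRDD<String> spamLinesRdd = jsc.textFile("input/SMSSpamCollection.txt") 
    .filter(line -> line.startsWith("spam")); 

// Dataset style: the same intent, but executed by the Spark SQL engine, 
// which can optimize the physical plan 
Dataset<String> spamLinesDs = spark.read().textFile("input/SMSSpamCollection.txt") 
    .filter((FilterFunction<String>) line -> line.startsWith("spam")); 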

A Dataset can perform many operations, such as sorting or shuffling, without deserializing the objects. Doing this requires an explicit encoder, which serializes the objects into a binary format and is capable of mapping the schema of a given object (bean) to the Spark SQL type system. By contrast, RDDs rely on runtime reflection-based serialization. Note that operations that change the type of the objects in a Dataset also need an encoder for the new type.
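
You can inspect this object-to-schema mapping directly through the encoder itself. A minimal sketch, assuming the SMSSpamTokenizedBean class from the previous section:

// The encoder exposes the Spark SQL schema derived from the bean's properties 
Encoder<SMSSpamTokenizedBean> encoder = org.apache.spark.sql.Encoders.bean(SMSSpamTokenizedBean.class); 
encoder.schema().printTreeString(); // prints labelDouble: double, tokens: string 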
