Monadic data transformation

The first step is to define a trait and method that describe the transformation of data by the computation units of a workflow. The data transformation is the foundation of any workflow for processing and classifying a dataset, training and validating a model, and displaying results.

Two symbolic models are used to define a data transformation:

  • Explicit model: The developer creates a model explicitly from a set of configuration parameters. Most deterministic algorithms and unsupervised learning techniques use an explicit model.
  • Implicit model: The developer provides a training set, that is, a set of labeled observations (observations with an expected outcome). A classifier extracts a model from the training set. Supervised learning techniques rely on models implicitly generated from labeled data.

Error handling

The simplest form of data transformation is a morphism between two types, U and V. The data transformation enforces a contract for validating an input and returning either a value or an error. From now on, we use the following convention:

  • Input value: The validation is implemented through a partial function, of the PartialFunction type, that is returned by the data transformation. A MatchError exception is thrown if the input value does not meet the required condition (contract).
  • Output value: The return type is Try[V], which wraps either the computed value or the exception raised in case of an error.

Note

Reusability of partial functions

Reusability is another benefit of partial functions, which is illustrated in the following code snippet:

import scala.util.Try

class F {
  // f returns a partial function that can be reused for several inputs
  def f: PartialFunction[Int, Try[Double]] = { case n: Int … 
  }
}
val pfn = (new F).f
pfn(4)
pfn(10)

Partial functions enable developers to implement methods that address the most common (primary) use cases for which input values have been tested. All other nontrivial use cases (or input values) generate a MatchError exception. At a later stage in the development cycle, the developer can implement the code to handle the less common use cases.
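As an illustration, here is a minimal sketch of a reusable partial function. The SqrtTransform class and its non-negative input condition are hypothetical and chosen only to show the behavior; they are not part of the book's library.

```scala
import scala.util.Try

object PartialFunctionSketch {
  // Hypothetical transformation: square root, defined only for non-negative input
  class SqrtTransform {
    def f: PartialFunction[Int, Try[Double]] = {
      case n: Int if n >= 0 => Try(math.sqrt(n.toDouble))
    }
  }

  // The partial function is created once and reused for several inputs
  val pfn = (new SqrtTransform).f
  val r1: Try[Double] = pfn(4)
  val r2: Try[Double] = pfn(9)

  // An input outside the domain is rejected with a scala.MatchError
  val rejected: Boolean =
    try { pfn(-1); false } catch { case _: MatchError => true }
}
```

The contract (n >= 0) lives in the guard of the case clause, so the body only has to deal with inputs that have already been accepted.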

Note

Runtime validation of a partial function

It is a good practice to validate if a partial function is defined for a specific value of the argument:

for {
  x <- Try(input) if pfn.isDefinedAt(x)   // guard: fails if pfn is undefined at x
  value <- pfn(x)
} yield { … }

This preemptive approach allows the developer to select an alternative method or a full function. It is an efficient alternative to catching a MatchError exception. The validation of a partial function is omitted throughout the book for the sake of clarity.
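The following sketch shows three standard ways of guarding the call to a partial function: isDefinedAt, applyOrElse, and lift, all of which are methods of scala.PartialFunction. The logarithm transform and the fallback values are assumptions made for the example.

```scala
import scala.util.{Try, Success, Failure}

object ValidationSketch {
  // Hypothetical transform: natural logarithm, defined for strictly positive input
  val pfn: PartialFunction[Double, Try[Double]] = {
    case x if x > 0.0 => Try(math.log(x))
  }

  // Option 1: explicit test with isDefinedAt before applying the function
  def safeApply(x: Double): Try[Double] =
    if (pfn.isDefinedAt(x)) pfn(x)
    else Failure(new IllegalArgumentException(s"Undefined for input $x"))

  // Option 2: applyOrElse supplies a fallback (full) function for rejected inputs
  def withFallback(x: Double): Try[Double] =
    pfn.applyOrElse(x, (_: Double) => Success(Double.NaN))

  // Option 3: lift turns the partial function into a total function returning an Option
  def lifted(x: Double): Option[Try[Double]] = pfn.lift(x)
}
```

Which variant fits best depends on whether a rejected input should become a Failure, a default value, or simply the absence of a result.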

Therefore, the signature of a data transformation is defined as follows:

def |> : PartialFunction[U, Try[V]]

Note

F# language references

The |> notation used as the signature of the transform is borrowed from the F# language [2:2].

Explicit models

The objective is to define a symbolic representation of the transformation of different types of data without exposing the internal state of the algorithm implementing the data transformation. The transformation on a dataset is performed using a model or configuration that is fully defined by the user, which is illustrated in the following diagram:

[Figure: Visualization of explicit models]

The transformation of an explicit configuration or model, config, is defined as an ETransform abstract class parameterized by the T type of the model:

abstract class ETransform[T](val config: T) { //explicit model
  type U   // type of input
  type V   // type of output
  def |> : PartialFunction[U, Try[V]]  // data transformation
}

The input U type and output V type have to be defined in the subclasses of ETransform. The |> transform operator returns a partial function that can be reused for different input values.

The creation of a class that implements a specific transformation using an explicit configuration is quite simple: all you need is the definition of an input/output U/V type and an implementation of the |> transformation method.
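To make this concrete, here is a minimal sketch of such a subclass. ETransform is repeated from the definition above so that the example is self-contained; ScaleConfig and Scaler are hypothetical names introduced only for this illustration.

```scala
import scala.util.Try

object ETransformSketch {
  // Repeated from the definition above
  abstract class ETransform[T](val config: T) {
    type U
    type V
    def |> : PartialFunction[U, Try[V]]
  }

  // Hypothetical explicit model: a scaling factor and an offset
  case class ScaleConfig(factor: Double, offset: Double)

  // Hypothetical transformation that rescales a non-empty vector of values
  class Scaler(cfg: ScaleConfig) extends ETransform[ScaleConfig](cfg) {
    type U = Vector[Double]
    type V = Vector[Double]
    override def |> : PartialFunction[U, Try[V]] = {
      case xs if xs.nonEmpty =>
        Try(xs.map(x => x * config.factor + config.offset))
    }
  }

  val scaler = new Scaler(ScaleConfig(2.0, 1.0))
  // The partial function is obtained once and can be applied to many inputs
  val scale: PartialFunction[Vector[Double], Try[Vector[Double]]] = scaler.|>
  val out = scale(Vector(1.0, 2.0, 3.0))
}
```

The model is entirely described by ScaleConfig; nothing is learned from data, which is what makes the model explicit.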

Let's consider the extraction of data from a financial source, DataSource. It takes as input a list of functions that each convert some text fields, Fields, into a Double value, and produces a list of observations of the XSeries[Double] type. The extraction parameters are defined in the DataSourceConfig class:

class DataSource(
  config: DataSourceConfig,   //1
  srcFilter: Option[Fields => Boolean]= None)
        extends ETransform[DataSourceConfig](config) { //2
  type U = List[Fields => Double]   //3
  type V = List[XSeries[Double]]     //4
  override def |> : PartialFunction[U, Try[V]] = { //5
    case u: U if(!u.isEmpty) => … 
  }
}

The DataSourceConfig configuration is explicitly provided as an argument of the constructor for DataSource (line 1). The class extends ETransform and therefore inherits the abstract types and data transformation associated with an explicit model (line 2). The class defines the U type of input values (line 3), the V type of output values (line 4), and the |> transformation method that returns a partial function (line 5).

Note

The DataSource class

The Data extraction section of Appendix A, Basic Concepts describes the DataSource class functionality. The DataSource class is used throughout the book.

Data transformations using an explicit model or configuration constitute a category with monadic operations. The monad associated with the ETransform class extends the definition of the higher-kinded type, _Monad:

private val eTransformMonad = new _Monad[ETransform] {
  override def unit[T](t:T) = eTransform(t)   //6
  override def map[T,U](m: ETransform[T])     //7
      (f: T => U): ETransform[U] = eTransform( f(m.config) )
  override def flatMap[T,U](m: ETransform[T])  //8
      (f: T =>ETransform[U]): ETransform[U] = f(m.config)
}

The singleton eTransformMonad implements the following basic monadic operators introduced in the Monads section under Abstraction in Chapter 1, Getting Started:

  • The unit method is used to instantiate ETransform (line 6)
  • The map method transforms an ETransform object by applying a function to its configuration (line 7)
  • The flatMap method transforms an ETransform object by applying a function that itself returns an ETransform (line 8)

For practical purposes, an implicit class is created to convert an ETransform object to its associated monad, allowing transparent access to the unit, map, and flatMap methods:

implicit class eTransform2Monad[T](fct: ETransform[T]) {
  def unit(t: T) = eTransformMonad.unit(t)
  final def map[U](f: T => U): ETransform[U] = 
      eTransformMonad.map(fct)(f)
  final def flatMap[U](f: T => ETransform[U]): ETransform[U] =
      eTransformMonad.flatMap(fct)(f)
}
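The sketch below shows how these monadic methods enable a for comprehension over ETransform instances. The _Monad trait and the eTransform factory are not shown in this section, so the versions here are assumptions: _Monad is written as a plain higher-kinded trait, and eTransform wraps a configuration in an ETransform whose transformation is the identity.

```scala
import scala.util.Try

object ETransformMonadSketch {
  // Assumed shape of the _Monad trait
  trait _Monad[M[_]] {
    def unit[T](t: T): M[T]
    def map[T, U](m: M[T])(f: T => U): M[U]
    def flatMap[T, U](m: M[T])(f: T => M[U]): M[U]
  }

  abstract class ETransform[T](val config: T) {
    type U
    type V
    def |> : PartialFunction[U, Try[V]]
  }

  // Assumed factory: wraps a configuration with an identity transformation
  def eTransform[T](t: T): ETransform[T] = new ETransform[T](t) {
    type U = T
    type V = T
    override def |> : PartialFunction[U, Try[V]] = { case u => Try(u) }
  }

  val eTransformMonad = new _Monad[ETransform] {
    override def unit[T](t: T) = eTransform(t)
    override def map[T, U](m: ETransform[T])(f: T => U): ETransform[U] =
      eTransform(f(m.config))
    override def flatMap[T, U](m: ETransform[T])(f: T => ETransform[U]): ETransform[U] =
      f(m.config)
  }

  implicit class eTransform2Monad[T](fct: ETransform[T]) {
    def map[U](f: T => U): ETransform[U] = eTransformMonad.map(fct)(f)
    def flatMap[U](f: T => ETransform[U]): ETransform[U] =
      eTransformMonad.flatMap(fct)(f)
  }

  // The for comprehension is desugared into flatMap and map calls
  val combined: ETransform[Double] = for {
    rate  <- eTransform(0.5)
    steps <- eTransform(4)
  } yield rate * steps
}
```

Here the two configurations are combined into a new ETransform whose config is the product of the two values.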

Implicit models

Supervised learning models are extracted from a training set. Transformations such as classification or regression use the implicit models to process the input data, as illustrated in the following diagram:

[Figure: Visualization of implicit models]

The transformation for a model implicitly extracted from the training data is defined as an abstract ITransform class parameterized by the T type of observations, xt:

abstract class ITransform[T](val xt: Vector[T]) { //Model input
   type V   // type of output
   def |> : PartialFunction[T, Try[V]]  // data transformation
}

The type of the data collection is Vector, which is an immutable and efficient container. An ITransform type is created by defining the T type of the observation, the V output of the data transformation, and the |> method that implements the transformation, usually a classification or regression. Let's consider the support vector machine algorithm, SVM, to illustrate the implementation of a data transformation using an implicit model:

class SVM[T <: AnyVal]( //9
    config: SVMConfig,
    xt: Vector[Array[T]],
    expected: Vector[Double])(implicit f: T => Double)
  extends ITransform[Array[T]](xt) { //10

  type V = Double  //11
  override def |> : PartialFunction[Array[T], Try[V]] = { //12
    case x: Array[T] if(x.length == data.size) => ...
  }
}

The support vector machine is a discriminative supervised learning algorithm described in Chapter 8, Kernel Models and Support Vector Machines. A support vector machine, SVM, is instantiated with a configuration and training set: the xt observations and expected data (line 9). Contrary to the explicit model, the config configuration does not define the model used in the data transformation; the model is implicitly generated from the training set of the xt input data and expected values. An SVM instance is created as an ITransform (line 10) by specifying the V output type (line 11) and overriding the |> transformation method (line 12).

The |> classification method produces a partial function that takes an x observation as an input and returns the prediction value of a Double type.
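A much simpler transformation than SVM can illustrate the same pattern. In the following sketch, ITransform is repeated from the definition above, and MeanThreshold is a hypothetical classifier whose model (a single threshold) is derived from the training set when the instance is constructed.

```scala
import scala.util.Try

object ITransformSketch {
  // Repeated from the definition above
  abstract class ITransform[T](val xt: Vector[T]) {
    type V
    def |> : PartialFunction[T, Try[V]]
  }

  // Hypothetical classifier: returns 1.0 for values above the mean of the
  // training set and 0.0 otherwise
  class MeanThreshold(obs: Vector[Double]) extends ITransform[Double](obs) {
    require(obs.nonEmpty, "The training set must not be empty")
    type V = Double

    // The model is implicit: it is computed from the training data, not configured
    private val threshold: Double = xt.sum / xt.size

    override def |> : PartialFunction[Double, Try[V]] = {
      case x if !x.isNaN => Try(if (x > threshold) 1.0 else 0.0)
    }
  }

  val classifier = new MeanThreshold(Vector(1.0, 2.0, 3.0, 6.0))
  val classify: PartialFunction[Double, Try[Double]] = classifier.|>
  val high = classify(5.0)
  val low = classify(2.0)
}
```

Because threshold is a private val initialized in the constructor, the model cannot change after training; a different model requires a new instance built from a different training set.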

Similar to the explicit transformation, we define the monadic operations for the ITransform by overriding the unit (line 13), map (line 14), and flatMap (line 15) methods:

private val iTransformMonad = new _Monad[ITransform] {
  override def unit[T](t: T) = iTransform(Vector[T](t))  //13

  override def map[T,U](m: ITransform[T])(f: T => U): ITransform[U] =
    iTransform( m.xt.map(f) )   //14

  override def flatMap[T,U](m: ITransform[T])
      (f: T => ITransform[U]): ITransform[U] =
    iTransform(m.xt.flatMap(t => f(t).xt)) //15
}

Finally, let's create an implicit class to automatically convert an ITransform object into its associated monad so that it can access the unit, map, and flatMap monad methods transparently:

implicit class iTransform2Monad[T](fct: ITransform[T]) {
   def unit(t: T) = iTransformMonad.unit(t)
   
   final def map[U](f: T => U): ITransform[U] = 
      iTransformMonad.map(fct)(f)
   final def flatMap[U](f: T => ITransform[U]): ITransform[U] = 
      iTransformMonad.flatMap(fct)(f)
   def filter(p: T =>Boolean): ITransform[T] =  //16
      iTransform(fct.xt.filter(p))
}

Strictly speaking, the filter method is not a monadic operator (line 16). However, it is commonly included to constrain (or guard) a sequence of transformations (for example, in a for comprehension). As stated in the Presentation section under Source code in Chapter 1, Getting Started, code related to exceptions, error checking, and validation of arguments is omitted.
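The role of filter as a guard can be sketched as follows. The iTransform factory is an assumption (it wraps observations with an identity transformation), and the monadic methods are inlined in the implicit class to keep the example short. The Scala compiler desugars a guard in a for comprehension into a call to withFilter, so this sketch adds withFilter as an alias for filter.

```scala
import scala.util.Try

object ITransformMonadSketch {
  abstract class ITransform[T](val xt: Vector[T]) {
    type V
    def |> : PartialFunction[T, Try[V]]
  }

  // Assumed factory: wraps observations with an identity transformation
  def iTransform[T](data: Vector[T]): ITransform[T] = new ITransform[T](data) {
    type V = T
    override def |> : PartialFunction[T, Try[V]] = { case t => Try(t) }
  }

  implicit class iTransform2Monad[T](fct: ITransform[T]) {
    def map[U](f: T => U): ITransform[U] = iTransform(fct.xt.map(f))
    def flatMap[U](f: T => ITransform[U]): ITransform[U] =
      iTransform(fct.xt.flatMap(t => f(t).xt))
    def filter(p: T => Boolean): ITransform[T] = iTransform(fct.xt.filter(p))
    // Alias added for this sketch: guards are desugared into withFilter
    def withFilter(p: T => Boolean): ITransform[T] = filter(p)
  }

  // Keep the positive observations, then combine them with a second set
  val result: ITransform[Double] = for {
    x <- iTransform(Vector(1.0, -2.0, 3.0)) if x > 0.0
    y <- iTransform(Vector(10.0, 100.0))
  } yield x * y
}
```

The guard removes the negative observation before the second generator runs, so result.xt contains one product for each remaining pair of values.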

Note

Immutable transformations

The model for a data transformation (or a processing unit or classifier) class should be immutable. Any modification will alter the integrity of the model or parameters used to process data. In order to ensure that the same model is used in processing the input data for the entire lifetime of a transformation, we do the following:

  • A model for an ETransform is defined as an argument of its constructor.
  • The constructor of an ITransform generates the model from a given training set. The model has to be rebuilt from the training set (not altered), if it provides an incorrect outcome or prediction.

Models are created by the constructor of classifiers or data transformation classes to ensure their immutability. The design of an immutable transformation is described in the Design template for immutable classifiers section under Scala programming of Appendix A, Basic Concepts.
