Profiling data

The selection of a preprocessing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and, whenever available, expected values). The Step 3 – preprocessing the data section under Let's kick the tires in Chapter 1, Getting Started, introduced the MinMax class for normalizing a dataset using its minimum and maximum values.

Immutable statistics

The mean and standard deviation are the most commonly used statistics.

Note

Mean and variance

Arithmetic mean is defined as:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Variance is defined as:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$$

Variance adjusted for a sampling bias is defined as:

$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2$$
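The adjusted variance can also be written in terms of the sum of the values and the sum of their squares, which is the form a single-pass implementation can evaluate. The short derivation below is a sketch that assumes n observations x_1, ..., x_n with arithmetic mean μ; the symbol names are chosen here for illustration.

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \mu)^2
  &= \sum_{i=1}^{n} x_i^2 - 2\mu \sum_{i=1}^{n} x_i + n\mu^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\mu^2
  \qquad \text{since } \sum_{i=1}^{n} x_i = n\mu \\[4pt]
\hat{\sigma}^2
  &= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\mu^2\right)
\end{aligned}
```

This form needs only two running totals, but it subtracts two quantities of similar magnitude, so it may lose precision when the mean is large relative to the spread of the data.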

Let's extend the MinMax class with some basic statistics capabilities using Stats:

class Stats[T <: AnyVal](
     values: Vector[T])(implicit f: T => Double)
  extends MinMax[T](values) {

  val zero = (0.0, 0.0)
  // Accumulates (sum of values, sum of squared values) in a single pass
  val sums = values./:(zero)((s,x) =>(s._1 +x, s._2 + x*x)) //1
  
  lazy val mean = sums._1/values.size  //2
  // Variance adjusted for sampling bias (divides by n - 1)
  lazy val variance = 
         (sums._2 - mean*mean*values.size)/(values.size-1)
  lazy val stdDev = Math.sqrt(variance)
  …
}

The Stats class implements immutable statistics. Its constructor computes the sum of the values and the sum of their squares, sums (line 1). Statistics such as mean and variance are declared as lazy values, so each one is computed only once, when it is first needed (line 2). The Stats class inherits the normalization functions of MinMax.
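To make the eager sums and lazy statistics concrete, here is a minimal stand-alone sketch. It is not the book's Stats class: SimpleStats and SimpleStatsDemo are hypothetical names, the class works on Vector[Double] only, and it does not extend MinMax, so that it compiles on its own.

```scala
// Hypothetical stand-alone sketch; SimpleStats is not the book's Stats class.
final class SimpleStats(values: Vector[Double]) {
  require(values.size > 1, "at least two observations are needed")

  // Computed eagerly in the constructor: (sum of values, sum of squares).
  private val sums: (Double, Double) =
    values.foldLeft((0.0, 0.0)) { case ((s, sq), x) => (s + x, sq + x * x) }

  // Each lazy value is evaluated at most once, on first access.
  lazy val mean: Double = sums._1 / values.size
  lazy val variance: Double =
    (sums._2 - mean * mean * values.size) / (values.size - 1)
  lazy val stdDev: Double = math.sqrt(variance)
}

object SimpleStatsDemo {
  // Sum is 10 and sum of squares is 30 for this sample.
  val stats = new SimpleStats(Vector(1.0, 2.0, 3.0, 4.0))
}
```

For the sample above, stats.mean evaluates to 2.5 and stats.variance to 5/3. A caller that never reads stdDev never pays for the square root.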

Z-Score and Gauss

The Gaussian distribution of the input data is implemented by the gauss method of the Stats class.

Note

The Gaussian distribution

M1: The Gaussian transformation for a mean μ and a standard deviation σ is defined as:

$$x \rightarrow y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The code is as follows:

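// Assumed definition of the constant, 1/sqrt(2*pi); it is not shown in this excerpt
val INV_SQRT_2PI = 1.0/Math.sqrt(2.0*Math.PI)

def gauss(mu: Double, sigma: Double, x: Double): Double = {
   val y = (x - mu)/sigma
   INV_SQRT_2PI*Math.exp(-0.5*y*y)/sigma
}
// Standard normal distribution: mean 0.0, standard deviation 1.0
val normal = gauss(0.0, 1.0, _: Double)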

The normal distribution is obtained as a partially applied function of gauss, with the mean and standard deviation fixed. The Z-score normalizes the raw data by taking into account the mean and the standard deviation.
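The partial application can be tried in isolation with the sketch below. GaussSketch, standardNormal, and densities are hypothetical names, the INV_SQRT_2PI definition is an assumption, and the require guard against a non-positive sigma is an addition that is not part of the original method.

```scala
object GaussSketch {
  // Assumed definition of the normalization constant: 1/sqrt(2*pi).
  val INV_SQRT_2PI: Double = 1.0 / math.sqrt(2.0 * math.Pi)

  def gauss(mu: Double, sigma: Double, x: Double): Double = {
    // Added guard: a zero sigma would cause a division by zero.
    require(sigma > 0.0, "sigma must be strictly positive")
    val y = (x - mu) / sigma
    INV_SQRT_2PI * math.exp(-0.5 * y * y) / sigma
  }

  // Fixing mu and sigma leaves a function of x only.
  val standardNormal: Double => Double = gauss(0.0, 1.0, _: Double)

  // The partially applied function can be passed directly to map.
  val densities: Vector[Double] = Vector(-1.0, 0.0, 1.0).map(standardNormal)
}
```

Because standardNormal has the type Double => Double, it composes with any higher-order method that expects a function of one argument.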

Note

Z-score normalization

M2: The Z-score for a mean μ and a standard deviation σ is defined as:

$$x \rightarrow z = \frac{x - \mu}{\sigma}$$

The computation of the Z-score is implemented by the zScore method of Stats:

def zScore: DblVector = values.map(x => (x - mean)/stdDev )
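The effect of the Z-score can be checked on a small sample with the stand-alone sketch below. It is a hypothetical helper rather than the Stats method: it takes the values as an argument, recomputes the mean and the bias-adjusted standard deviation itself, and rejects constant input, for which the Z-score is undefined.

```scala
object ZScoreSketch {
  // Hypothetical helper; the zScore method of Stats reuses its lazy mean and stdDev.
  def zScore(values: Vector[Double]): Vector[Double] = {
    require(values.size > 1, "at least two observations are needed")
    val n = values.size
    val mean = values.sum / n
    // Bias-adjusted variance, dividing by n - 1.
    val variance = values.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    val stdDev = math.sqrt(variance)
    require(stdDev > 0.0, "Z-score is undefined for constant data")
    values.map(x => (x - mean) / stdDev)
  }

  val scaled: Vector[Double] = zScore(Vector(2.0, 4.0, 6.0, 8.0))
}
```

Up to floating-point rounding, the scaled values have a mean of 0 and a bias-adjusted variance of 1, whatever the scale of the original data.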

The following graph illustrates the relative behavior of the zScore normalization and normal transformation:

Figure: A comparative analysis of linear, Gaussian, and Z-score normalization
