The selection of a preprocessing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and expected values whenever available). The Step 3 – preprocessing the data section under Let's kick the tires in Chapter 1, Getting Started, introduced the MinMax class for normalizing a dataset using the minimum and maximum values.
The mean and standard deviation are the most commonly used statistics.
Let's extend the MinMax class with some basic statistics capabilities using Stats:
class Stats[T <: AnyVal](values: Vector[T])(implicit f: T => Double)
    extends MinMax[T](values) {
  val zero = (0.0, 0.0)
  val sums = values./:(zero)((s, x) => (s._1 + x, s._2 + x*x))  //1

  lazy val mean = sums._1/values.size  //2
  lazy val variance =
    (sums._2 - mean*mean*values.size)/(values.size - 1)
  lazy val stdDev = Math.sqrt(variance)
  …
}
The Stats class implements immutable statistics. Its constructor computes the sum of the values and the sum of the squared values in a single pass, storing both in sums (line 1). Statistics such as mean and variance are declared as lazy values, so each is computed only once, the first time it is needed (line 2). The Stats class inherits the normalization functions of MinMax.
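The single-pass computation can be tried in isolation. The following is a minimal sketch, not the book's class: it drops the MinMax parent and the generic type parameter, and the names StatsSketch and SimpleStats are hypothetical. It relies on the identity that the sum of squared deviations equals the sum of squares minus n times the squared mean, which is why only two running sums are needed.

```scala
// Minimal sketch under stated assumptions: no MinMax parent, Double values only.
// StatsSketch and SimpleStats are hypothetical names used for illustration.
object StatsSketch {
  final class SimpleStats(values: Vector[Double]) {
    require(values.size > 1, "at least two observations are needed for the sample variance")

    // Single pass over the data: (sum of values, sum of squared values)
    private val sums: (Double, Double) =
      values.foldLeft((0.0, 0.0)) { case ((s, sq), x) => (s + x, sq + x * x) }

    // Lazy values are evaluated on first access only, then cached
    lazy val mean: Double = sums._1 / values.size

    // Unbiased sample variance: (sum of squares - n * mean^2) / (n - 1)
    lazy val variance: Double =
      (sums._2 - mean * mean * values.size) / (values.size - 1)

    lazy val stdDev: Double = math.sqrt(variance)
  }

  // Sum is 40 and sum of squares is 232, so mean = 5 and variance = 32/7
  val stats = new SimpleStats(Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0))
}
```

Note that the two-sums formula can lose precision when the mean is large relative to the spread of the data; a two-pass or incremental update may be preferable in that case.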
The Gaussian distribution of the input data is implemented by the gauss method of the Stats class. The code is as follows:
def gauss(mu: Double, sigma: Double, x: Double): Double = {
  val y = (x - mu)/sigma
  INV_SQRT_2PI*Math.exp(-0.5*y*y)/sigma
}
// Standard normal distribution: mu = 0.0, sigma = 1.0
val normal = gauss(0.0, 1.0, _: Double)
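The normal distribution is defined as a partially applied function: the mean and standard deviation arguments of gauss are fixed, and only x is left open. The Z-score is a normalization of the raw data that subtracts the mean from each value and divides the result by the standard deviation.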
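To illustrate the partial application, here is a small self-contained sketch. It assumes that INV_SQRT_2PI denotes the constant 1/sqrt(2π), which the excerpt uses but does not define, and that the intended normal density is the standard one (mean 0, standard deviation 1). The object name GaussSketch and the guard on sigma are additions for this example.

```scala
// Sketch only: GaussSketch is a hypothetical wrapper object.
object GaussSketch {
  // Assumption: INV_SQRT_2PI is 1/sqrt(2*Pi); it is not defined in the excerpt
  val INV_SQRT_2PI: Double = 1.0 / math.sqrt(2.0 * math.Pi)

  // Gaussian probability density with mean mu and standard deviation sigma
  def gauss(mu: Double, sigma: Double, x: Double): Double = {
    require(sigma > 0.0, "sigma must be strictly positive")
    val y = (x - mu) / sigma
    INV_SQRT_2PI * math.exp(-0.5 * y * y) / sigma
  }

  // Partial application: mu and sigma are fixed, x is left as a placeholder,
  // which yields a function of type Double => Double
  val normal: Double => Double = gauss(0.0, 1.0, _: Double)

  // Density of the standard normal at its mean, equal to INV_SQRT_2PI
  val peak: Double = normal(0.0)
}
```

Because normal is an ordinary function value, it can be passed directly to higher-order methods, for example Vector(-1.0, 0.0, 1.0).map(GaussSketch.normal).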
The computation of the Z-score is implemented by the zScore method of Stats:
def zScore: DblVector = values.map(x => (x - mean)/stdDev )
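As a quick check of the normalization, the sketch below applies the same formula outside the Stats class. The free-standing zScore helper and the ZScoreSketch object are hypothetical, and the sketch uses Vector[Double] directly rather than the DblVector type from the text. It uses the sample standard deviation (division by n - 1), matching the variance defined earlier.

```scala
// Sketch only: in the text, zScore is a method of Stats, not a free function.
object ZScoreSketch {
  def zScore(values: Vector[Double]): Vector[Double] = {
    val n = values.size
    require(n > 1, "at least two observations are needed")

    val mean = values.sum / n
    // Sample variance with the n - 1 denominator
    val variance = values.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    val stdDev = math.sqrt(variance)
    require(stdDev > 0.0, "the Z-score is undefined for constant data")

    // Center each value on the mean and scale by the standard deviation
    values.map(x => (x - mean) / stdDev)
  }

  // mean = 4, sample variance = 4, stdDev = 2, so the scores are -1, 0, 1
  val scores: Vector[Double] = zScore(Vector(2.0, 4.0, 6.0))
}
```

The guard against a zero standard deviation is an addition for this example; the method shown in the text would return NaN values for a constant dataset.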
The following graph illustrates the relative behavior of the zScore normalization and normal transformation: