Numerical features

Let's explore the numerical features. Just like in Chapter 4, Preparing Data for Modeling, we will calculate some basic descriptive statistics for the numerical variables:

import pyspark.mllib.stat as st
import numpy as np

rdd_num = (
    census_subset
    .select(cols_num)
    .rdd
    .map(lambda row: [e for e in row])
)

stats_num = st.Statistics.colStats(rdd_num)

for col, min_, mean_, max_, var_ in zip(
    cols_num,
    stats_num.min(),
    stats_num.mean(),
    stats_num.max(),
    stats_num.variance()
):
    print('{0}: min->{1:.1f}, mean->{2:.1f}, max->{3:.1f}, stdev->{4:.1f}'
          .format(col, min_, mean_, max_, np.sqrt(var_)))

First, we further subset census_subset to include only the numerical columns. Next, we extract the underlying RDD. Since every element of this RDD is a row, we convert each one to a plain list using the .map(...) method so that we can work with it.
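To see what the lambda passed to .map(...) does to a single element, here is a minimal sketch that runs without Spark. It assumes a namedtuple as a stand-in for a Spark row (which also behaves like a tuple); the column names and values are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row; the column names are made up.
Row = namedtuple('Row', ['age', 'hours_per_week'])

rows = [Row(39, 40), Row(50, 13), Row(38, 40)]

# The same transformation as the lambda in .map(...): row -> plain list.
as_lists = [[e for e in row] for row in rows]

# list(row) is an equivalent, shorter spelling of the comprehension.
assert as_lists == [list(row) for row in rows]

print(as_lists)
```

In the Spark version the same function is applied to every row of the RDD in parallel rather than in a local loop.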

Now that we have our RDD ready, we simply call the .colStats(...) method from the statistics module of MLlib. .colStats(...) accepts an RDD of numeric values; these can be either lists or vectors (either dense or sparse, see the documentation on pyspark.mllib.linalg.Vectors at http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors). It returns a MultivariateStatisticalSummary (a trait in Spark's Scala API), which exposes, for each column, the count, max, mean, min, L1 and L2 norms, number of nonzero observations, and variance.
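To make these summary quantities concrete, the following sketch computes the same kinds of column-wise statistics in plain Python, without Spark. It is only an illustration, not MLlib's implementation: the col_stats helper, the dictionary keys, and the toy data are hypothetical, and the use of the sample variance (n - 1 denominator) is an assumption about how the summary defines variance.

```python
import math
import statistics

# Hypothetical toy data: each inner list is one row, each position one column.
rows = [
    [25.0, 0.0, 40.0],
    [39.0, 2000.0, 45.0],
    [90.0, 0.0, 20.0],
]


def col_stats(rows):
    """Column-wise summaries, similar in spirit to what .colStats(...) reports."""
    columns = list(zip(*rows))  # transpose: rows -> columns
    return {
        'count': len(rows),
        'min': [min(c) for c in columns],
        'max': [max(c) for c in columns],
        'mean': [statistics.mean(c) for c in columns],
        # Sample variance (n - 1 denominator); an assumption, see the lead-in.
        'variance': [statistics.variance(c) for c in columns],
        'normL1': [sum(abs(v) for v in c) for c in columns],
        'normL2': [math.sqrt(sum(v * v for v in c)) for c in columns],
        'numNonzeros': [sum(1 for v in c if v != 0) for c in columns],
    }


summary = col_stats(rows)

# Mirror the chapter's report: min, mean, max, and stdev (sqrt of variance).
for i, (min_, mean_, max_, var_) in enumerate(zip(
        summary['min'], summary['mean'], summary['max'], summary['variance'])):
    print('col{0}: min->{1:.1f}, mean->{2:.1f}, max->{3:.1f}, stdev->{4:.1f}'
          .format(i, min_, mean_, max_, math.sqrt(var_)))
```

Unlike this local version, .colStats(...) computes its summaries over a distributed RDD, which is why the chapter passes it an RDD rather than a local list.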

If you are familiar with C++ or Java, traits can be viewed as virtual classes (C++) or interfaces (Java). You can read more about traits at https://docs.scala-lang.org/tour/traits.html.

In our example, we only select the min, mean, max, and variance. Here's what we get back:

So, the average age is about 39 years. However, the dataset definitely contains an outlier: a respondent who is 90 years old. In terms of capital gain and loss, the census respondents seem to be gaining more money than they are losing. On average, the respondents worked 40 hours per week, but someone reported working close to 100 hours per week.
