How it works...

First, we import the linear algebra portion of MLlib; we will be using some matrix representations later. 

Next, we build a pivot table where we group by the occupation feature and pivot by the label column (either <=50K or >50K). Each occurrence is counted and this results in the following table:

Next, we flatten the output by accessing the underlying RDD and keeping only the counts with the map transformation: .map(lambda row: (row[1:])), which drops the occupation name from each row. The .flatMap(...) transformation then concatenates the per-row counts into one long list of all the values we need. Finally, we collect the data on the driver so that we can later create a DenseMatrix.
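To make the pivot-then-flatten steps concrete, here is a minimal plain-Python sketch of the same logic. It is a stand-in that uses no Spark API, and the records, the labels list, and all variable names are hypothetical toy values rather than the census data.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy records: (occupation, label) pairs, not the census data.
records = [
    ("Sales", "<=50K"), ("Sales", ">50K"), ("Sales", "<=50K"),
    ("Tech-support", ">50K"), ("Tech-support", ">50K"),
    ("Farming-fishing", "<=50K"),
]
labels = ["<=50K", ">50K"]

# Equivalent of grouping by occupation and pivoting by label with a count:
# one row per occupation, one count column per label.
# Counter returns 0 for pairs that never occur.
counts = Counter(records)
occupations = sorted({occ for occ, _ in records})
pivot = [(occ, *(counts[(occ, lab)] for lab in labels)) for occ in occupations]

# Equivalent of .map(lambda row: row[1:]): drop the occupation name.
counts_only = [row[1:] for row in pivot]

# Equivalent of .flatMap(...): concatenate the per-row counts into one list.
flat = list(chain.from_iterable(counts_only))

print(pivot)
print(flat)
```

Note that the flattened list is ordered row by row (both counts for the first occupation, then both for the second, and so on), which matters when the list is later turned into a matrix.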

You should be cautious about using the .collect(...) action since it brings all the data to the driver. Here it is safe, because we are only bringing back a heavily aggregated representation of our dataset: one row of counts per occupation.

Once we have all our numbers on the driver, we can create their matrix representation; we will have a matrix of 15 rows and 2 columns. First, we check how many distinct occupation values there are by checking the count of the census_occupation elements. Next, we call the DenseMatrix(...) constructor to create our matrix. The first parameter specifies the number of rows, the second one the number of columns. The third parameter specifies the data, and the final one indicates whether the data is transposed, that is, listed row by row rather than in the default column-by-column order. Our flattened list is ordered row by row, so the flag matters here. The dense representation looks as follows:

And in a more readable format (as a NumPy matrix), it looks like this:
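As a separate illustration of the last constructor parameter, the following plain-Python sketch shows how the same flat list of counts produces two different matrices depending on whether it is read column by column or row by row. The helper to_rows and the toy values are hypothetical and only mimic the layout convention; they are not part of MLlib.

```python
def to_rows(num_rows, num_cols, values, is_transposed=False):
    """Lay out a flat list as a list of rows.

    Values are read column by column by default; when is_transposed
    is True they are read row by row instead.
    """
    if len(values) != num_rows * num_cols:
        raise ValueError("values must have num_rows * num_cols entries")
    if is_transposed:
        # Row-major: each consecutive chunk of num_cols values is one row.
        return [values[r * num_cols:(r + 1) * num_cols] for r in range(num_rows)]
    # Column-major: each consecutive chunk of num_rows values is one column.
    return [[values[c * num_rows + r] for c in range(num_cols)]
            for r in range(num_rows)]

# Toy counts listed row by row: (1, 0), (2, 1), (0, 2).
flat = [1, 0, 2, 1, 0, 2]

row_major = to_rows(3, 2, flat, is_transposed=True)
col_major = to_rows(3, 2, flat, is_transposed=False)

print(row_major)  # [[1, 0], [2, 1], [0, 2]]
print(col_major)  # [[1, 1], [0, 0], [2, 2]]
```

Only the row-by-row reading recovers the original pairs of counts, which is why a list flattened row by row needs the transposed flag set.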

Now, we simply call .chiSqTest(...) and pass our matrix as its only parameter. What is left is to check pValue and whether the hypothesis stated in nullHypothesis was rejected or not:

So, as you can see, pValue is reported as 0.0, so we can reject the null hypothesis that the distribution of occupation is the same for those who earn more than $50,000 and those who earn less. The null hypothesis Spark reports is that the occurrence of the outcomes is statistically independent; since we reject it, we conclude that occupation and income are not independent. In other words, occupation is associated with income and should be a useful indicator of whether someone earns more than $50,000.
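For intuition about what the test computes, here is a minimal plain-Python sketch of a chi-square test of independence on a contingency table. It is not Spark's implementation; the function name and the 3 x 2 table of counts are hypothetical. The p-value line uses a closed form that holds only for 2 degrees of freedom, which is what a 3 x 2 table has.

```python
import math

def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows are occupations, columns are <=50K and >50K.
table = [[90, 10], [60, 40], [30, 70]]
statistic, dof = chi_square_independence(table)

# Assumption: with exactly 2 degrees of freedom, the chi-square survival
# function is exp(-x / 2). Other values of dof need a statistics library.
p_value = math.exp(-statistic / 2) if dof == 2 else None

print(statistic, dof, p_value)
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value. By the same degrees-of-freedom formula, a 15 x 2 matrix like the one in this recipe has 14 degrees of freedom.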
