How it works...

The preceding code is pretty self-explanatory. First, we select the feature of interest (in our case, the fuel economy).

Spark DataFrames do not have a native histogram method, which is why we switch to the underlying RDD.

Next, we flatten our results into a plain list of values (instead of Row objects) and use the .histogram(...) method to calculate our histogram.

The .histogram(...) method accepts either an integer that specifies the number of evenly spaced buckets to allocate our data to, or a sorted list of explicit bucket boundaries.

Check out PySpark's documentation on the .histogram(...) method at https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.histogram.

The method returns a tuple of two elements: the first element is a list of bin bounds, and the other element is the counts of elements in the corresponding bins. Here's what this looks like for our fuel economy feature:

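Note that we asked the .histogram(...) method to bucketize our data into five bins, yet there are six elements in the first list. That is because the list holds bin boundaries, and n bins need n + 1 boundaries. We still have five buckets in our dataset: [8.97, 12.38), [12.38, 15.78), [15.78, 19.19), [19.19, 22.59), and [22.59, 26.0]. Every bucket is open on the right except the last one, which is closed so that the maximum value is counted.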
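To make the bucket rule concrete, here is a minimal pure-Python sketch of the same idea. It is not Spark's implementation: the histogram function and the mpg sample values below are hypothetical, written only to mirror the documented behavior (an integer gives evenly spaced buckets between the minimum and the maximum, a list gives explicit boundaries, and only the last bucket is closed on the right).

```python
from bisect import bisect_right


def histogram(values, buckets):
    """Illustrative stand-in for the bucketing rule; not Spark's code."""
    values = list(values)
    if isinstance(buckets, int):
        # n evenly spaced buckets need n + 1 boundaries
        lo, hi = min(values), max(values)
        step = (hi - lo) / buckets
        edges = [lo + i * step for i in range(buckets)] + [hi]
    else:
        # explicit, sorted bucket boundaries
        edges = list(buckets)

    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # value falls outside every bucket
        # index of the last boundary <= v; the min(...) folds a value equal
        # to the final boundary into the last (closed) bucket
        idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return edges, counts


# Made-up fuel economy values, only for illustration
mpg = [9.0, 12.0, 14.5, 15.0, 18.0, 21.0, 24.0, 26.0]

edges, counts = histogram(mpg, 5)                   # five evenly spaced buckets
edges_explicit, counts_explicit = histogram(mpg, [0, 15, 30])  # explicit bounds
```

With the integer form, edges has six entries and counts has five; the maximum value (26.0) lands in the last bucket rather than being dropped.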

We cannot create any plots in PySpark natively without going through a lot of setup (see, for example, https://plot.ly/python/apache-spark/). The easier way is to prepare a DataFrame with our data and use some magic (well, sparkmagic, but it still counts!) locally on the driver.

First, we need to extract our data and create a temporary histogram_MPG table:

# One (lower bound, count) row per bin
(
    spark
    .createDataFrame(
        list(zip(histogram_MPG[0], histogram_MPG[1])),
        ['bins', 'counts']
    )
    .registerTempTable('histogram_MPG')
)

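We create a two-column DataFrame where the first column contains the bin lower bound and the second column contains the corresponding count. Because zip(...) stops at the shorter of its inputs, the sixth boundary (the upper limit of the last bin) is dropped. Note that .registerTempTable(...) has been deprecated since Spark 2.0 in favor of .createOrReplaceTempView(...). Either way, the method (as the name suggests) registers a temporary table so we can use it with the %%sql magic: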

%%sql -o hist_MPG -q
SELECT * FROM histogram_MPG

The preceding command selects all the records from our temporary histogram_MPG table and outputs them to the locally accessible hist_MPG variable (a pandas DataFrame); the -q switch is there so nothing gets printed out to the notebook.
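To see why the temporary table ends up with five rows even though .histogram(...) returned six bin bounds, here is a small plain-Python sketch of the pairing step. The bounds are the ones quoted earlier; the counts are made-up placeholders, not the recipe's actual output.

```python
# Bin bounds quoted in the text above
edges = [8.97, 12.38, 15.78, 19.19, 22.59, 26.0]

# Hypothetical counts, only for illustration
counts = [10, 20, 30, 25, 15]

# zip(...) stops at the shorter input, so each count is paired with the
# LOWER bound of its bin and the final upper bound (26.0) is left out
rows = list(zip(edges, counts))

# The upper bounds can still be recovered by shifting the list by one
upper_bounds = edges[1:]
```

Each tuple in rows corresponds to one record of the histogram_MPG table: a bins value and a counts value.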

With hist_MPG locally accessible, we can now use it to produce our plot:

%%local
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

fig = plt.figure(figsize=(12,9))
ax = fig.add_subplot(1, 1, 1)
ax.bar(hist_MPG['bins'], hist_MPG['counts'], width=3)
ax.set_title('Histogram of fuel economy')

%%local executes everything in that notebook cell locally, in the notebook's own Python session, rather than on the Spark cluster. First, we import the matplotlib library and specify that it should render plots inline within the notebook instead of popping up a new window each time a plot is produced. plt.style.use(...) changes the style of our charts.

For a full list of available styles, check out https://matplotlib.org/devdocs/gallery/style_sheets/style_sheets_reference.html.

Next, we create a figure and add a subplot to it that we will be drawing in. Finally, we use the .bar(...) method to plot our histogram and set the title. Here's what the chart looks like:

That's it!
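One optional refinement to the plot: the recipe passes a fixed width=3 to .bar(...), and by default matplotlib centers each bar on its x value, so the bars sit centered on the lower bounds and are slightly narrower than the bins. As a sketch, assuming you still have the six bin bounds available locally, the exact left edges and widths can be derived like this (the variable names are illustrative):

```python
# Bin bounds quoted earlier in this recipe
edges = [8.97, 12.38, 15.78, 19.19, 22.59, 26.0]

# Left edge of every bar is the lower bound of its bin
lefts = edges[:-1]

# Width of every bar is the distance to the next bound
widths = [round(hi - lo, 2) for lo, hi in zip(edges, edges[1:])]
```

These values could then be passed to matplotlib as ax.bar(lefts, counts, width=widths, align='edge'), so that each bar spans its bin exactly.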
