There's more...

Let's now compare the performance of the two approaches:

# vectorized version using the pandas_pdf UDF
def test_pandas_pdf():
    return (big_df
            .withColumn('probability', pandas_pdf(big_df.val))
            .agg(f.count(f.col('probability')))
            .show()
    )

%timeit -n 1 test_pandas_pdf()

# row-by-row version with Python-JVM conversion
@f.udf('double')
def pdf(v):
    return float(stats.norm.pdf(v))

def test_pdf():
    return (big_df
            .withColumn('probability', pdf(big_df.val))
            .agg(f.count(f.col('probability')))
            .show()
    )

%timeit -n 1 test_pdf()

The test_pandas_pdf() function uses the vectorized pandas_pdf(...) UDF to compute the PDF of the normal distribution for each value, counts the results with .agg(f.count(...)), and prints the outcome using the .show(...) method. The test_pdf() function does the same but uses the pdf(...) UDF instead, which is the row-by-row way of using UDFs.

The %timeit magic command, called with -n 1, executes test_pandas_pdf() or test_pdf() once per run and, by default, repeats the run seven times. Here's a sample output (abbreviated because, as you might expect, it is highly repetitive) for running the test_pandas_pdf() function:

The timings for the test_pdf() method are quoted as follows:

As you can see, the vectorized UDF is roughly 100x faster than the row-by-row UDF in this test. Don't get too excited, though: speedups of this size are expected only for more complex queries, such as the one we used previously.
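
If you are running outside IPython or Jupyter, where %timeit is not available, a similar measurement can be sketched with the timeit module from the standard library. This is only an approximation of what the magic command does: the workload() function below is a hypothetical stand-in for test_pandas_pdf() or test_pdf(), so that the snippet runs without a Spark session.

```python
import timeit


def workload():
    # Hypothetical stand-in for test_pandas_pdf() or test_pdf();
    # swap in the real function when a Spark session is available.
    return sum(i * i for i in range(100_000))


# number=1 mirrors the "-n 1" flag (one call per run);
# repeat=7 mirrors %timeit's default of seven runs.
timings = timeit.repeat(workload, number=1, repeat=7)

print(f"best of {len(timings)} runs: {min(timings):.4f} s")
print(f"mean of {len(timings)} runs: {sum(timings) / len(timings):.4f} s")
```

Comparing the two sets of timings, one for each function, gives the same kind of comparison as the two %timeit calls shown earlier.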
