Let's now compare the performance of the two approaches:
def test_pandas_pdf():
    return (big_df
        .withColumn('probability', pandas_pdf(big_df.val))
        .agg(f.count(f.col('probability')))
        .show()
    )

%timeit -n 1 test_pandas_pdf()

# row-by-row version with Python-JVM conversion
@f.udf('double')
def pdf(v):
    return float(stats.norm.pdf(v))

def test_pdf():
    return (big_df
        .withColumn('probability', pdf(big_df.val))
        .agg(f.count(f.col('probability')))
        .show()
    )

%timeit -n 1 test_pdf()
The test_pandas_pdf() function uses pandas_pdf(...) to compute the PDF of the normal distribution for each value, counts the resulting rows with .count(...), and prints the result with .show(...). The test_pdf() function does the same but uses pdf(...) instead, which is the row-by-row UDF.
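The difference between the two UDF styles can be illustrated without Spark. The following is only a minimal sketch: vals is a hypothetical stand-in for the big_df.val column, and pdf_row_by_row and pdf_vectorized are illustrative names rather than the functions defined earlier. It shows that calling stats.norm.pdf once per value and calling it once on a whole batch give the same numbers; only the number of Python-level calls differs.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the big_df.val column (an assumption, not the book's data)
vals = pd.Series(np.linspace(-3.0, 3.0, 1001))

def pdf_row_by_row(series):
    # One scipy call per element, mirroring how the plain UDF is invoked
    return series.apply(lambda v: float(stats.norm.pdf(v)))

def pdf_vectorized(series):
    # One scipy call for the whole batch, mirroring how a pandas UDF is invoked
    return pd.Series(stats.norm.pdf(series.to_numpy()), index=series.index)

row_result = pdf_row_by_row(vals)
vec_result = pdf_vectorized(vals)

# Both approaches should agree up to floating-point tolerance
print(np.allclose(row_result.to_numpy(), vec_result.to_numpy()))
```

In Spark, the row-by-row version additionally pays a serialization cost for every value, which this local sketch does not model.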
The %timeit magic command times each function. With -n 1, it executes one loop per run and, by default, repeats the run seven times, so test_pandas_pdf() or test_pdf() is called seven times in total. Here's a sample output (abbreviated because, as you might have expected, it is highly repetitive) for running the test_pandas_pdf() function:
The timings for the test_pdf() function are quoted as follows:
As you can see, the vectorized UDF is roughly 100x faster in this test! Don't get too excited, though: speedups of this size are only expected for more complex queries, such as the one we used previously.
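The way %timeit -n 1 repeats a call can be approximated outside a notebook with the standard library's timeit module. This is a rough sketch, not what IPython does internally: workload is a hypothetical stand-in for test_pandas_pdf() or test_pdf(), and repeat=7 is set explicitly to match the default number of runs assumed above.

```python
import timeit

def workload():
    # Hypothetical stand-in for test_pandas_pdf() or test_pdf()
    return sum(i * i for i in range(10_000))

# repeat=7 runs with number=1 call per run, comparable to `%timeit -n 1`
timings = timeit.repeat(workload, repeat=7, number=1)

mean = sum(timings) / len(timings)
print(f"{len(timings)} runs, 1 loop each: best {min(timings):.6f}s, mean {mean:.6f}s")
```

Each entry in timings is the wall-clock time of a single call, so comparing the means for two functions gives the same kind of ratio as the speedup quoted above.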