First, as always, we import all the modules we will need to run this example:
- pyspark.sql.functions gives us access to PySpark SQL functions. We will use its .rand() function to populate our DataFrame with random numbers.
- The pandas framework gives us access to the .Series(...) datatype so we can return a column from our UDF.
- scipy.stats gives us access to statistical methods. We will use it to calculate the normal PDF for our random numbers.
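A minimal sketch of the import block described above, assuming the conventional aliases (the exact names in the original listing may differ):

```python
# Imports assumed by this example (aliases are the conventional ones).
import pandas as pd        # pandas .Series(...) for the UDF return value
import scipy.stats as st   # normal-distribution methods, e.g. st.norm.pdf

# In the Spark application itself, you would also import:
# import pyspark.sql.functions as f
```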
Next, we create our big_df. SparkSession has a convenience method, .range(...), which allows us to create a range of numbers within specified bounds; in this example, we simply create a DataFrame with one million records.
In the next line, we add another column to our DataFrame using the .withColumn(...) method; the column's name is val and it will be populated with pseudo-random numbers via the .rand() function.
Finally, we .cache() the DataFrame so it remains fully in memory (to speed up subsequent operations).
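The three steps above chain together as spark.range(...), .withColumn(...) with f.rand(), and .cache(). Since those calls require a live SparkSession, the sketch below shows the Spark version as a comment and builds the same data with plain pandas (the name big_df comes from the text; everything else is illustrative):

```python
import numpy as np
import pandas as pd

# Spark version (sketch; requires a live SparkSession named `spark`):
#   big_df = spark.range(0, 1000000).withColumn('val', f.rand()).cache()

# Plain-pandas analogue: one million row ids, each with a uniform
# random value in [0, 1).
big_df = pd.DataFrame({'id': np.arange(1_000_000)})
big_df['val'] = np.random.rand(len(big_df))
```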
Next, we define the pandas_pdf(...) method. Note the @f.pandas_udf decorator preceding the method's declaration: it is key to registering a vectorized UDF in PySpark and only became available in Spark 2.3.
The first parameter to the decorator is the return type of the UDF, in our case a double. This can be either a DDL-formatted type string or a pyspark.sql.types.DataType. The second parameter is the function type; if we return a single column from our method (such as a pandas .Series(...), as in our example), it will be .PandasUDFType.SCALAR (the default). If, on the other hand, we operate on multiple columns (such as a pandas DataFrame(...)), we would specify .PandasUDFType.GROUPED_MAP.
Our pandas_pdf(...) method simply accepts a single column and returns a pandas .Series(...) object with the corresponding values of the normal PDF.
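A sketch of what that UDF might look like: the vectorized core runs with plain pandas and scipy, while the Spark-only decorator (which needs a live session) is shown as a comment.

```python
import pandas as pd
import scipy.stats as st

# In Spark 2.3+ this function would be decorated to register it as a
# vectorized (pandas) UDF returning doubles, e.g.:
#   @f.pandas_udf('double', f.PandasUDFType.SCALAR)
def pandas_pdf(v):
    # v arrives as a whole pandas Series (one batch of the column);
    # st.norm.pdf is vectorized, so the standard normal PDF is computed
    # for every element in a single call.
    return pd.Series(st.norm.pdf(v))
```

Because the function receives and returns whole Series, Spark can hand it large Arrow batches instead of invoking it once per row, which is where the speed-up over row-wise UDFs comes from.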
Finally, we simply use the new method to transform our data. Here's what the top five records look like (yours will most likely look different, since we are generating one million random numbers):
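In Spark, that transformation would be a single .withColumn(...) call followed by .show(5); the column name probability below is an assumption. A self-contained pandas/scipy sketch of the same transform, on a small sample:

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Spark version (sketch; `big_df` and `pandas_pdf` as defined earlier):
#   big_df.withColumn('probability', pandas_pdf(big_df.val)).show(5)

# Plain-pandas analogue of the same transformation:
big_df = pd.DataFrame({'id': range(5), 'val': np.random.rand(5)})
big_df['probability'] = st.norm.pdf(big_df['val'])
print(big_df.head(5))
```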