How it works...

First, as always, we import all the modules we will need to run this example:

  • pyspark.sql.functions gives us access to PySpark SQL functions. We will use it to create our DataFrame with random numbers.
  • The pandas framework will give us access to the .Series(...) datatype so we can return a column from our UDF.
  • scipy.stats gives us access to statistical methods. We will use it to calculate the normal PDF for our random numbers.

Next, we create our big_df. SparkSession has a convenience method, .range(...), which allows us to create a range of numbers within specified bounds; in this example, we simply create a DataFrame with one million records.

In the next line, we add another column to our DataFrame using the .withColumn(...) method; the column's name is val and it is populated with one million pseudo-random numbers generated by .rand().

The .rand() method returns pseudo-random numbers drawn from a uniform distribution over the interval [0.0, 1.0).

Finally, we .cache() the DataFrame so it remains fully in memory, which speeds up subsequent computations.

Next, we define the pandas_pdf(...) method. Note the @f.pandas_udf decorator preceding the method's declaration, as this is key to registering a vectorized UDF in PySpark; it became available only in Spark 2.3.

Note that we did not have to decorate our method; we could have instead registered our vectorized method with f.pandas_udf(f=pandas_pdf, returnType='double', functionType=f.PandasUDFType.SCALAR).

The first parameter to the decorator is the return type of the UDF, in our case a double. This can be either a DDL-formatted type string or a pyspark.sql.types.DataType. The second parameter is the function type: if our method returns a single column (such as a pandas .Series(...), as in our example), it is .PandasUDFType.SCALAR (the default). If, on the other hand, we operate on multiple columns (such as a pandas DataFrame(...)), we would specify .PandasUDFType.GROUPED_MAP.

Our pandas_pdf(...) method simply accepts a single column and returns a pandas .Series(...) object containing the corresponding normal PDF values.
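Stripped of the decorator, the core logic is plain pandas and SciPy; a minimal sketch (the function body mirrors what the recipe describes):

```python
import pandas as pd
import scipy.stats as stats

def pandas_pdf(v):
    # Evaluate the standard normal PDF element-wise and
    # return the result as a pandas Series
    return pd.Series(stats.norm.pdf(v))
```

Decorated with @f.pandas_udf('double'), this same function becomes a vectorized UDF: Spark ships whole batches of val to it as pandas Series, rather than calling it once per row.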

Finally, we simply use the new method to transform our data. Here's what the top five records look like (yours will most likely look different, since we are generating one million random numbers):
