Calculating averages with map and reduce

We will be answering the following three main questions in this section:

  • How do we calculate averages?
  • What is a map?
  • What is reduce?

You can check the documentation at https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=map#pyspark.RDD.map

The map function takes two arguments, one of which is optional. The first argument to map is f, a function that is applied to each element of the RDD. The second argument, or parameter, is preservesPartitioning, which is False by default.

If we look at the documentation, it says that map returns a new RDD by applying a function to each element of this RDD; that function is the f we pass into map. The documentation gives a very simple example: if we parallelize a list of three strings, b, a, and c, and we map a function that turns each element into a tuple, we get a list of three tuples, where the original string is the first element of each tuple and the integer 1 is the second, as follows:

rdd = sc.parallelize(["b", "a", "c"])
sorted(rdd.map(lambda x: (x, 1)).collect())

This will give us the following output:

[('a', 1), ('b', 1), ('c', 1)]
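
Returning to the second parameter, here is a minimal hedged sketch (not taken from the documentation example) of when preservesPartitioning matters: if the mapped function leaves the keys of a key-value RDD untouched, passing preservesPartitioning=True lets Spark keep the existing partitioner instead of discarding it. The SparkContext.getOrCreate() call is only needed if you are running this outside the Jupyter Notebook setup used in this book, where sc already exists:

from pyspark import SparkContext

# Reuse the existing SparkContext if one is already running.
sc = SparkContext.getOrCreate()

# A key-value RDD explicitly partitioned by key into two partitions.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)]).partitionBy(2)

# The function only changes the values, so the partitioning by key
# remains valid and we can ask Spark to preserve it.
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2), preservesPartitioning=True)
print(doubled.collect())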

The reduce function takes only one argument, f, which is a function that combines two elements of the RDD into one. From a technical point of view, f must be a commutative and associative binary operator, and reduce applies it repeatedly until the elements of the RDD have been reduced to a single value.
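
As a quick illustrative sketch before we move to the KDD data, reducing a small RDD of numbers with addition looks like this; addition is both commutative and associative, so it is a valid operator for reduce:

numbers = sc.parallelize([1, 2, 3, 4, 5])
# reduce repeatedly applies the binary operator until one value remains.
numbers.reduce(lambda x, y: x + y)

This gives us 15, the sum of all the elements.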

Let's take an example using the KDD data we have been using. We launch our Jupyter Notebook instance that links to a Spark instance, as we have done previously. We then create a raw_data variable by loading a kddcup.data.gz text file from the local disk as follows:

raw_data = sc.textFile("./kddcup.data.gz")

The next thing to do is to split each line of this file on commas to get its individual fields, and then filter for the rows where feature 41 is labeled normal. (note the trailing period in the data):

csv = raw_data.map(lambda x: x.split(","))
normal_data = csv.filter(lambda x: x[41]=="normal.")
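
As a small, optional aside not in the original walkthrough, we can peek at the parsed data to confirm that the split and filter behave as expected:

# Inspect the first parsed row, then count how many rows carry the normal. label.
csv.take(1)
normal_data.count()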

Then we use the map function to convert the duration (feature 0) of each row into an integer, and finally we use the reduce function to compute total_duration, which we then print as follows:

duration = normal_data.map(lambda x: int(x[0]))
total_duration = duration.reduce(lambda x, y: x+y)
total_duration

We will then get the following output:

211895753
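
As a brief equivalent sketch, not a change to the approach above, the same sum can also be expressed with Python's built-in operator.add instead of a lambda:

from operator import add

# add(x, y) is the same commutative, associative operator as the lambda above.
total_duration = duration.reduce(add)
total_duration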

The next thing to do is to divide total_duration by the count of the records to get the average duration, as follows:

total_duration/(normal_data.count())

This will give us the following output:

217.82472416710442
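
As an alternative sketch, PySpark RDDs of numbers also expose a mean method, which computes the average in a single call:

# mean() averages the values in the duration RDD directly.
duration.mean()

This should give the same value as dividing total_duration by the count.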

And so, with a little computation using map and reduce, we have calculated an average. We have just learned how to calculate averages with PySpark, and what the map and reduce functions are in PySpark.
