reduceByKey

groupByKey involves a lot of shuffling; reduceByKey tends to improve performance because it does not send every element of the pair RDD across the network. Instead, it uses a local combiner to perform some basic aggregation on each node first, and only then shuffles the resulting elements the way groupByKey does. This greatly reduces the data transferred, as we don't need to send everything over. reduceByKey works by merging the values for each key using an associative and commutative reduce function. Naturally, this merging is first performed locally on each mapper before the results are sent to a reducer.

If you are familiar with Hadoop MapReduce, this is very similar to a combiner in MapReduce programming.

reduceByKey can be invoked with a custom partitioner, with an explicit number of partitions, or with the default HashPartitioner, as shown in the following signatures:

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(func: (V, V) => V): RDD[(K, V)]
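To make the contract of these overloads concrete, here is a minimal, Spark-free sketch in plain Scala. The helper `localReduceByKey` is my own illustrative name, not a Spark API; it computes the same result `reduceByKey(func)` would produce, given an associative and commutative `func`:

```scala
object ReduceByKeySemantics {
  // Mimics the result of RDD.reduceByKey(func): merge all values
  // for each key using an associative and commutative function.
  def localReduceByKey[K, V](pairs: Seq[(K, V)], func: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(func) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5))
    val summed = localReduceByKey(pairs, (x: Int, y: Int) => x + y)
    println(summed) // sums per key: a -> 4, b -> 6, c -> 5
  }
}
```

Because `func` must be associative and commutative, Spark is free to apply it in any order and in any grouping, which is exactly what allows the map-side combine described next.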

reduceByKey works by routing the elements of each partition to a target partition chosen by the partitioner, so that all (key, value) pairs with the same key are collected in the same partition. Before this shuffle, however, local aggregation is performed, reducing the amount of data that has to be shuffled. Once the shuffle is done, the final aggregation for each key can be completed easily in its target partition.
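This two-phase behavior can be sketched without a Spark cluster. In the plain-Scala sketch below, each partition first runs a map-side combine, so only one record per key per partition crosses the "shuffle" boundary before the reduce side merges them. The partition layout and helper names are illustrative only, not Spark internals:

```scala
object MapSideCombineSketch {
  // Phase 1: combine locally within one partition (the map-side combiner)
  def combineLocally[K, V](partition: Seq[(K, V)], func: (V, V) => V): Map[K, V] =
    partition.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(func) }

  // Phase 2: merge the pre-combined outputs across partitions (the reduce side)
  def mergeAcross[K, V](combined: Seq[Map[K, V]], func: (V, V) => V): Map[K, V] =
    combineLocally(combined.flatMap(_.toSeq), func)

  def main(args: Array[String]): Unit = {
    val add = (x: Int, y: Int) => x + y
    // Two illustrative partitions of (word, 1) pairs
    val p1 = Seq(("a", 1), ("a", 1), ("b", 1))
    val p2 = Seq(("a", 1), ("b", 1), ("b", 1))

    val locallyCombined = Seq(p1, p2).map(combineLocally(_, add))
    // Only 4 records cross the shuffle instead of the original 6
    println(locallyCombined.map(_.size).sum) // 4

    println(mergeAcross(locallyCombined, add)) // counts: a -> 3, b -> 3
  }
}
```

With groupByKey, all six raw pairs would have been shuffled; here only the four pre-combined records are, and the gap widens as the number of duplicate keys per partition grows.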

The following diagram is an illustration of what happens when reduceByKey is called:
