Deeper into Spark

In this section, we will introduce advanced features of Spark, namely the use of shared variables (both broadcast variables and accumulators), and discuss the concepts underlying them. Data partitioning will be discussed in later chapters.

Shared variables

The concept of shared variables is not new in parallel programming. Variables that need to be used by many functions and methods in parallel are called shared variables, and Spark provides mechanisms for working with them. When a function is passed to a Spark operation such as map or reduce, it is executed on remote cluster nodes. The code works on separate copies of the variables on each node, and no updates to those copies are propagated back to the driver program. Because supporting general read-write shared variables across tasks would be inefficient in this model, Spark instead provides two types of shared variables for two common usage patterns: broadcast variables and accumulators.
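
To see why an ordinary variable is not enough, consider the following minimal sketch (it assumes an existing JavaSparkContext named sc; the values are made up for illustration). The counter below is captured in the task closure, so each executor works on its own serialized copy and the driver never sees the updates:

import java.util.Arrays;

// A plain variable captured by a task closure: each executor receives and
// mutates its own serialized copy, so the driver-side value stays unchanged.
int[] counter = {0};  // wrapped in an array so the lambda can mutate it
sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> counter[0] += x);
System.out.println(counter[0]);  // on a cluster, typically prints 0, not 10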

Broadcast variables

Broadcast variables provide the facility to keep a read-only variable cached on each machine, rather than shipping a copy of it to the computing nodes with every task. They give every node a copy of a large input dataset in an efficient manner, and they also reduce communication costs because Spark distributes them using efficient broadcast algorithms. A broadcast variable can be created from a variable v by calling SparkContext.broadcast(v). The following code shows this:

import org.apache.spark.broadcast.Broadcast;

// Create a broadcast variable from an array and read its value back
Broadcast<int[]> broadcastVariable = sc.broadcast(new int[] {2, 3, 4});
int[] values = broadcastVariable.value();
for (int i : values) {
    System.out.println(i);
}
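
In practice, a broadcast variable is most useful when it is referenced inside a distributed operation. The following sketch (again assuming an existing JavaSparkContext named sc; the lookup table and its contents are invented for illustration) broadcasts a small map once and reads the cached copy inside a map transformation:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.spark.broadcast.Broadcast;

// Broadcast a small lookup table once, then reference it inside a transformation
Map<Integer, String> lookup = new HashMap<>();
lookup.put(1, "one");
lookup.put(2, "two");
Broadcast<Map<Integer, String>> broadcastLookup = sc.broadcast(lookup);

List<String> names = sc.parallelize(Arrays.asList(1, 2, 1))
    .map(id -> broadcastLookup.value().get(id))  // read the cached copy on each node
    .collect();
System.out.println(names);  // [one, two, one]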

Accumulators

Accumulators are another type of shared variable that can be used to implement counters (as in MapReduce) or sums. Out of the box, Spark supports accumulators of numeric types only; however, you can add support for new data types yourself [1]. An accumulator is created from an initial value, say val, by calling:

SparkContext.accumulator(val)

The following code shows the use of an accumulator for adding up the elements of an array:

import java.util.Arrays;
import org.apache.spark.Accumulator;

// Create an accumulator with initial value 0 and add every RDD element to it
Accumulator<Integer> accumulator = sc.accumulator(0);
sc.parallelize(Arrays.asList(1, 5, 3, 4))
  .foreach(x -> accumulator.add(x));
System.out.println(accumulator.value());  // prints 13
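
As mentioned previously, accumulators are not limited to numeric types. Under the accumulator API used in this section's examples (the Spark 1.x API), you can supply your own AccumulatorParam that tells Spark how to combine values. The following sketch (the class and variable names are invented for illustration, and an existing JavaSparkContext named sc is assumed) accumulates double[] vectors by element-wise addition:

import java.util.Arrays;
import org.apache.spark.Accumulator;
import org.apache.spark.AccumulatorParam;

// A custom AccumulatorParam that sums double[] vectors element-wise
class VectorAccumulatorParam implements AccumulatorParam<double[]> {
  public double[] zero(double[] initialValue) {
    return new double[initialValue.length];       // the identity element
  }
  public double[] addInPlace(double[] v1, double[] v2) {
    for (int i = 0; i < v1.length; i++) v1[i] += v2[i];
    return v1;                                    // merge two partial results
  }
  public double[] addAccumulator(double[] acc, double[] v) {
    return addInPlace(acc, v);                    // add one value to the accumulator
  }
}

Accumulator<double[]> vecAccum =
    sc.accumulator(new double[] {0.0, 0.0}, new VectorAccumulatorParam());
sc.parallelize(Arrays.asList(new double[] {1.0, 2.0}, new double[] {3.0, 4.0}))
  .foreach(v -> vecAccum.add(v));
System.out.println(Arrays.toString(vecAccum.value()));  // [4.0, 6.0]

Here, zero supplies the identity element for the type, while addInPlace defines how partial results from different tasks are merged.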

There are many more types and methods in the Spark APIs that are worth knowing. However, a more detailed discussion is beyond the scope of this book.

Tip

Interested readers should refer to Spark and related materials on the following web pages:

Spark programming guide: http://spark.apache.org/docs/latest/programming-guide.html.

Spark RDD operations: http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.

Spark SQL programming guide: http://spark.apache.org/docs/latest/sql-programming-guide.html.
