Starting Spark shell

The first step is to prepare the Spark environment to perform the analysis. As in the previous chapter, we are going to start the Spark shell; however, in this case, the command line is slightly more complicated:

export SPARKLING_WATER_VERSION="2.1.12"
export SPARK_PACKAGES="ai.h2o:sparkling-water-core_2.11:${SPARKLING_WATER_VERSION},\
ai.h2o:sparkling-water-repl_2.11:${SPARKLING_WATER_VERSION},\
ai.h2o:sparkling-water-ml_2.11:${SPARKLING_WATER_VERSION},\
com.packtpub:mastering-ml-w-spark-utils:1.0.0"
 
$SPARK_HOME/bin/spark-shell \
  --master 'local[*]' \
  --driver-memory 8g \
  --executor-memory 8g \
  --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=384M \
  --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=384M \
  --packages "$SPARK_PACKAGES"

In this case, we require more memory since we are going to load larger data. We also need to increase the size of PermGen, the part of JVM memory that stores metadata about loaded classes. This is only necessary if you are running on Java 7.
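If you are unsure which case applies, a quick check of the JVM version before launching the shell settles it (a minimal sketch; on Java 8 and later, PermGen no longer exists, so the two MaxPermSize options above can simply be omitted):

# Print the JVM version Spark will use; the -XX:MaxPermSize options only matter on Java 7.
java -version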

The memory settings for Spark jobs are an important part of job launching. In the simple local[*] scenario we are using, there is no difference between the Spark driver and executors, since both run inside the same JVM. However, for a larger job deployed on a standalone or YARN Spark cluster, the configuration of driver memory and executor memory needs to reflect the size of the data and the transformations performed.
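For illustration only, a cluster submission might set these limits explicitly; the memory values, executor count, class name, and JAR name below are placeholders to adjust for your own cluster and data:

# Hypothetical YARN submission with explicit memory limits; tune the values to your workload.
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --num-executors 10 \
  --packages "$SPARK_PACKAGES" \
  --class com.example.ChurnPipeline \
  my-application.jar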
Moreover, as we discussed in the previous chapter, you can mitigate memory pressure by using a clever caching strategy and the right cache destination (for example, disk or off-heap memory).
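As a brief reminder of how that looks in the Spark shell, the following sketch (the input path is hypothetical) caches a DataFrame on disk rather than on the JVM heap:

import org.apache.spark.storage.StorageLevel

// Hypothetical input file; any DataFrame loaded in the shell works the same way.
val rawData = spark.read.option("header", "true").csv("data/loans.csv")

// Cache partitions on local disk instead of the JVM heap to relieve memory pressure.
rawData.persist(StorageLevel.DISK_ONLY)
rawData.count()   // the first action materializes the cache

// StorageLevel.OFF_HEAP is another option, provided spark.memory.offHeap.enabled
// and spark.memory.offHeap.size are set when the shell is launched.

// Release the cached data once it is no longer needed.
rawData.unpersist()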