Setting up Spark

Apache Spark is a project in the Hadoop ecosystem (refer to the Using HDFS recipe) that is reported to perform better than Hadoop's MapReduce, mainly because Spark keeps data in memory as much as possible. Spark also has good support for machine learning; in the Clustering data with Spark recipe, we will apply a machine learning algorithm via Spark.

Spark can run standalone, but it is designed to work with Hadoop using HDFS. Resilient Distributed Datasets (RDDs) are the central data structure in Spark; they represent a collection of items distributed over a cluster. Spark has first-class support for Scala, which is a JVM language, while its Python support lags somewhat behind; for instance, streaming support in the pyspark API trails the Scala API. Spark also has the concept of DataFrames, but these are implemented by Spark itself rather than through pandas.
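
As a minimal sketch of these two abstractions, the following Python snippet creates an RDD and a Spark DataFrame from a small in-memory list. It assumes pyspark is importable (see the setup steps below); the application name rdd_demo is arbitrary:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # Local SparkContext using all available cores; 'rdd_demo' is an
    # arbitrary application name.
    sc = SparkContext('local[*]', 'rdd_demo')

    # An RDD: a distributed collection of Python objects.
    rdd = sc.parallelize([('a', 1), ('b', 2), ('c', 3)])
    print(rdd.map(lambda kv: kv[1] * 2).collect())  # [2, 4, 6]

    # A Spark DataFrame (not a pandas DataFrame) built from the RDD.
    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame(rdd, ['key', 'value'])
    df.show()

    sc.stop()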

Getting ready

Download Spark from the downloads page at https://spark.apache.org/downloads.html (retrieved September 2015). I downloaded the spark-1.5.0-bin-hadoop2.6.tgz archive for Spark 1.5.0.

Unpack the archive in an appropriate directory.

How to do it…

The following steps illustrate a basic setup for Spark, including a few optional steps; a short verification sketch follows the list:

  1. If you want to use a Python version different from the system Python, set the PYSPARK_PYTHON environment variable via the GUI of your operating system or the CLI, as follows:
    $ export PYSPARK_PYTHON=/path/to/anaconda/bin/python
    
  2. Set the SPARK_HOME environment variable, as follows:
    $ export SPARK_HOME=<path/to/spark/>spark-1.5.0-bin-hadoop2.6
    
  3. Add Spark's python directory to your PYTHONPATH environment variable, as follows:
    $ export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
    
  4. Add the ZIP of py4j to your PYTHONPATH environment variable, as follows:
    $ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
    
  5. If Spark's logging is too verbose, copy the log4j.properties.template file in the $SPARK_HOME/conf directory to log4j.properties and change the INFO levels (for example, the log4j.rootCategory setting) to WARN.
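
To verify the setup, you can create a SparkContext from a plain Python interpreter. This is a minimal sketch; it assumes the environment variables above are exported in the current shell, and the application name setup_check is arbitrary:

    from pyspark import SparkContext

    # If PYTHONPATH is set correctly, this import succeeds and a
    # local context can be created.
    sc = SparkContext('local', 'setup_check')
    print(sc.version)  # should print 1.5.0 for this archive
    sc.stop()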

See also

The official Spark website is at http://spark.apache.org/ (retrieved September 2015)
