Apache Spark is a project in the Hadoop ecosystem (refer to the Using HDFS recipe) that generally performs better than Hadoop's MapReduce because it keeps data in memory as much as possible. Spark also has good support for machine learning; in the Clustering data with Spark recipe, we will apply a machine learning algorithm via Spark.
Spark can run standalone, but it is designed to work with Hadoop and HDFS. Resilient Distributed Datasets (RDDs) are Spark's central data structure, and they represent data distributed across a cluster. Spark has good support for Scala, which is a JVM language, while its Python support lags somewhat behind; for instance, streaming support in the pyspark API trails the Scala API. Spark also has DataFrames, but these are Spark's own implementation rather than a layer on top of pandas.
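To make the RDD and DataFrame concepts concrete, here is a minimal pyspark sketch run in local mode (the application name and the toy data are arbitrary, and the DataFrame part uses the Spark 1.5 SQLContext API):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext('local', 'rdd_demo')   # local mode, no cluster required
rdd = sc.parallelize(range(10))          # an RDD holding the numbers 0..9
print(rdd.map(lambda x: x * x).sum())    # transformations and actions run on the RDD

sqlc = SQLContext(sc)
# a Spark DataFrame, built by Spark itself rather than by pandas
df = sqlc.createDataFrame([(1, 'a'), (2, 'b')], ['number', 'letter'])
df.show()
sc.stop()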
Download Spark from the downloads page at https://spark.apache.org/downloads.html (retrieved September 2015). I downloaded the spark-1.5.0-bin-hadoop2.6.tgz archive for Spark 1.5.0.
Unpack the archive in an appropriate directory.
The following steps illustrate a basic setup for Spark with a few optional steps:
1. Set the PYSPARK_PYTHON environment variable via the GUI of your operating system or the CLI, as follows:
$ export PYSPARK_PYTHON=/path/to/anaconda/bin/python
2. Set the SPARK_HOME environment variable, as follows:
$ export SPARK_HOME=<path/to/spark/>spark-1.5.0-bin-hadoop2.6
3. Add the python directory to your PYTHONPATH environment variable, as follows:
$ export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
4. Add py4j to your PYTHONPATH environment variable, as follows:
$ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
5. Rename the log4j.properties.template file in the $SPARK_HOME/conf directory to log4j.properties and change the INFO log level to WARN.
The official Spark website is at http://spark.apache.org/ (retrieved September 2015).
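Once these variables are set, pyspark can be imported from a regular Anaconda Python session rather than only through the bundled pyspark shell. The following is a minimal sketch to check the setup (the 'setup_check' application name is just a placeholder):

import pyspark  # resolves via the PYTHONPATH entries added above

sc = pyspark.SparkContext('local', 'setup_check')
print(sc.version)                         # should print 1.5.0 for this archive
print(sc.parallelize([1, 2, 3]).count())  # should print 3
sc.stop()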