Apache Spark is a project in the Hadoop ecosystem (refer to the Using HDFS recipe) that generally performs better than Hadoop's MapReduce because it keeps data in memory as much as possible. Spark also has good support for machine learning; in the Clustering data with Spark recipe, we will apply a machine learning algorithm via Spark.
Spark can run standalone, but it is designed to work with Hadoop and HDFS. Resilient Distributed Datasets (RDDs) are Spark's central data structure, and they represent data distributed across a cluster. Spark has good support for Scala, which is a JVM language, while its Python support lags somewhat behind; for instance, streaming support in the pyspark API trails the Scala API. Spark also has DataFrames, but these are Spark's own implementation rather than a layer on top of pandas.
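To make the RDD and DataFrame concepts concrete, here is a minimal pyspark sketch run in local mode (the application name and the toy data are arbitrary, and the DataFrame part uses the Spark 1.5 SQLContext API):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext('local', 'rdd_demo')   # local mode, no cluster required
rdd = sc.parallelize(range(10))          # an RDD holding the numbers 0..9
print(rdd.map(lambda x: x * x).sum())    # transformations and actions run on the RDD

sqlc = SQLContext(sc)
# a Spark DataFrame, built by Spark itself rather than by pandas
df = sqlc.createDataFrame([(1, 'a'), (2, 'b')], ['number', 'letter'])
df.show()
sc.stop()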
Download Spark from the downloads page at https://spark.apache.org/downloads.html (retrieved September 2015). I downloaded the spark-1.5.0-bin-hadoop2.6.tgz archive for Spark 1.5.0.
Unpack the archive in an appropriate directory.
The following steps illustrate a basic setup for Spark with a few optional steps:
1. Set the PYSPARK_PYTHON environment variable via the GUI of your operating system or the CLI, as follows:
$ export PYSPARK_PYTHON=/path/to/anaconda/bin/python
2. Set the SPARK_HOME environment variable, as follows:
$ export SPARK_HOME=<path/to/spark/>spark-1.5.0-bin-hadoop2.6
3. Add the python directory to your PYTHONPATH environment variable, as follows:
$ export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
4. Add py4j to your PYTHONPATH environment variable, as follows:
$ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
5. Rename the log4j.properties.template file in the $SPARK_HOME/conf directory to log4j.properties and change the INFO log level to WARN.
The official Spark website is at http://spark.apache.org/ (retrieved September 2015).
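Once these variables are set, pyspark can be imported from a regular Anaconda Python session rather than only through the bundled pyspark shell. The following is a minimal sketch to check the setup (the 'setup_check' application name is just a placeholder):

import pyspark  # resolves via the PYTHONPATH entries added above

sc = pyspark.SparkContext('local', 'setup_check')
print(sc.version)                         # should print 1.5.0 for this archive
print(sc.parallelize([1, 2, 3]).count())  # should print 3
sc.stop()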