Chapter 3. Working with Spark and MLlib

Now that we are armed with the knowledge of where and how statistics and machine learning fit in the global data-driven enterprise architecture, let's turn to the specific implementations in Spark and MLlib, a machine learning library on top of Spark. Spark is a relatively new member of the big data ecosystem that is optimized for memory usage rather than disk. Data can still be spilled to disk as necessary, but Spark spills only if explicitly instructed to do so, or if the active dataset does not fit into memory. Spark stores lineage information so that it can recompute the active dataset if a node goes down or the information is erased from memory for some other reason. This is in contrast to the traditional MapReduce approach, where data is persisted to disk after each map or reduce task.
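
As a minimal sketch of lineage and explicit caching (assuming a local master and the RDD API of the 1.6/2.0 line; the dataset and storage level are chosen purely for illustration), the following shows that transformations only record lineage until an action forces computation, and that persist controls whether data may spill to disk:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LineageSketch {
  def main(args: Array[String]): Unit = {
    // A local SparkContext; on a cluster only the master URL would change
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 1000000)
    val squares = numbers.map(x => x.toLong * x)

    // Nothing has been computed yet; Spark has only recorded the lineage
    println(squares.toDebugString)

    // Keep the dataset in memory, spilling to disk only if it does not fit
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    println(squares.sum())   // first action: computes and caches the RDD
    println(squares.max())   // reuses the cached data, or recomputes lost
                             // partitions from the lineage
    sc.stop()
  }
}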

Spark is particularly suited to iterative or statistical machine learning algorithms over a distributed set of nodes and can scale out of core, that is, it can process datasets that do not fit in memory. The only limitations are the total memory and disk space available across all Spark nodes and the network speed. I will cover the basics of the Spark architecture and implementation in this chapter.
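
To make the iterative case concrete, here is a hedged sketch of a plain gradient descent loop over a cached RDD (the toy data, learning rate, and iteration count are invented for illustration); the cached dataset is rescanned in memory on every iteration instead of being reread from disk:

import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative").setMaster("local[*]"))

    // Toy one-dimensional data: y is roughly 3 * x (values are made up)
    val data = sc.parallelize(Seq((1.0, 3.1), (2.0, 5.9), (3.0, 9.2), (4.0, 11.8))).cache()

    var w = 0.0        // model weight to be learned
    val rate = 0.01    // learning rate, chosen arbitrarily
    for (i <- 1 to 100) {
      // Each iteration scans the cached dataset in memory
      val gradient = data.map { case (x, y) => (w * x - y) * x }.mean()
      w -= rate * gradient
    }
    println(s"estimated slope: $w")   // should approach 3.0
    sc.stop()
  }
}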

One can direct Spark to execute data pipelines either on a single node or across a set of nodes with a simple change of configuration parameters. Of course, this flexibility comes at the cost of a slightly heavier framework and longer setup times, but the framework is highly parallelizable, and as most modern laptops are already multicore and sufficiently powerful, this usually does not present a big issue.
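
For instance, the same pipeline can target a single machine or a cluster purely through the master setting, which is typically supplied from outside the code (for example, via spark-submit --master or, as in this sketch, a command-line argument); spark://master:7077 below is a placeholder cluster address:

import org.apache.spark.{SparkConf, SparkContext}

object MasterSketch {
  def main(args: Array[String]): Unit = {
    // The same code runs on one machine or on a cluster; only the master URL
    // changes ("local[*]" here, or, for example, "spark://master:7077" for a
    // standalone cluster)
    val master = if (args.nonEmpty) args(0) else "local[*]"
    val sc = new SparkContext(new SparkConf().setAppName("pipeline").setMaster(master))

    println(sc.parallelize(1 to 100).count())   // identical pipeline either way
    sc.stop()
  }
}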

In this chapter, we will cover the following topics:

  • Installing and configuring Spark if you haven't done so yet
  • Learning the basics of Spark architecture and why it is inherently tied to the Scala language
  • Learning why Spark is the next technology after sequential implementations and Hadoop MapReduce
  • Learning about Spark components
  • Looking at the simple implementation of word count in Scala and Spark
  • Looking at the streaming word count implementation
  • Seeing how to create Spark DataFrames from either a distributed file or a distributed database
  • Learning about Spark performance tuning

Setting up Spark

If you haven't done so yet, you can download the pre-built Spark package from http://spark.apache.org/downloads.html. The latest release at the time of writing is 1.6.1:

Figure 03-1. The download site at http://spark.apache.org with recommended selections for this chapter

Alternatively, you can build Spark yourself by cloning the full source tree from https://github.com/apache/spark:

$ git clone https://github.com/apache/spark.git
Cloning into 'spark'...
remote: Counting objects: 301864, done.
...
$ cd spark
$ sh ./dev/change-scala-version.sh 2.11
...
$ ./make-distribution.sh --name alex-build-2.6-yarn --skip-java-test --tgz -Pyarn -Phive -Phive-thriftserver -Pscala-2.11 -Phadoop-2.6
...

The command will download the necessary dependencies and create the spark-2.0.0-SNAPSHOT-bin-alex-build-2.6-yarn.tgz file in the Spark directory; the version is 2.0.0-SNAPSHOT, as 2.0.0 is the next release version at the time of writing. In general, you do not want to build from trunk unless you are interested in the latest features. If you want a released version, you can check out the corresponding tag. A full list of available versions can be obtained with the git branch -r command. The spark*.tgz file is all you need to run Spark on any machine that has a Java JRE.

The distribution comes with the docs/building-spark.md document that describes other options for building Spark, including the use of Zinc, an incremental Scala compiler. Full Scala 2.11 support is in the works for the next Spark 2.0.0 release.
