Using Spark Notebooks for quick iteration of ideas

In this section, we will answer the following questions:

  • What are Spark Notebooks?
  • How do you start Spark Notebooks?
  • How do you use Spark Notebooks?

Let's start by setting up a Jupyter Notebook-like environment for Spark. Spark Notebook is an interactive and reactive data science environment that uses Scala and Spark.

If we view the GitHub page (https://github.com/spark-notebook/spark-notebook), we can see that what the Notebooks do is actually very straightforward, as shown in the following screenshot:

If we look at a Spark Notebook, we can see that it looks very much like what Python developers use: a Jupyter Notebook. You have a text box that lets you enter some code, and the code is executed below the text box, in a familiar notebook format. This allows us to perform reproducible analysis with Apache Spark and the big data ecosystem.

So, we can use Spark Notebook as is; all we need to do is go to the Spark Notebook website and click on Quick Start to get a Notebook started, as shown in the following screenshot:

We need to make sure that we are running Java 7. We can see that the setup steps are also mentioned in the documentation, as shown in the following screenshot:

The main website for Spark Notebook is spark-notebook.io, where we can see many download options. A few of them are shown in the following screenshot:

We can download the TAR file and unzip it. You can use Spark Notebook, but in this book we will be using Jupyter Notebook. So, going back to our Jupyter environment, let's look at the accompanying PySpark code files. In the Chapter 3 Notebook, we have included a convenient way to set up the environment variables that get PySpark working with Jupyter, as shown in the following screenshot:

First, we need to create two new environment variables. If you are using Linux, you can add them to your .bashrc file; if you are using Windows, all you need to do is edit your system environment variables, and there are multiple tutorials online to help you do this. What we want to do here is set the PYSPARK_DRIVER_PYTHON variable and point it to your Jupyter Notebook installation. If you are on Anaconda, it would point to the Anaconda Jupyter launcher; since we are on WinPython, I have pointed it to my WinPython Jupyter Notebook launcher. The second environment variable we want to export is simply PYSPARK_DRIVER_PYTHON_OPTS, and a minimal sketch of both exports is shown below.
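The following is a sketch of what these two exports might look like in a Linux .bashrc. It is not the exact contents of the Chapter 3 Notebook: the notebook directory and port are placeholder values, and on Anaconda or WinPython the jupyter path would be the one from your own installation.

export PYSPARK_DRIVER_PYTHON=jupyter
# Placeholder options: adjust the notebook directory and port for your setup
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --notebook-dir=/path/to/notebooks --no-browser --port=8888'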

One suggestion, reflected in the options line above, is to include the notebook directory and the notebook app in the options, ask it not to open a browser, and tell it which port to bind to. In practice, if you are on a Windows WinPython environment, you don't really need this line and can simply skip it. After this has been done, simply restart PySpark from the command line. Instead of the console we saw before, it launches directly into a Jupyter Notebook instance, and we can use the spark and sc (SparkContext) variables inside the Notebook. So, let's test it out as follows:

sc

We instantly get access to our SparkContext, which tells us that Spark is at version 2.3.3, our master is local, and the app name is the Python Spark shell (PySparkShell), as shown in the following output:

SparkContext
Spark UI
Version: v2.3.3
Master: local[*]
AppName: PySparkShell
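If we prefer to confirm these details programmatically rather than reading them off the rendered summary, the SparkContext exposes them as attributes. This is a small illustrative check; the values in the comments are simply what this particular setup reports:

sc.version   # '2.3.3' in this setup
sc.master    # 'local[*]'
sc.appName   # 'PySparkShell'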

So, now we know how to create a Notebook-like environment for Spark in Jupyter. In the next section, we will look at sampling and filtering RDDs to pick out relevant data points.
