Chapter 11. Packaging Spark Applications

So far we have been working with a very convenient way of developing code in Spark: Jupyter notebooks. Such an approach is great when you want to develop a proof of concept and document what you do along the way.

However, Jupyter notebooks will not work if you need to schedule a job so that it runs every hour. They also make it fairly hard to package your application, as it is not easy to split a script into logical chunks with well-defined APIs: everything sits in a single notebook.

In this chapter, we will learn how to write your scripts as reusable modules and submit jobs to Spark programmatically.

Before you begin, however, you might want to check out the Bonus Chapter 2, Free Spark Cloud Offering where we provide instructions on how to subscribe and use either Databricks' Community Edition or Microsoft's HDInsight Spark offerings; the instructions on how to do so can be found here: https://www.packtpub.com/sites/default/files/downloads/FreeSparkCloudOffering.pdf.

In this chapter you will learn:

  • What the spark-submit command is
  • How to package and deploy your app programmatically
  • How to modularize your Python code and submit it along with a PySpark script

The spark-submit command

The entry point for submitting jobs to Spark (be it locally or on a cluster) is the spark-submit script. The script, however, allows you not only to submit jobs (although that is its main purpose), but also to kill jobs or check their status.

Note

Under the hood, the spark-submit command passes the call to the spark-class script that, in turn, starts a launcher Java application. For those interested, you can check the GitHub repository for Spark: https://github.com/apache/spark/blob/master/bin/spark-submit.

The spark-submit command provides a unified API for deploying apps on a variety of cluster managers supported by Spark (such as Mesos or YARN), thus relieving you from configuring your application for each of them separately.

On the general level, the syntax looks as follows:

spark-submit [options] <python file> [app arguments]

We will go through the list of all the options soon. The app arguments are the parameters you want to pass to your application.
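
For example, a hypothetical invocation that passes two application arguments to a Python script could look like this (the script name and the two paths are placeholders):

spark-submit --master local[4] process_data.py /data/input.csv /data/output

Everything after the script name (here, the two paths) is handed over to your application as its arguments.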

Note

You can either parse the parameters from the command line yourself using sys.argv (after importing sys) or you can utilize the argparse module for Python.
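
As a minimal sketch, the application arguments from the previous example could be parsed with argparse as follows; the argument names and the --limit option are hypothetical and serve only as an illustration:

import argparse

# Everything that follows the script name on the spark-submit command
# line ends up in sys.argv and is parsed here
parser = argparse.ArgumentParser(description='A sample PySpark job')
parser.add_argument('input_path', help='path to the input data')
parser.add_argument('output_path', help='path to write the results to')
parser.add_argument('--limit', type=int, default=100,
                    help='maximum number of records to process')
args = parser.parse_args()

print(args.input_path, args.output_path, args.limit)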

Command line parameters

You can pass a host of different parameters to the Spark engine when using spark-submit.

Note

In what follows we will cover only the parameters specific to Python (as spark-submit can also be used to submit applications written in Scala or Java and packaged as .jar files).

We will now go through the parameters one by one so that you have a good overview of what you can do from the command line; an example invocation that combines several of them follows the list:

  • --master: Parameter used to set the URL of the master (head) node. Allowed syntax is:
    • local: Used for executing your code on your local machine. If you pass local, Spark will run in a single thread (without leveraging any parallelism). On a multi-core machine you can specify either the exact number of cores for Spark to use by stating local[n], where n is the number of cores to use, or have Spark spin up as many threads as there are cores on the machine by using local[*].
    • spark://host:port: The URL and port of a Spark standalone cluster (one that does not run any external cluster manager such as Mesos or YARN).
    • mesos://host:port: The URL and port of a Spark cluster deployed over Mesos.
    • yarn: Used to submit jobs from a head node that runs YARN as the resource manager.
  • --deploy-mode: Parameter that allows you to decide whether to launch the Spark driver process locally (using client) or on one of the worker machines inside the cluster (using the cluster option). The default for this parameter is client. Here's an excerpt from Spark's documentation that explains the differences with more specificity (source: http://bit.ly/2hTtDVE):

    A common deployment strategy is to submit your application from [a screen session on] a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).

    Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Currently, standalone mode does not support cluster mode for Python applications.

  • --name: Name of your application. Note that if you specified the name of your app programmatically when creating SparkSession (we will get to that in the next section) then the parameter from the command line will be overridden. We will explain the precedence of parameters shortly when discussing the --conf parameter.
  • --py-files: Comma-delimited list of .py, .egg or .zip files to include for Python apps. These files will be delivered to each executor for use. Later in this chapter we will show you how to package your code into a module.
  • --files: Comma-delimited list of files that will also be delivered to each executor to use.
  • --conf: Parameter to change a configuration of your app dynamically from the command line. The syntax is <Spark property>=<value for the property>. For example, you can pass --conf spark.local.dir=/home/SparkTemp/ or --conf spark.app.name=learningPySpark; the latter would be an equivalent of submitting the --name property as explained previously.

    Note

    Spark uses the configuration parameters from three places: the parameters from the SparkConf you specify when creating SparkContext within your app take the highest precedence, then any parameter that you pass to the spark-submit script from the command line, and lastly, any parameter that is specified in the conf/spark-defaults.conf file.

  • --properties-file: Path to a file with the configuration. It should contain the same kind of properties as the conf/spark-defaults.conf file, as it will be read instead of it.
  • --driver-memory: Parameter that specifies how much memory to allocate for the application on the driver. Allowed values use a syntax such as 1000M or 2G. The default is 1024M.
  • --executor-memory: Parameter that specifies how much memory to allocate for the application on each of the executors. The default is 1G.
  • --help: Shows the help message and exits.
  • --verbose: Prints additional debug information when running your app.
  • --version: Prints the version of Spark.
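
To see how these options fit together, here is a hypothetical submission to a Spark standalone cluster; the host name, the .zip file, the script name, and the resource values are placeholders:

spark-submit \
    --master spark://head-node:7077 \
    --deploy-mode client \
    --name learningPySpark \
    --py-files additionalCode.zip \
    --driver-memory 2G \
    --executor-memory 4G \
    --conf spark.local.dir=/home/SparkTemp/ \
    process_data.py /data/input.csv /data/output

The two paths at the end are the app arguments and are passed straight to the script, as described earlier.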

In Spark standalone with the cluster deploy mode only, or on a cluster deployed over YARN, you can use --driver-cores, which allows you to specify the number of cores for the driver (the default is 1). In Spark standalone or Mesos with the cluster deploy mode only, you can also use any of the following:

  • --supervise: Parameter that, if specified, will restart the driver if it is lost or fails. This can also be achieved on YARN by setting --deploy-mode to cluster
  • --kill: Kills the process given its submission_id
  • --status: Requests the status of the app with the given submission_id

In Spark standalone and Mesos only (with the client deploy mode), you can also specify --total-executor-cores, a parameter that requests the specified number of cores for all executors combined (not for each one). On the other hand, in Spark standalone and YARN only, the --executor-cores parameter specifies the number of cores per executor (it defaults to 1 in YARN mode, or to all available cores on the worker in standalone mode).

In addition, when submitting to a YARN cluster you can specify the following (an example submission follows the list):

  • --queue: This parameter specifies a queue on YARN to submit the job to (default is default)
  • --num-executors: Parameter that specifies how many executors to request for the job. If dynamic allocation is enabled, the initial number of executors will be at least the number specified.
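
For example, a hypothetical submission to a YARN cluster in the cluster deploy mode might look as follows; the queue name, the resource values, and the script name are placeholders:

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --queue analytics \
    --num-executors 10 \
    --executor-cores 2 \
    --executor-memory 4G \
    process_data.py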

Now that we have discussed all the parameters, it is time to put them into practice.
