CHAPTER 14


Using Apache Spark

Apache Spark is a data processing engine for large data sets. Apache Spark is much faster (up to 100 times faster in memory) than Apache Hadoop MapReduce. In cluster mode, Spark applications run as independent sets of processes coordinated by the SparkContext object in the driver program, which is the main program. The SparkContext may connect to several types of cluster managers, which allocate resources to Spark applications. The supported cluster managers include the Standalone cluster manager, Apache Mesos, and Hadoop YARN. Apache Spark is designed to access data from varied data sources including HDFS, Apache HBase, and NoSQL databases such as Apache Cassandra and MongoDB. In this chapter we shall use the same CDH Docker image that we used for several of the Apache Hadoop frameworks, including Apache Hive and Apache HBase, and run Apache Spark applications on the YARN cluster manager in a Docker container.

  • Setting the Environment
  • Running the Docker Container for CDH
  • Running Apache Spark Job in yarn-cluster Mode
  • Running Apache Spark Job in yarn-client Mode
  • Running the Apache Spark Shell

Setting the Environment

The following software is required for this chapter.

  • Docker Engine (version 1.8)
  • Docker image for CDH (svds/cdh), which includes Apache Spark

Connect to an Amazon EC2 instance using the public IP address for the instance. The public IP address may be found from the Amazon EC2 Console as explained in Appendix A.

ssh -i "docker.pem" [email protected]

Start the Docker service and verify that it has started.

sudo service docker start
sudo service docker status

Download the svds/cdh Docker image for CDH, if it was not already downloaded for an earlier chapter.

sudo docker pull svds/cdh

Docker image svds/cdh gets downloaded as shown in Figure 14-1.

9781484218297_Fig14-01.jpg

Figure 14-1. Downloading svds/cdh Docker Image

Running the Docker Container for CDH

Start a Docker container for the CDH frameworks, publishing container port 8088, the HTTP port of the YARN ResourceManager from which Spark applications on YARN are tracked.

sudo docker run -p 8088 -d --name cdh svds/cdh
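
Because -p 8088 publishes the container port on an ephemeral host port, the host port that was assigned can be found, if needed, with the sudo docker port cdh 8088 command.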

List the running Docker containers.

sudo docker ps

The CDH processes, including Apache Spark, get started, and the cdh container gets listed as running, as shown in Figure 14-2.

9781484218297_Fig14-02.jpg

Figure 14-2. Starting Docker Container for CDH including Apache Spark

Start an interactive terminal for the cdh container.

sudo docker exec -it cdh bash

The interactive terminal gets started as shown in Figure 14-3.

9781484218297_Fig14-03.jpg

Figure 14-3. Starting the TTY

In YARN mode, a Spark application may be submitted to a cluster in yarn-cluster mode or yarn-client mode. In yarn-cluster mode, the Spark driver runs inside an Application Master, which is managed by YARN. In yarn-client mode, the Spark driver runs in the client process outside of YARN, and the Application Master is used only for requesting resources from YARN. The --master parameter is set to yarn-cluster or yarn-client based on the mode of application submission. In yarn-client mode, the Spark driver logs to the console.

We shall run a Spark application using each of the application submission modes. We shall use the example application org.apache.spark.examples.SparkPi.
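
SparkPi estimates the value of Pi with a Monte Carlo simulation: random points are scattered over a square, and the fraction that lands inside the inscribed circle approximates Pi/4. The following Scala sketch is modeled on the bundled example (the bundled source may differ in detail); the command-line argument, 1000 in the runs below, sets the number of slices, that is, RDD partitions, and with it the number of sample points.

import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}

object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val sc = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    // Sample n points in the square [-1, 1) x [-1, 1) and count
    // those that fall inside the unit circle; that fraction
    // approximates Pi/4.
    val count = sc.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    sc.stop()
  }
}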

Running Apache Spark Job in yarn-cluster Mode

To submit the Spark application SparkPi in yarn-cluster mode with the argument 1000 (the number of slices), run the following spark-submit command with the --master parameter set to yarn-cluster.

spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000

The preceding command is run from the interactive terminal as shown in Figure 14-4.

9781484218297_Fig14-04.jpg

Figure 14-4. Submitting the Spark Application in yarn-cluster Mode
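
In addition to the application arguments, spark-submit accepts resource options for YARN, such as --num-executors, --executor-memory, and --executor-cores; the runs in this chapter leave these at their defaults.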

The output from the Spark application is shown in Figure 14-5.

9781484218297_Fig14-05.jpg

Figure 14-5. Output from Spark Job in yarn-cluster Mode

A more detailed output from the spark-submit command is listed:

spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000
15/10/23 19:12:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/23 19:12:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/10/23 19:12:56 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
15/10/23 19:12:56 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/10/23 19:12:56 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/10/23 19:12:56 INFO yarn.Client: Setting up container launch context for our AM
15/10/23 19:12:56 INFO yarn.Client: Preparing resources for our AM container
15/10/23 19:12:59 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/10/23 19:12:59 INFO yarn.Client: Uploading resource file:/usr/lib/spark/lib/spark-assembly-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar -> hdfs://localhost:8020/user/root/.sparkStaging/application_1445627521793_0001/spark-assembly-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar
15/10/23 19:13:05 INFO yarn.Client: Uploading resource file:/usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar -> hdfs://localhost:8020/user/root/.sparkStaging/application_1445627521793_0001/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar
15/10/23 19:13:06 INFO yarn.Client: Setting up the launch environment for our AM container
15/10/23 19:13:07 INFO spark.SecurityManager: Changing view acls to: root
15/10/23 19:13:07 INFO spark.SecurityManager: Changing modify acls to: root
15/10/23 19:13:07 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/10/23 19:13:07 INFO yarn.Client: Submitting application 1 to ResourceManager
15/10/23 19:13:08 INFO impl.YarnClientImpl: Submitted application application_1445627521793_0001
15/10/23 19:13:09 INFO yarn.Client: Application report for application_1445627521793_0001 (state: ACCEPTED)
15/10/23 19:13:09 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: root.root
         start time: 1445627587658
         final status: UNDEFINED
         tracking URL: http://4b4780802318:8088/proxy/application_1445627521793_0001/
         user: root
15/10/23 19:13:10 INFO yarn.Client: Application report for application_1445627521793_0001 (state: ACCEPTED)
15/10/23 19:13:11 INFO yarn.Client: Application report for application_1445627521793_0001 (state: ACCEPTED)
15/10/23 19:13:24 INFO yarn.Client: Application report for application_1445627521793_0001 (state: RUNNING)
15/10/23 19:13:24 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 4b4780802318
         ApplicationMaster RPC port: 0
         queue: root.root
         start time: 1445627587658
         final status: UNDEFINED
         tracking URL: http://4b4780802318:8088/proxy/application_1445627521793_0001/
         user: root
15/10/23 19:13:25 INFO yarn.Client: Application report for application_1445627521793_0001 (state: RUNNING)
15/10/23 19:13:26 INFO yarn.Client: Application report for
15/10/23 19:13:51 INFO yarn.Client: Application report for application_1445627521793_0001 (state: FINISHED)
15/10/23 19:13:51 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 4b4780802318
         ApplicationMaster RPC port: 0
         queue: root.root
         start time: 1445627587658
         final status: SUCCEEDED
         tracking URL: http://4b4780802318:8088/proxy/application_1445627521793_0001/
         user: root

In yarn-cluster mode, the result of the Spark application is not output to the console. If the final status is SUCCEEDED, the result has to be accessed from the YARN container logs, which are accessible from the ResourceManager at the tracking URL http://4b4780802318:8088/proxy/application_1445627521793_0001/ in a browser.
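
If YARN log aggregation is enabled, the same container logs may also be fetched from the command line with the yarn logs -applicationId application_1445627521793_0001 command.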

Running Apache Spark Job in yarn-client Mode

To submit the Spark application SparkPi in yarn-client mode with the argument 1000, run the following spark-submit command with the --master parameter set to yarn-client. The command may be split across lines with the \ continuation character.

spark-submit \
     --master yarn-client \
     --class org.apache.spark.examples.SparkPi \
     /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar \
     1000

The output from the spark-submit command is shown in Figure 14-6.

9781484218297_Fig14-06.jpg

Figure 14-6. Submitting Spark Application in yarn-client Mode

A more detailed output from the Apache Spark application follows; it includes the approximate value of Pi that was calculated.

spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000
15/10/23 19:15:19 INFO spark.SparkContext: Running Spark version 1.3.0
15/10/23 19:15:43 INFO cluster.YarnScheduler: Adding task set 0.0 with 1000 tasks
15/10/23 19:15:43 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 4b4780802318, PROCESS_LOCAL, 1353 bytes)
15/10/23 19:15:43 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 4b4780802318, PROCESS_LOCAL, 1353 bytes)
15/10/23 19:15:57 INFO scheduler.TaskSetManager: Finished task 999.0 in stage 0.0 (TID 999) in 22 ms on 4b4780802318 (999/1000)
15/10/23 19:15:57 INFO scheduler.TaskSetManager: Finished task 998.0 in stage 0.0 (TID 998) in 28 ms on 4b4780802318 (1000/1000)
15/10/23 19:15:57 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/10/23 19:15:57 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 14.758 s
15/10/23 19:15:57 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:35, took 15.221643 s

Pi is roughly 3.14152984
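
The approximation is a Monte Carlo estimate, so its error shrinks only with the square root of the number of sample points; with the argument 1000 (100 million points under the sketch shown earlier), the result 3.14152984 agrees with Pi to about four decimal places.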

Running the Apache Spark Shell

The Apache Spark shell is started in yarn-client mode as follows.

spark-shell --master yarn-client

The scala> command prompt gets displayed as shown in Figure 14-7. A Spark context gets created and becomes available as sc. A SQL context also becomes available as sqlContext.

9781484218297_Fig14-07.jpg

Figure 14-7. The scala> Command Prompt

A more detailed output from the spark-shell command is as follows.

root@4b4780802318:/# spark-shell --master yarn-client
15/10/23 19:17:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/23 19:17:16 INFO spark.SecurityManager: Changing view acls to: root
15/10/23 19:17:16 INFO spark.SecurityManager: Changing modify acls to: root
15/10/23 19:17:16 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/10/23 19:17:16 INFO spark.HttpServer: Starting HTTP Server
15/10/23 19:17:16 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/10/23 19:17:16 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:56899
15/10/23 19:17:16 INFO util.Utils: Successfully started service 'HTTP class server' on port 56899.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/23 19:17:22 INFO spark.SparkContext: Running Spark version 1.3.0
15/10/23 19:17:45 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.
15/10/23 19:17:45 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
15/10/23 19:17:45 INFO storage.BlockManagerMasterActor: Registering block manager 4b4780802318:48279 with 530.3 MB RAM, BlockManagerId(2, 4b4780802318, 48279)
scala>
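
With the scala> prompt available, the sc context may be used directly. The following is a minimal sketch, with illustrative values, entered at the prompt:

val rdd = sc.parallelize(1 to 100)
rdd.filter(_ % 2 == 0).count()

The count() action triggers the computation on the YARN cluster and should return 50, the number of even values, which the shell prints as a res variable.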

Run the following Scala script, consisting of a HelloWorld object, in the Spark shell for a Hello World program.

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
  }
}
HelloWorld.main(null)

The output from the Scala script is shown in Figure 14-8.

9781484218297_Fig14-08.jpg

Figure 14-8. Output from Scala Script
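
If the HelloWorld script is saved to a file, HelloWorld.scala for example (the filename is arbitrary), it may also be run in the shell with the :load HelloWorld.scala command, or pasted as a single block using the :paste command.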

Summary

In this chapter, we ran Apache Spark applications on a YARN cluster in a Docker container using the spark-submit command. We submitted the example application in yarn-cluster and yarn-client modes. We also ran a HelloWorld Scala script in a Spark shell.

This chapter concludes the book on Docker. In addition to running some of the commonly used software on Docker, we discussed the main Docker administrative tasks such as installing Docker, downloading a Docker image, creating and running a Docker container, starting an interactive shell, running commands in an interactive shell, listing Docker containers, listing Docker container logs, stopping a Docker container, and removing a Docker container and a Docker image. Only a few of the software applications could be discussed in the scope of this book. Several more Docker images are available on Docker Hub at https://hub.docker.com/.
