Running spark-submit

In a console, go to the Spark distribution's bin folder and run spark-submit with the path of the assembly JAR:

cd ~/spark-2.3.1-bin-hadoop2.7/bin
./spark-submit /tmp/sbt/bitcoin-analyser/scala-2.11/bitcoin-analyser-assembly-0.1.jar

spark-submit has many options that let you change the Spark master, the number of executors, their memory requirements, and so on. You can find out more by running spark-submit -h.
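For example, here is a sketch of a submission to a standalone cluster with explicit resource settings. The master URL and resource values are placeholders to adapt to your own cluster; only the class name and the JAR path come from our application:

./spark-submit \
  --master spark://your-master-host:7077 \
  --executor-memory 2G \
  --total-executor-cores 4 \
  --class coinyser.BatchProducerAppSpark \
  /tmp/sbt/bitcoin-analyser/scala-2.11/bitcoin-analyser-assembly-0.1.jar

After you have submitted the JAR, you should see something like this in your console (we only show the most important parts):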

(...)
2018-09-07 07:55:27 INFO SparkContext:54 - Running Spark version 2.3.1
2018-09-07 07:55:27 INFO SparkContext:54 - Submitted application: coinyser.BatchProducerAppSpark
(...)
2018-09-07 07:55:28 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
(...)
2018-09-07 07:55:28 INFO SparkContext:54 - Added JAR file:/tmp/sbt/bitcoin-analyser/scala-2.11/bitcoin-analyser-assembly-0.1.jar at spark://192.168.0.11:38371/jars/bitcoin-analyser-assembly-0.1.jar with timestamp 1536303328633
2018-09-07 07:55:28 INFO Executor:54 - Starting executor ID driver on host localhost
(...)
2018-09-07 07:55:28 INFO NettyBlockTransferService:54 - Server created on 192.168.0.11:37370
(...)
2018-09-07 07:55:29 INFO BatchProducerApp$:23 - calling https://www.bitstamp.net/api/v2/transactions/btcusd?time=day
(...)
2018-09-07 07:55:37 INFO SparkContext:54 - Starting job: parquet at BatchProducer.scala:115
(...)
2018-09-07 07:55:39 INFO DAGScheduler:54 - Job 0 finished: parquet at BatchProducer.scala:115, took 2.163065 s

You would see similar output if you were using a remote cluster. Here is what the most important lines mean:

  • Submitted application: This tells us which main class was submitted. It corresponds to the mainClass setting that we put in our SBT build file, and it can be overridden with the --class option of spark-submit.
  • A few lines further down, we can see that Spark started a SparkUI web server on port 4040. With your web browser, go to http://localhost:4040 to explore this UI. It lets you see the progress of running jobs, their execution plan, how many executors they use, the logs of the executors, and so on. SparkUI is an invaluable tool when you need to optimize your jobs.
  • Added JAR file: Before Spark can run our application, it must distribute the assembly JAR to all the cluster nodes. To do this, it serves the JAR file on port 38371; the executors then connect to that server to download it.
  • Starting executor ID driver: The driver process coordinates the execution of jobs with the executors. The following line shows that a NettyBlockTransferService was started on port 37370; the executors connect to it to exchange data blocks with the driver.
  • calling https://...: This corresponds to what we logged in our code, logger.info(s"calling $url").
  • Starting job: Our application started a Spark job. Line 115 in BatchProducer corresponds to the .parquet(path.toString) instruction in BatchProducer.save. This parquet method is an action and as such triggers the evaluation of the Dataset. A hedged sketch of such a save function follows this list.
  • Job 0 finished: The job finishes after a few seconds.
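To make the last two points concrete, here is a minimal sketch of what a save function such as BatchProducer.save could look like. This is not the book's exact code: the Transaction fields and the Append write mode are assumptions made for illustration; only the .parquet call, which triggers the job seen in the log above, comes from the source.

import java.net.URI
import org.apache.spark.sql.{Dataset, SaveMode}

object SaveSketch {
  // Assumed schema for illustration; the real Transaction class may differ.
  final case class Transaction(
    timestamp: java.sql.Timestamp,
    tid: Long,
    price: Double,
    amount: Double)

  def save(transactions: Dataset[Transaction], path: URI): Unit =
    transactions.write
      .mode(SaveMode.Append)   // assumption: keep batches from previous runs
      .parquet(path.toString)  // action: triggers "Starting job: parquet at ..."
}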

After this point, you should have a Parquet file saved with the latest transactions. If you let the application run for another hour, you will see that it starts a new job to fetch the last hour of transactions.
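To check that the data was written correctly, one quick option is to read the Parquet file back in a spark-shell; the path below is a placeholder for wherever your application saves its data:

cd ~/spark-2.3.1-bin-hadoop2.7/bin
./spark-shell
scala> spark.read.parquet("/your/transactions/path").show(5)

spark.read.parquet infers the schema from the saved file, and show(5) prints the first five rows so that you can eyeball the transactions.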
