When your computation is done, it is a good idea to stop your cluster to avoid incurring additional cost. To stop the cluster, execute the following command from your local machine:
$ SPARK_HOME/ec2/spark-ec2 --region=<ec2-region> stop <cluster-name>
For our case, it would be the following:
$ SPARK_HOME/ec2/spark-ec2 --region=eu-west-1 stop ec2-spark-cluster-1
To restart the cluster later on, execute the following command:
$ SPARK_HOME/ec2/spark-ec2 --identity-file=<key-file> --region=<ec2-region> start <cluster-name>
For our case, it will be something like the following:
$ SPARK_HOME/ec2/spark-ec2 --identity-file=/usr/local/key/-key-pair.pem --region=eu-west-1 start ec2-spark-cluster-1
Finally, to terminate your Spark cluster on AWS, use the following command:
$ SPARK_HOME/ec2/spark-ec2 --region=<ec2-region> destroy <cluster-name>
In our case, it would be the following:
$ SPARK_HOME/ec2/spark-ec2 --region=eu-west-1 destroy ec2-spark-cluster-1
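For convenience, the three lifecycle commands above can be wrapped in a small shell helper. This is only a sketch: the cluster name and region are taken from the examples above, the key file path is a placeholder, and the function prints each command rather than executing it so you can review it first:

```shell
#!/bin/sh
# Sketch: compose the spark-ec2 lifecycle commands shown above.
# CLUSTER, REGION, and KEY_FILE are placeholders -- adjust for your setup.
CLUSTER=ec2-spark-cluster-1
REGION=eu-west-1
KEY_FILE=/path/to/key-pair.pem   # placeholder key file

spark_ec2_cmd() {
    case "$1" in
        stop)    echo "$SPARK_HOME/ec2/spark-ec2 --region=$REGION stop $CLUSTER" ;;
        start)   echo "$SPARK_HOME/ec2/spark-ec2 --identity-file=$KEY_FILE --region=$REGION start $CLUSTER" ;;
        destroy) echo "$SPARK_HOME/ec2/spark-ec2 --region=$REGION destroy $CLUSTER" ;;
        *)       echo "usage: spark_ec2_cmd {stop|start|destroy}" >&2; return 1 ;;
    esac
}

# Print (rather than run) the stop command so it can be reviewed first:
spark_ec2_cmd stop
```

Once reviewed, piping a printed command to sh would actually execute it, for example: spark_ec2_cmd stop | sh.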
Spot instances are great for reducing AWS costs, sometimes cutting instance costs by a whole order of magnitude. A step-by-step guide to using this facility can be accessed at http://blog.insightdatalabs.com/spark-cluster-step-by-step/.
Sometimes it is difficult to move a large dataset, say 1 TB of raw data. In that case, if you want your application to scale up even further for large-scale datasets, the fastest approach is to load the data from Amazon S3 or an EBS device into HDFS on your nodes and specify the data file path using hdfs://.
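One common way to perform such an S3-to-HDFS copy is Hadoop's distcp tool. The following is only a sketch: the bucket and target path are hypothetical placeholders, and the command is printed for review rather than executed:

```shell
#!/bin/sh
# Sketch: compose a distcp command that copies an S3 dataset into HDFS.
# Both paths below are hypothetical placeholders.
SRC=s3n://my-bucket/raw-data       # hypothetical S3 bucket
DST=hdfs:///user/spark/raw-data    # hypothetical HDFS target
CMD="hadoop distcp $SRC $DST"

# On the cluster master, running $CMD performs the copy;
# here we only print it so it can be checked first:
echo "$CMD"
```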
In general, Spark can access input data in the following ways:
1. From URIs/URLs (including HTTP) via http://
2. From Amazon S3 via s3n://
3. From HDFS via hdfs://
If the HADOOP_CONF_DIR environment variable is set, file paths are usually resolved as hdfs://...; otherwise, as file://.
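The resolution rule above can be sketched as a small helper. This is only an illustration of the rule, not Spark's actual implementation, and the Hadoop configuration path used is hypothetical:

```shell
#!/bin/sh
# Sketch: which filesystem prefix bare paths default to, per the rule above.
default_fs() {
    if [ -n "${HADOOP_CONF_DIR:-}" ]; then
        echo "hdfs://"    # Hadoop config present: resolve against HDFS
    else
        echo "file://"    # no Hadoop config: resolve against local files
    fi
}

HADOOP_CONF_DIR=/etc/hadoop/conf   # hypothetical config directory
default_fs                         # prints hdfs://
unset HADOOP_CONF_DIR
default_fs                         # prints file://
```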