Getting ready

In this recipe, we are solely focusing on a Linux environment (we are using Ubuntu Server 16.04 LTS). The following prerequisites are required before you can proceed with the rest of the recipe:

  • A clean installation of a Linux distribution; in our case, we have installed Ubuntu Server 16.04 LTS on each machine in our cluster of three Dell R710s.
  • Each machine needs to be connected to the internet and accessible from your local machine. You will need the machines' IPs and their hostnames; on Linux, you can check the IP by issuing the ifconfig command and reading the inet addr field. To check your hostname, type cat /etc/hostname.
  • On each server, we added a user group called hadoop and then created a user called hduser, adding it to the hadoop group. Also, make sure that the hduser has sudo rights. If you do not know how to do this, check the See also section of this recipe; a minimal sketch also follows this list.
  • Make sure you can reach your servers via SSH. If you cannot, run sudo apt-get install openssh-server openssh-client on each server to install the necessary packages.
  • If you want to read and write to Hadoop and Hive, you need to have these two environments installed and configured on your cluster. Check https://data-flair.training/blogs/install-hadoop-2-x-on-ubuntu/ for Hadoop installation and configuration and http://www.bogotobogo.com/Hadoop/BigData_hadoop_Hive_Install_On_Ubuntu_16_04.php for Hive.
If you already have these two environments set up, some of the steps in our script will be redundant. However, we present all of the steps below, assuming you only want the Spark environment.

No other prerequisites are required.
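
If you do not yet have the hadoop group, the hduser account, or the SSH packages mentioned above, the following is a minimal sketch of the commands we would run on each Ubuntu server (the See also section of this recipe covers the user setup in more detail):

# Check the machine's IP address (read the inet addr field) and its hostname
ifconfig
cat /etc/hostname

# Create the hadoop group and the hduser account, then grant hduser sudo rights
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo

# Install the SSH server and client if they are missing
sudo apt-get install openssh-server openssh-client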

To automate the deployment of the Spark environment in a cluster setup, you will also have to:

  1. Create a hosts.txt file. Each entry on the list is the IP address of one of the servers followed by two spaces and a hostname. Do not delete the driver: or executors: lines. Also, note that we only allow one driver in our cluster (some clusters support redundant drivers). An example of the content of this file is as follows:
driver:
192.168.17.160 pathfinder
executors:
192.168.17.161 discovery1
192.168.17.162 discovery2
  2. On your local machine, add the IPs and hostnames to your /etc/hosts file so you can access the servers via hostnames instead of IPs (once again, we are assuming you are running a Unix-like system such as macOS or Linux). For example, the following command will add pathfinder to our /etc/hosts file: echo "192.168.17.160  pathfinder" | sudo tee -a /etc/hosts (we pipe through sudo tee -a because a plain sudo echo ... >> /etc/hosts would fail, as the redirection runs without root privileges). Repeat this for all machines in your cluster; the sketch after these steps shows one way to automate this.
  3. Copy the hosts.txt file to each machine in your cluster; we assume the file will be placed in the home directory of the hduser. You can achieve this easily with the scp hosts.txt hduser@<your-server-name>:~ command, where <your-server-name> is the hostname of the machine.
  4. To run the installOnRemote.sh script (see the Chapter01/installOnRemote.sh file) from your local machine, do the following: ssh -tq hduser@<your-server-name> "echo $(base64 -i installOnRemote.sh) | base64 -d | sudo bash". This command base64-encodes the local script, sends it over SSH, and decodes and runs it with sudo on the remote machine. We will go through the installOnRemote.sh script in detail in the next section.
  5. Follow the prompts on the screen to finalize the installation and configuration steps. Repeat steps 4 and 5 for each machine in your cluster.
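
If you prefer not to repeat steps 2 to 5 by hand for every server, you can wrap them in a small loop on your local machine. The following is a minimal sketch that reuses the commands quoted above; it assumes hosts.txt and installOnRemote.sh are in your current directory, and the grep and awk filtering of hosts.txt is our own convenience rather than part of the recipe's script:

#!/bin/bash
# Step 2: append every "IP  hostname" line from hosts.txt to /etc/hosts
# (sudo tee -a is used because a plain redirection would run without root rights)
grep -E '^[0-9]+\.' hosts.txt | sudo tee -a /etc/hosts

# Steps 3 to 5: copy hosts.txt to each server and run the installer there
for host in $(grep -E '^[0-9]+\.' hosts.txt | awk '{print $2}'); do
    scp hosts.txt hduser@"$host":~
    ssh -tq hduser@"$host" "echo $(base64 -i installOnRemote.sh) | base64 -d | sudo bash"
done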