
2. Installation

Raju Kumar Mishra1 and Sundar Rajan Raman2
(1) Bangalore, Karnataka, India
(2) Chennai, Tamil Nadu, India

In the upcoming chapters, we are going to solve many problems using PySpark. PySpark also interacts with many other Big Data frameworks to provide end-to-end solutions. PySpark might read data from HDFS, NoSQL databases, or a relational database management system. After data analysis, we can save the results into HDFS or databases.
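As a small, hedged preview of that workflow (the details come in later chapters), the following sketch shows roughly what reading a file from HDFS and saving a result back might look like in the PySpark shell. The file path and the column name are hypothetical, and the shell's built-in SparkSession object spark is assumed.

# A minimal sketch, run in the PySpark shell where the SparkSession `spark`
# already exists. The HDFS path and the "grade" column are hypothetical.
students = spark.read.csv("/data/students.csv", header=True, inferSchema=True)

# A simple aggregation as a stand-in for "data analysis."
result = students.groupBy("grade").count()

# Save the result back to HDFS as Parquet (hypothetical output directory).
result.write.mode("overwrite").parquet("/results/grade_counts")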

This chapter deals with all the software installations that are required to go through this book. We are going to install all the required Big Data frameworks on the CentOS operating system. CentOS is an enterprise-class operating system. It is free to use and easily available. We can download CentOS from the https://www.centos.org/download/ link and install it on a virtual machine.

In this chapter, we are going to discuss the following recipes:
  • Recipe 2-1. Install Hadoop on a single machine

  • Recipe 2-2. Install Spark on a single machine

  • Recipe 2-3. Use the PySpark shell

  • Recipe 2-4. Install Hive on a single machine

  • Recipe 2-5. Install PostgreSQL

  • Recipe 2-6. Configure the Hive metastore on PostgreSQL

  • Recipe 2-7. Connect PySpark to Hive

  • Recipe 2-8. Install MySQL

  • Recipe 2-9. Install MongoDB

  • Recipe 2-10. Install Cassandra

I suggest that you install every piece of software on your own. It is a good exercise and will give you a deeper understanding of the components of each software package.

Recipe 2-1. Install Hadoop on a Single Machine

Problem

You want to install Hadoop on a single machine.

Solution

You might be wondering, why are we installing Hadoop while we are learning PySpark? Are we going to use Hadoop MapReduce as the distributed framework for our problem solving? The answer is, not at all. We are going to use two components of Hadoop: HDFS for data storage and YARN as the cluster manager. In order to install Hadoop, we need to download and configure it.

How It Works

Follow these steps to complete the Hadoop installation.

Step 2-1-1. Creating a New CentOS User

First, a new user is created. You might be wondering, why a new user? Why can't we install Hadoop under an existing user? The reason is that we want a dedicated user for all the Big Data frameworks. In the following lines of code, we create a user named pysparksqlbook.
[root@localhost book]# adduser pysparksqlbook
[root@localhost book]# passwd pysparksqlbook
Here is the output:
Changing password for user pysparksqlbook.
New password:
passwd: all authentication tokens updated successfully.

In the preceding code, the adduser command has been used to create the user, and the Linux passwd command has been used to set a password for our new user pysparksqlbook.

After creating the user, we have to give it sudo access. Sudo stands for "superuser do"; using sudo, we can run commands as the superuser. We will use sudo to install the software.

Step 2-1-2. Adding the CentOS User to sudo

Then, we have to add our new user pysparksqlbook to the wheel group, which gives it sudo access. The following commands do this.
[root@localhost book]# usermod -aG wheel pysparksqlbook
[root@localhost book]# exit
Then we switch to our new user, pysparksqlbook.
[book@localhost ~]$ su pysparksqlbook
We will create two directories: the binaries directory under the home directory, where we will download software, and the allBigData directory under the root (/) directory, where we will install the Big Data frameworks.
[pysparksqlbook@localhost ~]$ mkdir binaries
[pysparksqlbook@localhost ~]$ sudo mkdir /allBigData

Step 2-1-3. Installing Java

Hadoop, Hive, Spark, and many other Big Data frameworks run on the JVM, which is why we install Java first. We are going to use OpenJDK version 8 for our purposes. We can install Java on CentOS using the yum installer, as follows.
[pysparksqlbook@localhost binaries]$ sudo yum install java-1.8.0-openjdk
Here is the output:
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: centos.excellmedia.net
 * extras: centos.excellmedia.net
        .
        .
        .
        .
Updated:
  java-1.8.0-openjdk.x86_64 1:1.8.0.181-3.b13.el7_5
Dependency Updated:
  java-1.8.0-openjdk-headless.x86_64 1:1.8.0.181-3.b13.el7_5
Complete!

Java has been installed. After installing any software, it is a good idea to check the installation to confirm that everything is fine.

To check the Java installation, I prefer the java -version command, which returns the version of the installed JVM.
[pysparksqlbook@localhost binaries]$ java -version
Here is the output:
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
Next, we need to know the value of the JAVA_HOME environment variable, which all the distributed frameworks are going to use. We can find the Java home directory by using jrunscript, as follows.
[pysparksqlbook@localhost binaries]$ jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));'
Here is the output:
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-3.b13.el7_5.x86_64/jre

We have found the absolute path of JAVA_HOME.

Step 2-1-4. Creating a Password-Less Login for pysparksqlbook

Hadoop's startup scripts use ssh to launch the daemons, even on a single machine, so we need a password-less ssh login to localhost for the pysparksqlbook user. First, generate an RSA key pair.
[pysparksqlbook@localhost binaries]$ ssh-keygen -t rsa
Here is the output:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/pysparksqlbook/.ssh/id_rsa):
Created directory '/home/pysparksqlbook/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/pysparksqlbook/.ssh/id_rsa.
Your public key has been saved in /home/pysparksqlbook/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:DANT7QBm9fDHi1/VRPcytb8/d4PemcOnn0Sm9hzl93A [email protected]
The key's randomart image is:
+---[RSA 2048]----+
|    *++.       .=|
|   o o.+..     ++|
|      ooo o   +.o|
|       +.o . . o.|
|        S . .  oo|
|         . .  +.o|
|          .  o++E|
|            ..=B%|
|            ...X@|
+----[SHA256]-----+
Next, append the public key to the authorized keys file, set its permissions, and test the ssh connection to localhost.
[pysparksqlbook@localhost binaries]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[pysparksqlbook@localhost binaries]$ chmod 755 ~/.ssh/authorized_keys
[pysparksqlbook@localhost binaries]$ ssh localhost
Here is the output:
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:md4M1J6VEYQm3gSynB0gqIYFpesp6I2cRvlEvJOIFFE.
ECDSA key fingerprint is MD5:78:cf:a7:71:2e:38:c2:62:01:65:c2:4c:71:7e:3c:90.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Last login: Thu July 13 17:02:56 2018
Finally, exit the ssh session:
[pysparksqlbook@localhost ~]$ exit
Here is the output:
logout
Connection to localhost closed.

Step 2-1-5. Downloading Hadoop

We are now going to download Hadoop from the Apache website. As mentioned, we will download all the required packages to the binaries directory. We are going to use the wget command to download Hadoop.
[pysparksqlbook@localhost ~]$ cd  binaries
[pysparksqlbook@localhost binaries]$ wget http://mirrors.fibergrid.in/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
Here is the output:
--2018-06-26 12:56:36--  http://mirrors.fibergrid.in/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
Resolving mirrors.fibergrid.in (mirrors.fibergrid.in)... 103.116.36.9, 2402:f4c0::9
Connecting to mirrors.fibergrid.in (mirrors.fibergrid.in)|103.116.36.9|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 218720521 (209M) [application/x-gzip]
Saving to: 'hadoop-2.7.7.tar.gz'
100%[======================================>] 218,720,521 38.8KB/s   in 63m 45s
2018-06-26 14:00:22 (55.8 KB/s) - 'hadoop-2.7.7.tar.gz' saved [218720521/218720521]

Step 2-1-6. Moving Hadoop Binaries to the Installation Directory

Our installation directory is /allBigData. The downloaded file, hadoop-2.7.7.tar.gz, is a compressed archive, so we first have to decompress it. We can decompress it using the tar command as follows:
[pysparksqlbook@localhost binaries]$ tar xvzf hadoop-2.7.7.tar.gz
Now we move Hadoop under the allBigData directory.
[pysparksqlbook@localhost binaries]$ sudo mv hadoop-2.7.7 /allBigData/hadoop

Step 2-1-7. Modifying the Hadoop Environment File

We have to make some changes to the Hadoop environment file. The Hadoop environment file is found in the Hadoop configuration directory. In our case, the Hadoop configuration directory is /allBigData/hadoop/etc/hadoop/. In the following line of code, we add JAVA_HOME to the hadoop-env.sh file.
[pysparksqlbook@localhost binaries]$ vim /allBigData/hadoop/etc/hadoop/hadoop-env.sh
After opening the Hadoop environment file, add the following line.
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-3.b13.el7_5.x86_64/jre

Step 2-1-8. Modifying the Hadoop Properties Files

We will be focusing on three properties files:
  • hdfs-site.xml: HDFS properties

  • core-site.xml: Core properties related to the cluster

  • mapred-site.xml: Properties for the MapReduce Framework

These properties files will be found in the Hadoop configuration directory. In the previous chapter, we discussed HDFS. We found that HDFS has two components—NameNode and DataNode. We also discussed that HDFS does data replication for fault tolerance. In our hdfs-site.xml file, we are going to set the namenode directory using the dfs.name.dir parameter, the datanode directory using the dfs.data.dir parameter, and the replication factor using the dfs.replication parameter.

Let’s modify hdfs-site.xml.
[pysparksqlbook@localhost binaries]$ vim /allBigData/hadoop/etc/hadoop/hdfs-site.xml
After opening hdfs-site.xml, we have to add the following properties inside its <configuration> element.
<property>
<name>dfs.name.dir</name>
<value>file:/allBigData/hdfs/namenode</value>
<description>NameNode location</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:/allBigData/hdfs/datanode</value>
<description>DataNode location</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description> Number of block replication </description>
</property>
After updating hdfs-site.xml, we are going to update core-site.xml. In core-site.xml, we update two properties: fs.default.name and hadoop.tmp.dir. The fs.default.name property specifies the host and port of the default filesystem. The hadoop.tmp.dir property determines the temporary directory for Hadoop. We have to add the following lines to core-site.xml, again inside the <configuration> element.
 <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9745</value>
        <description>Host port of file system</description>
  </property>
  <property>
       <name>hadoop.tmp.dir</name>
       <value>/application/hadoop/tmp</value>
       <description>Temp directory for other working and tmp directories</description>
  </property>
Finally, we are going to modify mapred-site.xml. We set mapreduce.framework.name, which decides which runtime framework is used; the possible values are local, classic, and yarn. The mapred-site.xml file does not exist by default, so we first copy it from its template and then add the following property to it.
[pysparksqlbook@localhost binaries]$ cp /allBigData/hadoop/etc/hadoop/mapred-site.xml.template /allBigData/hadoop/etc/hadoop/mapred-site.xml
[pysparksqlbook@localhost binaries]$ vim /allBigData/hadoop/etc/hadoop/mapred-site.xml
Here is the XML:
 <property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
Let’s create the temporary directory:
[pysparksqlbook@localhost binaries]$ sudo mkdir -p  /application/hadoop/tmp
[pysparksqlbook@localhost binaries]$ sudo chown pysparksqlbook:pysparksqlbook -R /application/hadoop

Step 2-1-9. Updating the .bashrc File

The following lines have to be added to the .bashrc file, which is in the home directory. Open .bashrc as follows.
[pysparksqlbook@localhost binaries]$ vim  ~/.bashrc
Then add the following lines at the end of the .bashrc file.
export HADOOP_HOME=/allBigData/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin
export PATH=$PATH:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181-3.b13.el7_5.x86_64/jre
export PATH=$PATH:$JAVA_HOME/bin
Then we have to source the .bashrc file so that the updated values take effect in the current session.
[pysparksqlbook@localhost binaries]$ source ~/.bashrc

Step 2-1-10. Running the Namenode Format

We have updated the property files. Now we have to format the NameNode, which initializes the HDFS metadata directories we configured. The following command formats the NameNode.
[pysparksqlbook@localhost binaries]$ hdfs namenode -format
Here is the output:
18/06/13 18:14:49 INFO namenode.FSImageFormatProtobuf: Image file /allBigData/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 331 bytes saved in 0 seconds.
18/06/13 18:14:50 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/06/13 18:14:50 INFO util.ExitUtil: Exiting with status 0
18/06/13 18:14:50 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.0.1
************************************************************

Step 2-1-11. Starting Hadoop

Hadoop has been installed. We have to start Hadoop now. The startup scripts are in /allBigData/hadoop/sbin/. We have to run the start-dfs.sh and start-yarn.sh scripts, in sequence.
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/start-dfs.sh
Here is the output:
Starting namenodes on [localhost]
localhost: starting namenode, logging to /allBigData/hadoop/logs/hadoop-pysparksqlbook-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /allBigData/hadoop/logs/hadoop-pysparksqlbook-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:md4M1J6VEYQm3gSynB0gqIYFpesp6I2cRvlEvJOIFFE.
ECDSA key fingerprint is MD5:78:cf:a7:71:2e:38:c2:62:01:65:c2:4c:71:7e:3c:90.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /allBigData/hadoop/logs/hadoop-pysparksqlbook-secondarynamenode-localhost.localdomain.out
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/start-yarn.sh
Here is the output:
starting yarn daemons
starting resourcemanager, logging to /allBigData/hadoop/logs/yarn-pysparksqlbook-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /allBigData/hadoop/logs/yarn-pysparksqlbook-nodemanager-localhost.localdomain.out

Step 2-1-12. Checking the Installation of Hadoop

We know that the jps command shows all the Java processes running on the machine. If everything is fine, it will list the Hadoop daemons as follows:
[pysparksqlbook@localhost binaries]$ jps
Here is the output:
13441 NodeManager
13250 ResourceManager
12054 DataNode
14054 Jps
12423 SecondaryNameNode
11898 NameNode

Congratulations to us. We have finally installed Hadoop on our system.

Step 2-1-13. Stopping the Hadoop Processes

Just as we started Hadoop with the start-dfs.sh and start-yarn.sh scripts, we can stop it with the stop-dfs.sh and stop-yarn.sh scripts, in sequence.
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/stop-dfs.sh
Here is the output:
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/stop-yarn.sh
Here is the output:
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager

Recipe 2-2. Install Spark on a Single Machine

Problem

You want to install Spark on a single machine.

Solution

We are going to install prebuilt Spark 2.3.0 for Hadoop version 2.7. We could build Spark from the source code, but we are going to use the prebuilt Apache Spark distribution instead.

How It Works

Follow these steps to complete the installation.

Step 2-2-1. Downloading Apache Spark

We are going to download Spark from the Apache archive using the wget command, as follows.
[pysparksqlbook@localhost binaries]$ wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
Here is the output:
--2018-06-26 12:48:38--  https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
Resolving archive.apache.org (archive.apache.org)... 163.172.17.199
Connecting to archive.apache.org (archive.apache.org)|163.172.17.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 226128401 (216M) [application/x-gzip]
Saving to: 'spark-2.3.0-bin-hadoop2.7.tgz'
100%[======================================>] 226,128,401  392KB/s   in 5m 34s
2018-06-26 12:54:13 (662 KB/s) - 'spark-2.3.0-bin-hadoop2.7.tgz' saved [226128401/226128401]

Step 2-2-2. Extracting the .tgz File of Spark

The following command will extract the .tgz file.
[pysparksqlbook@localhost binaries]$ tar xvzf spark-2.3.0-bin-hadoop2.7.tgz

Step 2-2-3. Moving the Extracted Spark Directory to /allBigData

Now we have to move the extracted Spark directory to the /allBigData location. The following command will do this.
[pysparksqlbook@localhost binaries]$ sudo mv  spark-2.3.0-bin-hadoop2.7   /allBigData/spark

Step 2-2-4. Changing the Spark Environment File

The Spark environment file contains all the environment variables required to run Spark. We are going to set the following environment variables in it:
  • HADOOP_CONF_DIR: Configuration directory of Hadoop.

  • SPARK_CONF_DIR: Alternate conf directory (Default: ${SPARK_HOME}/conf)

  • SPARK_LOG_DIR: Where log files are stored (Default: ${SPARK_HOME}/log)

  • SPARK_WORKER_DIR: To set the working directory of any worker processes

  • HIVE_CONF_DIR: To read data from Hive

First we have to copy the spark-env.sh.template file to spark-env.sh. The Spark environment file, spark-env.sh, is found inside the spark/conf directory (the Spark configuration directory):
[pysparksqlbook@localhost binaries]$ cp /allBigData/spark/conf/spark-env.sh.template /allBigData/spark/conf/spark-env.sh
Now let's open the spark-env.sh file.
[pysparksqlbook@localhost binaries]$ vim /allBigData/spark/conf/spark-env.sh
Now we append the following lines to the end of spark-env.sh:
export HADOOP_CONF_DIR=/allBigData/hadoop/etc/hadoop/
export SPARK_LOG_DIR=/allBigData/logSpark/
export SPARK_WORKER_DIR=/tmp/spark
export HIVE_CONF_DIR=/allBigData/hive/conf

Step 2-2-5. Amending the .bashrc File

In the .bashrc file, we have to add the Spark bin directory to the PATH. We can use the following commands to do this.
[pysparksqlbook@localhost binaries]$ vim  ~/.bashrc
Add the following lines to the .bashrc file.
export SPARK_HOME=/allBigData/spark
export PATH=$PATH:$SPARK_HOME/bin
After this, source the .bashrc file.
[pysparksqlbook@localhost binaries]$ source  ~/.bashrc

Step 2-2-6. Starting the PySpark Shell

We can start the PySpark shell using the pyspark script. We discuss the PySpark shell in more detail in the next recipe.
[pysparksqlbook@localhost binaries]$ pyspark

We have one more successful installation under our belt, but more installations are required to move through this book. Before all that, it is better to concentrate on the PySpark shell.

Recipe 2-3. Use the PySpark Shell

Problem

You want to use the PySpark shell.

Solution

The PySpark shell is an interactive shell for working with PySpark using Python. It can be started using the pyspark script, which can be found in spark/bin.

How It Works

The PySpark shell can be started as follows.
[pysparksqlbook@localhost binaries]$ pyspark
After starting, it will show the screen in Figure 2-1.
Figure 2-1. Startup console screen in PySpark

We can observe that, after starting, PySpark displays a lot of information, including the Python and Spark versions it is using.

The >>> symbol is Python's command prompt; we see it whenever we start the Python shell, and it tells us that we can now type Python commands. Similarly, in PySpark it tells us that we can type Python or PySpark commands and see the results.
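For a quick sanity check, we can type a couple of commands at the prompt. Nothing here depends on our data; Spark 2.3.0 should print output similar to the following.

>>> print(spark.version)
2.3.0
>>> spark.range(5).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+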

The PySpark shell works in a similar fashion on a single machine installation and a cluster installation of PySpark.

Recipe 2-4. Install Hive on a Single Machine

Problem

You want to install Hive on a single machine.

Solution

We discussed Hive in the first chapter. Now it is time to install Hive on our machine. In the coming chapters, we are going to read data from Hive into PySparkSQL.

How It Works

Follow these steps to complete the Hive installation.

Step 2-4-1. Downloading Hive

We can download Hive from the Apache Hive website. We can download the Hive tar.gz file using the wget command, as follows.
[pysparksqlbook@localhost binaries]$   wget http://mirrors.fibergrid.in/apache/hive/stable-2/apache-hive-2.3.3-bin.tar.gz
Here is the output:
--2018-06-26 18:24:09--  http://mirrors.fibergrid.in/apache/hive/stable-2/apache-hive-2.3.3-bin.tar.gz
Resolving mirrors.fibergrid.in (mirrors.fibergrid.in)... 103.116.36.9, 2402:f4c0::9
Connecting to mirrors.fibergrid.in (mirrors.fibergrid.in)|103.116.36.9|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 232229830 (221M) [application/x-gzip]
Saving to: 'apache-hive-2.3.3-bin.tar.gz'
100%[======================================>] 232,229,830 1.04MB/s   in 4m 29s
2018-06-26 18:28:39 (842 KB/s) - 'apache-hive-2.3.3-bin.tar.gz' saved [232229830/232229830]

Step 2-4-2. Extracting Hive

We have downloaded the apache-hive-2.3.3-bin.tar.gz file. It is a compressed archive, so we have to extract it using the tar command as follows.
[pysparksqlbook@localhost binaries]$ tar xvzf   apache-hive-2.3.3-bin.tar.gz

Step 2-4-3. Moving the Extracted Hive Directory

[pysparksqlbook@localhost binaries]$ sudo mv apache-hive-2.3.3-bin /allBigData/hive

Step 2-4-4. Updating hive-site.xml

Hive ships with an embedded Derby database for its metastore. By default, Derby creates the metastore in whatever directory Hive is launched from, so it is better to provide a fixed location for it. We provide that location in hive-site.xml. For that, we have to rename hive-default.xml.template to hive-site.xml.
[pysparksqlbook@localhost binaries]$ mv /allBigData/hive/conf/hive-default.xml.template /allBigData/hive/conf/hive-site.xml
Then open hive-site.xml and update the following:
[pysparksqlbook@localhost binaries]$ vim /allBigData/hive/conf/hive-site.xml
Either change the existing javax.jdo.option.ConnectionURL property in hive-site.xml or add the following property inside its <configuration> element.
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/allBigData/hive/metastore/metastore_db;create=true</value>
</property>
After that, we have to provide HADOOP_HOME in the hive-env.sh file. The following commands show how to do this.
[pysparksqlbook@localhost binaries]$ mv /allBigData/hive/conf/hive-env.sh.template /allBigData/hive/conf/hive-env.sh
[pysparksqlbook@localhost binaries]$ vim  /allBigData/hive/conf/hive-env.sh
And in hive-env.sh, add the following line.
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/allBigData/hadoop

Step 2-4-5. Updating the .bashrc File

Open the .bashrc file . This file stays in the home directory.
[pysparksqlbook@localhost binaries]$ vim  ~/.bashrc
Add the following lines to the .bashrc file:
####################Hive Parameters ######################
export HIVE_HOME=/allBigData/hive
export PATH=$PATH:$HIVE_HOME/bin
Now source the .bashrc file using the following command.
[pysparksqlbook@localhost binaries]$ source ~/.bashrc

Step 2-4-6. Creating the Hive Data Warehouse Directories

Now we have to create the data warehouse directories in HDFS. Hive places its table data files in the warehouse directory:
[pysparksqlbook@localhost binaries]$ hadoop fs -mkdir -p /user/hive/warehouse
[pysparksqlbook@localhost binaries]$ hadoop fs -mkdir -p /tmp
[pysparksqlbook@localhost binaries]$ hadoop fs -chmod g+w /user/hive/warehouse
[pysparksqlbook@localhost binaries]$ hadoop fs -chmod g+w /tmp

The /user/hive/warehouse directory is the Hive warehouse directory.

Step 2-4-7. Initializing the Metastore Database

Sometimes it is necessary to initialize the metastore schema. You might be thinking, the schema of what? We know that Hive stores table metadata in a relational database. For the time being, we are going to use the embedded Derby database as Hive's metastore; in a coming recipe, we will connect Hive to an external PostgreSQL database. On Ubuntu, the Hive installation works without this command, but on CentOS I found it indispensable; without it, Hive throws errors.
[pysparksqlbook@localhost  binaries]$ schematool -initSchema -dbType derby

Step 2-4-8. Checking the Hive Installation

Now Hive has been installed. We should check that it works. We can start the Hive shell using the following command.
[pysparksqlbook@localhost binaries]$ hive
After this command, we will find that the Hive shell has been opened as follows:
hive>

Recipe 2-5. Install PostgreSQL

Problem

You want to install PostgreSQL.

Solution

PostgreSQL is a relational database management system that was developed at the University of California, Berkeley. It is released under the PostgreSQL License, which permits use, modification, and distribution. PostgreSQL runs on macOS and UNIX-like systems such as Red Hat, Ubuntu, etc. We are going to install it on CentOS.

We are going to use our PostgreSQL in two ways. We will use PostgreSQL as a metastore database for Hive. After having an external database as the metastore, we will be able to read data from the existing Hive easily. The second use of this RDBMS installation is to read data from PostgreSQL, and after analysis, we will save our result to PostgreSQL.

We could install PostgreSQL from source code, but we are going to install it using the command-line yum installer.

How It Works

Follow these steps to complete the PostgreSQL installation.

Step 2-5-1. Installing PostgreSQL

PostgreSQL can be installed using the yum installer. The following code will install PostgreSQL.
[pysparksqlbook@localhost binaries]$ sudo yum install postgresql-server postgresql-contrib
[sudo] password for pysparksqlbook:

Step 2-5-2. Initializing the Database

After installing PostgreSQL, we have to initialize the database using a utility called initdb. If we don't initialize the database, we cannot use it. At the time of initialization, we can also specify the data directory of the database. The database can be initialized using the following command.
[pysparksqlbook@localhost binaries]$ sudo postgresql-setup initdb
Here is the output:
[sudo] password for pysparksqlbook:
Initializing database ... OK

Step 2-5-3. Enabling and Starting the Database

[pysparksqlbook@localhost binaries]$ sudo systemctl enable postgresql
[pysparksqlbook@localhost binaries]$ sudo systemctl start postgresql
[pysparksqlbook@localhost binaries]$ sudo -i -u postgres
Here is the output:
[sudo] password for pysparksqlbook:
-bash-4.2$ psql
psql (9.2.24)
Type "help" for help.
postgres=#

Note

We can get the installation procedure at the following site: https://wiki.postgresql.org/wiki/YUM_Installation .
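Looking ahead to the second use of PostgreSQL mentioned in the solution, here is a hedged sketch of saving a PySparkSQL result into PostgreSQL over JDBC. It assumes a target database named pysparksqldb and a user with write access already exist, and that the PostgreSQL JDBC driver jar (downloaded in Recipe 2-6) is passed to pyspark with --jars; all names and credentials below are placeholders.

# Hedged sketch, run in the PySpark shell. Database, table, user, and password
# are hypothetical. Start the shell with something like:
#   pyspark --jars /home/pysparksqlbook/binaries/postgresql-42.2.5.jre6.jar
books = spark.createDataFrame(
    [(1, "PySpark SQL Recipes"), (2, "PySpark Recipes")],
    ["id", "title"])

books.write.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost/pysparksqldb") \
    .option("dbtable", "books") \
    .option("user", "pysparksqlbookuser") \
    .option("password", "pbook") \
    .option("driver", "org.postgresql.Driver") \
    .mode("append") \
    .save()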

Recipe 2-6. Configure the Hive Metastore on PostgreSQL

Problem

You want to configure Hive metastore on PostgreSQL.

Solution

As we know, Hive stores the metadata of its tables in a relational database. We have already installed Hive with its embedded metastore, which uses the Derby relational database. In the coming chapters, we have to read existing Hive tables from PySpark, so we are going to move the metastore to an external PostgreSQL database.

Configuration of a Hive metastore on PostgreSQL requires us to populate tables in the PostgreSQL database. These tables will hold metadata of Hive tables. After this, we have to configure the Hive property file.

How It Works

In the following steps, we are going to configure a Hive metastore on the PostgreSQL database. Then our Hive will have metadata in PostgreSQL.

Step 2-6-1. Downloading the PostgreSQL JDBC Connector

We need the JDBC connector so that the Hive process can connect to an external PostgreSQL. We can get the JDBC connector using the following command.
[pysparksqlbook@localhost binaries]$ wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jre6.jar

Step 2-6-2. Copying the JDBC Connector to the Hive lib Directory

After getting the JDBC connector, we have to put it in the Hive lib directory .
[pysparksqlbook@localhost binaries]$ cp postgresql-42.2.5.jre6.jar  /allBigData/hive/lib/

Step 2-6-3. Connecting to PostgreSQL

[pysparksqlbook@localhost binaries]$ sudo -u postgres psql

Step 2-6-4. Creating the Required User and Database

In this step, we are going to create a PostgreSQL user named pysparksqlbookUser. Then we are going to create a database named pymetastore, which is going to hold all the tables related to the Hive metastore. (Note that PostgreSQL folds unquoted identifiers to lowercase, so the user is actually stored as pysparksqlbookuser.)
postgres=# CREATE USER pysparksqlbookUser WITH PASSWORD 'pbook';
Here is the output:
CREATE ROLE
postgres=# CREATE DATABASE pymetastore;
Here is the output:
CREATE DATABASE
The \c command in psql stands for connect. We created our database named pymetastore; now we are going to connect to it using the \c command.
postgres=# \c pymetastore

We are now connected to the pymetastore database. More psql commands are described at the following link.

https://www.postgresql.org/docs/9.2/static/app-psql.html

Step 2-6-5. Populating Data in the pymetastore Database

Hive ships with PostgreSQL scripts that populate the tables needed for the metastore. The \i command in psql reads commands from a script file and executes them. In the following command, we run the hive-txn-schema-2.3.0.postgres.sql script, which creates all the tables required for the Hive metastore.
pymetastore=# \i /allBigData/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.3.0.postgres.sql
Here is the output:
psql:/allBigData/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.3.0.postgres.sql:30: NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "txns_pkey" for table "txns"
CREATE TABLE
CREATE TABLE
INSERT 0 1
psql:/allBigData/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.3.0.postgres.sql:69: NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "hive_locks_pkey" for table "hive_locks"
CREATE TABLE

Step 2-6-6. Granting Permissions

The following commands will grant some permissions.
pymetastore=# grant select, insert,update,delete on public.txns to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.txn_components to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.completed_txn_components   to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.next_txn_id to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.hive_locks to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.next_lock_id to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.compaction_queue to pysparksqlbookUser;
Here is the output
GRANT
pymetastore=# grant select, insert,update,delete on public.next_compaction_queue_id to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.completed_compactions to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.aux_table to pysparksqlbookUser;
Here is the output:
GRANT

Step 2-6-7. Changing the pg_hba.conf File

Remember that, in order to update pg_hba.conf, you must be the root user. So first switch to the root user and then open the pg_hba.conf file.
[root@localhost binaries]# vim /var/lib/pgsql/data/pg_hba.conf
Then change all the peer and ident settings to trust.
#local   all             all                                     peer
 local   all             all                                     trust
# IPv4 local connections:
#host    all             all             127.0.0.1/32            ident
 host    all             all             127.0.0.1/32            trust
# IPv6 local connections:
#host    all             all             ::1/128                 ident
 host    all             all             ::1/128                 trust

More about this change can be found at http://stackoverflow.com/questions/2942485/psql-fatal-ident-authentication-failed-for-user-postgres .

Come out of the root user.

Step 2-6-8. Testing Our User

It is a good idea to test that we can connect to our database as the user we just created.
[pysparksqlbook@localhost binaries]$ psql -h localhost -U pysparksqlbookuser -d pymetastore
Here is the output:
psql (9.2.24)
Type "help" for help.
pymetastore=>

Step 2-6-9. Modifying the hive-site.xml File

We can modify Hive-related configuration in its configuration file, hive-site.xml. We have to modify the following properties:
  • javax.jdo.option.ConnectionURL: Connecting URL to database

  • javax.jdo.option.ConnectionDriverName: Connection JDBC driver name

  • javax.jdo.option.ConnectionUserName: Database connection user

  • javax.jdo.option.ConnectionPassword: Connection password

We can either modify these properties where they appear or add the following lines inside the <configuration> element of the Hive property file to get the required result.
<property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:postgresql://localhost/pymetastore</value>
      <description>postgreSQL server metadata store</description>
 </property>
 <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.postgresql.Driver</value>
      <description>Driver class of postgreSQL</description>
 </property>
  <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>pysparksqlbookuser</value>
      <description>User name to connect to postgreSQL</description>
 </property>
 <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>pbook</value>
      <description>password for connecting to PostgreSQL server</description>
 </property>

Step 2-6-10. Starting Hive

We have connected Hive to an external relational database management system. So it is time to start Hive and ensure that everything is fine.
[pysparksqlbook@localhost binaries]$ hive
Our activities will be reflected in PostgreSQL. Let’s create a database and a table inside that database. The following commands create a database named apress and a table called apressBooks inside that database.
hive> create database apress;
Here is the output:
OK
Time taken: 1.397 seconds
hive> use apress;
Here is the output:
OK
Time taken: 0.07 seconds
hive> create table apressBooks (
    >      bookName String,
    >      bookWriter String
    >      )
    >      row format delimited
    >      fields terminated by ',';
Here is the output:
OK
Time taken: 0.581 seconds

Step 2-6-11. Testing if Metadata Is Created in PostgreSQL

The database and table we created will be reflected in PostgreSQL. We can see the updated data in the TBLS table, as follows.
pymetastore=> SELECT * from "TBLS";
 TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER          | RETENTION | SD_ID | TBL_NAME    | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT
--------+-------------+-------+------------------+----------------+-----------+-------+-------------+---------------+--------------------+--------------------
      1 |  1482892229 |     6 |                0 | pysparksqlbook |         0 |     1 | apressbooks | MANAGED_TABLE |                    |
(1 row)

The appreciable work of connecting Hive to an external database is done. In the following recipe, we are going to connect PySpark to Hive.

Recipe 2-7. Connect PySpark to Hive

Problem

You want to connect PySpark to Hive.

Solution

PySpark needs the Hive property file to know Hive's configuration parameters. The Hive property file, hive-site.xml, stays in the Hive conf directory. We simply copy it to the Spark conf directory, and we are done. Then we can start PySpark.

How It Works

Connecting PySpark to Hive takes two steps.

Step 2-7-1. Copying the Hive Property File to the Spark conf Directory

[pysparksqlbook@localhost binaries]$ cp /allBigData/hive/conf/hive-site.xml /allBigData/spark/conf/

Step 2-7-2. Starting PySpark

[pysparksqlbook@localhost binaries]$ pyspark
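To confirm that PySpark really sees the Hive metastore, we can run a few queries from the shell. This is a hedged check: it assumes the apress database and apressbooks table from Recipe 2-6 still exist, and that Spark can reach the PostgreSQL metastore (you may need to pass the PostgreSQL JDBC jar to pyspark with --jars).

# Run inside the PySpark shell; `spark` is the SparkSession the shell creates.
spark.sql("show databases").show()                     # should list 'apress' among others
spark.sql("show tables in apress").show()              # should list 'apressbooks'
spark.sql("select * from apress.apressbooks").show()   # table is empty, so the result is empty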

Recipe 2-8. Install MySQL

Problem

You want to install MySQL Server.

Solution

We can read data from MySQL using PySparkSQL, and we can also save the output of our analysis into a MySQL database. We can install MySQL Server using the yum installer.

How It Works

Follow these steps to complete the MySQL installation.

Step 2-8-1. Installing the MySQL Server

The following command will install the MySQL Server.
[pysparksqlbook@localhost binaries]$ sudo yum install mysql-server
Here is the output:
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: centos.excellmedia.net
 * extras: centos.excellmedia.net
Dependency Installed:
  mysql-community-common.x86_64 0:5.6.41-2.el7
  perl-Compress-Raw-Bzip2.x86_64 0:2.061-3.el7
  perl-Compress-Raw-Zlib.x86_64 1:2.061-4.el7
  perl-DBI.x86_64 0:1.627-4.el7
  perl-Data-Dumper.x86_64 0:2.145-3.el7
  perl-IO-Compress.noarch 0:2.061-2.el7
  perl-Net-Daemon.noarch 0:0.48-5.el7
  perl-PlRPC.noarch 0:0.2020-14.el7
Replaced:
  mariadb.x86_64 1:5.5.60-1.el7_5      mariadb-libs.x86_64 1:5.5.60-1.el7_5
Complete!

Finally, we have installed MySQL.
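As a hedged preview of how PySparkSQL reads from MySQL (covered in later chapters): assuming the MySQL server is running, a database named pysparksqldb contains a table named books, and the MySQL Connector/J jar is passed to pyspark with --jars, the read looks roughly like this. The jar path, schema, and credentials are placeholders.

# Hedged sketch, run in the PySpark shell. Start it with something like:
#   pyspark --jars /home/pysparksqlbook/binaries/mysql-connector-java.jar
booksDf = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/pysparksqldb") \
    .option("dbtable", "books") \
    .option("user", "pysparksqlbook") \
    .option("password", "pbook") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .load()
booksDf.show()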

Recipe 2-9. Install MongoDB

Problem

You want to install MongoDB.

Solution

MongoDB is a NoSQL database. It can be installed using the yum installer.

How It Works

Follow these steps to complete the MongoDB installation.

Step 2-9-1. Installing MongoDB

[pysparksqlbook@localhost book]$ sudo vim /etc/yum.repos.d/mongodb-org-4.0.repo
Inside this mongodb-org-4.0.repo file, copy the following:
[mongodb-org-4.0]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/4.0/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-4.0.asc

Note

You can get details about the MongoDB installation on CentOS from the following link: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-red-hat/.

[pysparksqlbook@localhost book]$ sudo yum install -y mongodb-org-4.0.0 mongodb-org-server-4.0.0 mongodb-org-shell-4.0.0 mongodb-org-mongos-4.0.0 mongodb-org-tools-4.0.0
Here is the output:
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: centos.excellmedia.net
 * extras: centos.excellmedia.net
 * updates: centos.excellmedia.net
Installed:
  mongodb-org.x86_64 0:4.0.0-1.el7
  mongodb-org-mongos.x86_64 0:4.0.0-1.el7
  mongodb-org-server.x86_64 0:4.0.0-1.el7
  mongodb-org-shell.x86_64 0:4.0.0-1.el7
  mongodb-org-tools.x86_64 0:4.0.0-1.el7
Complete!

Step 2-9-2. Creating a Data Directory

We are going to create a data directory for MongoDB.
[pysparksqlbook@localhost book]$ sudo mkdir -p /data/db
[pysparksqlbook@localhost book]$ sudo chown pysparksqlbook:pysparksqlbook -R /data

Step 2-9-3. Starting the MongoDB Server

The MongoDB server will be started using the mongod command.
[pysparksqlbook@localhost binaries]$ mongod
Here is the output:
2018-08-26T17:59:42.029-0400 I CONTROL  [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
2018-08-26T17:59:42.051-0400 I CONTROL  [initandlisten] MongoDB starting : pid=22570 port=27017 dbpath=/data/db 64-bit host=localhost.localdomain
2018-08-26T17:59:42.051-0400 I CONTROL  [initandlisten] db version v4.0.0
2018-08-26T18:06:19.459-0400 I NETWORK  [conn1] end connection 127.0.0.1:59690 (0 connections now open)
Note that the MongoDB server is listening on port 27017. Now we can start the MongoDB shell, called mongo.
[pysparksqlbook@localhost book]$ mongo
Here is the output:
MongoDB shell version v4.0.0
connecting to: mongodb://127.0.0.1:27017
To enable free monitoring, run the following command:
db.enableFreeMonitoring()
>
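Looking ahead, PySparkSQL reads MongoDB collections through the MongoDB Spark connector. The following is only a sketch under assumptions: the connector package version, the pysparksqldb database, and the books collection are all placeholders.

# Hedged sketch, run in the PySpark shell started with the connector package,
# for example:
#   pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0
booksDf = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://127.0.0.1/pysparksqldb.books") \
    .load()
booksDf.show()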

Recipe 2-10. Install Cassandra

Problem

You want to install Cassandra.

Solution

Cassandra is a NoSQL database. It can be installed using the yum installer.

How It Works

Follow these steps to complete the Cassandra installation:
[pysparksqlbook@localhost ~]$ sudo vim /etc/yum.repos.d/cassandra.repo
[sudo] password for pysparksqlbook:
Copy the following lines into the cassandra.repo file:
[cassandra]
name=Apache Cassandra
baseurl=https://www.apache.org/dist/cassandra/redhat/311x/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://www.apache.org/dist/cassandra/KEYS
[pysparksqlbook@localhost ~]$ sudo yum -y install cassandra
Here is the output:
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
Running transaction
  Installing : cassandra-3.11.3-1.noarch        1/1
  Verifying  : cassandra-3.11.3-1.noarch        1/1
Installed:
  cassandra.noarch 0:3.11.3-1
Complete!
Now we start the server:
[pysparksqlbook@localhost ~]$ sudo systemctl daemon-reload
[pysparksqlbook@localhost ~]$ sudo systemctl start cassandra
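With the server running, PySparkSQL can read Cassandra tables through the DataStax Spark Cassandra connector (used later in the book). The sketch below is hedged: the connector version, keyspace, and table name are assumptions.

# Hedged sketch, run in the PySpark shell started with the connector package,
# for example:
#   pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0
# The connector connects to the local Cassandra node by default
# (spark.cassandra.connection.host).
booksDf = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace="pysparksqldb", table="books") \
    .load()
booksDf.show()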