In the upcoming chapters, we are going to solve many problems using PySpark. PySpark also interacts with many other Big Data frameworks to provide end-to-end solutions. PySpark might read data from HDFS, NoSQL databases, or a relational database management system. After data analysis, we can save the results into HDFS or databases.
This chapter deals with all the software installations that are required to work through this book. We are going to install all the required Big Data frameworks on the CentOS operating system. CentOS is an enterprise-class operating system; it is free to use and easily available. We can download CentOS from https://www.centos.org/download/ and install it on a virtual machine.
In this chapter, we are going to discuss the following recipes:
Recipe 2-1. Install Hadoop on a single machine
Recipe 2-2. Install Spark on a single machine
Recipe 2-3. Use the PySpark shell
Recipe 2-4. Install Hive on a single machine
Recipe 2-5. Install PostgreSQL
Recipe 2-6. Configure the Hive metastore on PostgreSQL
Recipe 2-7. Connect PySpark to Hive
Recipe 2-8. Install MySQL
Recipe 2-9. Install MongoDB
Recipe 2-10. Install Cassandra
I suggest that you install every piece of software on your own. It is a good exercise and will give you a deeper understanding of the components of each software package.
Recipe 2-1. Install Hadoop on a Single Machine
Problem
You want to install Hadoop on a single machine.
Solution
You might be wondering, why are we installing Hadoop while we are learning PySpark? Are we going to use Hadoop MapReduce as the distributed framework for our problem solving? Not at all. We are going to use two components of Hadoop—HDFS and YARN: HDFS for data storage and YARN as the cluster manager. In order to install Hadoop, we need to download and configure it.
How It Works
Follow these steps to complete the Hadoop installation.
Step 2-1-1. Creating a New CentOS User
First, we create a new user. You might be thinking, why a new user? Why can’t we install Hadoop under an existing user? The reason is that we want a dedicated user for all the Big Data frameworks. In the following lines of code, we create a user named pysparksqlbook.
[root@localhost book]# adduser pysparksqlbook
[root@localhost book]# passwd pysparksqlbook
Here is the output:
Changing password for user pysparksqlbook.
New password:
passwd: all authentication tokens updated successfully.
In the above code, the adduser command creates (adds) the user, and the Linux passwd command assigns a password to our new user, pysparksqlbook.
After creating the user, we have to add it to sudoers. Sudo stands for “superuser do”: using sudo, we can run any command as the superuser. We will use sudo to install the software.
Step 2-1-2. Adding the CentOS User to sudo
Then, we have to add our new user pysparksqlbook to sudoers. The following command will do this.
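One common way to do this on CentOS, assuming the wheel group is enabled for sudo in /etc/sudoers (the default on CentOS 7), is to add the user to that group:
[root@localhost book]# usermod -aG wheel pysparksqlbook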
We will create two directories—the binaries directory under the home directory to download software and the allBigData directory under the root / directory to install the Big Data frameworks.
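A minimal sketch of creating these directories follows; the chown line is an assumption so that our pysparksqlbook user can write to /allBigData.
[pysparksqlbook@localhost ~]$ mkdir /home/pysparksqlbook/binaries
[pysparksqlbook@localhost ~]$ sudo mkdir /allBigData
[pysparksqlbook@localhost ~]$ sudo chown pysparksqlbook:pysparksqlbook /allBigData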
Hadoop, Hive, Spark, and many other Big Data frameworks run on the JVM. That is why we are first going to install Java. We are going to use OpenJDK 8 for our purposes. We can install Java on CentOS using the yum installer. The following command installs Java using the yum installer.
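A typical command, assuming the OpenJDK 8 packages from the standard CentOS repositories, looks like this (the -devel package is included so that JDK tools such as jrunscript are available):
[pysparksqlbook@localhost binaries]$ sudo yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel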
Java has been installed. After installing any software, it is a good idea to check the installation to confirm that everything is fine.
In order to check the Java installation, I prefer the java -version command, which will return the version of JVM installed.
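We can run it from any directory:
[pysparksqlbook@localhost binaries]$ java -version
The last line of the output looks like this: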
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
Java has been installed. We have to look for the environment variable JAVA_HOME, which is going to be used by all the distributed frameworks. After installing Java, we can find the Java home variable by using jrunscript, as follows.
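A hedged one-liner using jrunscript (the exact path printed will depend on your system) is:
[pysparksqlbook@localhost binaries]$ jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));'
Hadoop’s startup scripts use SSH to launch daemons on localhost, so we also make sure that we can connect to localhost over SSH. Connecting for the first time produces output like the following.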
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:md4M1J6VEYQm3gSynB0gqIYFpesp6I2cRvlEvJOIFFE.
ECDSA key fingerprint is MD5:78:cf:a7:71:2e:38:c2:62:01:65:c2:4c:71:7e:3c:90.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Last login: Thu July 13 17:02:56 2018
Finally:
[pysparksqlbook@localhost ~]$ exit
Here is the output:
logout
Connection to localhost closed.
Step 2-1-5. Downloading Hadoop
We are now going to download Hadoop from the Apache website. As mentioned, we will download all the required packages to the binaries directory. We are going to use the wget command to download Hadoop.
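Assuming the Apache archive mirror, the command looks like this:
[pysparksqlbook@localhost binaries]$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz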
Step 2-1-6. Moving Hadoop Binaries to the Installation Directory
Our installation directory is called allBigData. The downloaded software is in hadoop-2.7.7.tar.gz, which is a compressed archive, so we first have to decompress it. We can decompress it using the tar command as follows:
[pysparksqlbook@localhost binaries]$ tar xvzf hadoop-2.7.7.tar.gz
Now we move Hadoop under the allBigData directory.
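A sketch of the move; the target directory name hadoop is an assumption consistent with the configuration paths used below:
[pysparksqlbook@localhost binaries]$ sudo mv hadoop-2.7.7 /allBigData/hadoop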
We have to make some changes to the Hadoop environment file. The Hadoop environment file is found in the Hadoop configuration directory. In our case, the Hadoop configuration directory is /allBigData/hadoop/etc/hadoop/. In the following line of code, we add JAVA_HOME to the hadoop-env.sh file.
[pysparksqlbook@localhost binaries]$ vim /allBigData/hadoop/etc/hadoop/hadoop-env.sh
After opening the Hadoop environment file, add the following line.
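A hedged example, assuming OpenJDK 8 was installed by yum into its default location (adjust the path to whatever jrunscript reported on your machine):
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk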
Next, we have to configure the following property files:
core-site.xml: Core properties related to the cluster
hdfs-site.xml: Properties of HDFS (NameNode, DataNode, and replication)
mapred-site.xml: Properties for the MapReduce framework
These properties files will be found in the Hadoop configuration directory. In the previous chapter, we discussed HDFS. We found that HDFS has two components—NameNode and DataNode. We also discussed that HDFS does data replication for fault tolerance. In our hdfs-site.xml file, we are going to set the namenode directory using the dfs.name.dir parameter, the datanode directory using the dfs.data.dir parameter, and the replication factor using the dfs.replication parameter.
Let’s modify hdfs-site.xml.
[pysparksqlbook@localhost binaries]$ vim /allBigData/hadoop/etc/hadoop/hdfs-site.xml
After opening hdfs-site.xml, we have to add the following lines to that file.
<property>
<name>dfs.name.dir</name>
<value>file:/allBigData/hdfs/namenode</value>
<description>NameNode location</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:/allBigData/hdfs/datanode</value>
<description>DataNode location</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description> Number of block replication </description>
</property>
After updating hdfs-site.xml, we are going to update core-site.xml. In core-site.xml, we are going to update two properties—fs.default.name and hadoop.tmp.dir. The fs.default.name property is used to determine the host, port, etc. of the filesystem. The hadoop.tmp.dir property determines the temporary directories for Hadoop. We have to add the following lines to core-site.xml.
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9745</value>
<description>Host port of file system</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/application/hadoop/tmp</value>
<description>Temp directory for other working and tmp directories</description>
</property>
Finally, we are going to modify mapred-site.xml. We are going to modify mapreduce.framework.name, which will decide which runtime framework is to be used. The possible values are local, classic, or yarn. We have to add the following code to the mapred-site.xml file.
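In Hadoop 2.7.7 only mapred-site.xml.template ships by default, so copy it to mapred-site.xml first if needed. A minimal version of the property block, setting the framework to yarn because we use YARN as the cluster manager, looks like this:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>Runtime framework for MapReduce</description>
</property>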
We have updated the property files. Now we have to format the NameNode so that all the changes are reflected in our framework. The following command formats the NameNode.
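The standard formatting command, run from our Hadoop installation directory, is:
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/bin/hdfs namenode -format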
18/06/13 18:14:49 INFO namenode.FSImageFormatProtobuf: Image file /allBigData/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 331 bytes saved in 0 seconds.
18/06/13 18:14:50 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/06/13 18:14:50 INFO util.ExitUtil: Exiting with status 0
18/06/13 18:14:50 INFO namenode.NameNode: SHUTDOWN_MSG:
Hadoop has been installed. We have to start Hadoop now. We can find the Hadoop startup scripts in /allBigData/hadoop/sbin/. We have to run the start-dfs.sh and start-yarn.sh scripts, in sequence.
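Running them from the sbin directory looks like this; the output lines below are from start-yarn.sh:
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/start-dfs.sh
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/start-yarn.sh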
starting resourcemanager, logging to /allBigData/hadoop/logs/yarn-pysparksqlbook-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /allBigData/hadoop/logs/yarn-pysparksqlbook-nodemanager-localhost.localdomain.out
Step 2-1-12. Checking the Installation of Hadoop
We know that the jps command shows all the Java processes running on the machine. If everything is fine, we will see the following processes:
[pysparksqlbook@localhost binaries]$ jps
Here is the output:
13441 NodeManager
13250 ResourceManager
12054 DataNode
14054 Jps
12423 SecondaryNameNode
11898 NameNode
Congratulations to us. We have finally installed Hadoop on our system.
Step 2-1-13. Stopping the Hadoop Processes
As we started Hadoop using the two scripts start-dfs.sh and start-yarn.sh in sequence, in a similar fashion, we can stop the Hadoop process using the stop-dfs.sh and stop-yarn.sh shell scripts, in sequence.
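A sketch of the stop sequence:
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/stop-dfs.sh
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/stop-yarn.sh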
Recipe 2-2. Install Spark on a Single Machine
Problem
You want to install Spark on a single machine.
Solution
We are going to install the prebuilt Spark 2.3.0 package for Hadoop 2.7. We could build Spark from the source code, but we are going to use the prebuilt Apache Spark distribution.
How It Works
Follow these steps to complete the installation.
Step 2-2-1. Downloading Apache Spark
We are going to download Spark from its mirror. We are going to use the wget command for that purpose, as follows.
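Assuming the Apache archive mirror and the prebuilt package for Hadoop 2.7, the command looks like this:
[pysparksqlbook@localhost binaries]$ wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz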
The Spark environment file holds all the environment variables required to run Spark. We are going to set the following environment variables in it.
HADOOP_CONF_DIR: Configuration directory of Hadoop.
SPARK_LOG_DIR: Where log files are stored (Default: ${SPARK_HOME}/logs)
SPARK_WORKER_DIR: To set the working directory of any worker processes
HIVE_CONF_DIR: To read data from Hive
First we have to copy the spark-env.sh.template file to spark-env.sh. The Spark environment file, spark-env.sh, is found inside the spark/conf directory (the Spark configuration directory):
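A hedged sketch follows, assuming Spark has been extracted and moved to /allBigData/spark; the log, worker, and Hive paths are assumptions, while the Hadoop configuration directory matches the one used earlier.
[pysparksqlbook@localhost binaries]$ cp /allBigData/spark/conf/spark-env.sh.template /allBigData/spark/conf/spark-env.sh
Then add lines like the following to spark-env.sh:
export HADOOP_CONF_DIR=/allBigData/hadoop/etc/hadoop
export SPARK_LOG_DIR=/home/pysparksqlbook/sparkLogs
export SPARK_WORKER_DIR=/home/pysparksqlbook/sparkWorker
export HIVE_CONF_DIR=/allBigData/hive/conf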
We can start the PySpark shell using the pyspark script. Discussion about the pyspark script will continue in the next recipe.
[pysparksqlbook@localhost binaries]$ pyspark
We have one more successful installation under our belt, but more installations are required to work through this book. Before going further, it is better to concentrate on the PySpark shell.
Recipe 2-3. Use the PySpark Shell
Problem
You want to use the PySpark shell.
Solution
The PySpark shell is an interactive shell to interact with PySpark using Python. The PySpark shell can be started using the pyspark script. The pyspark script can be found at spark/bin.
How It Works
The PySpark shell can be started as follows.
[pysparksqlbook@localhost binaries]$ pyspark
After starting, it will show the screen in Figure 2-1.
We can observe that PySpark displays a lot of information after starting, including the Python and Spark versions it is using.
The >>> symbol is Python’s command prompt. Whenever we start the Python shell, we get this symbol. It tells us that we can now write our Python commands. Similarly in PySpark, it tells us that we can now write our Python or PySpark command and see the result.
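For example, we can type a simple Python expression at the prompt and see its result immediately:
>>> 2 + 3
5
The shell also creates a SparkSession for us, available through the spark variable.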
The PySpark shell works in a similar fashion on a single machine installation and a cluster installation of PySpark.
Recipe 2-4. Install Hive on a Single Machine
Problem
You want to install Hive on a single machine.
Solution
We discussed Hive in the first chapter. Now it is time to install Hive on our machine. We are going to read data from Hive into PySparkSQL in coming chapters.
How It Works
Follow these steps to complete the Hive installation.
Step 2-4-1. Downloading Hive
We can download Hive from the Apache Hive website. We can download the Hive tar.gz file using the wget command, as follows.
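Assuming the Apache archive mirror for Hive 2.3.3, the command looks like this:
[pysparksqlbook@localhost binaries]$ wget https://archive.apache.org/dist/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz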
We have downloaded the apache-hive-2.3.3-bin.tar.gz file. It is a .tar.gz file, so we have to extract it. We can extract it using the tar command as follows.
[pysparksqlbook@localhost binaries]$ tar xvzf apache-hive-2.3.3-bin.tar.gz
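As with Hadoop, we move the extracted directory under the installation directory. This is a sketch; the directory name hive matches the paths used later in this recipe:
[pysparksqlbook@localhost binaries]$ sudo mv apache-hive-2.3.3-bin /allBigData/hive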
Hive ships with an embedded Derby database for its metastore. By default, the embedded metastore is created in whatever directory Hive is started from, so it is better to provide a definite location for it. We provide that location in hive-site.xml. For that, we have to copy hive-default.xml.template to hive-site.xml.
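A sketch of the copy, assuming Hive now lives under /allBigData/hive:
[pysparksqlbook@localhost binaries]$ cp /allBigData/hive/conf/hive-default.xml.template /allBigData/hive/conf/hive-site.xml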
The /user/hive/warehouse directory is the Hive warehouse directory.
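We create the warehouse directory in HDFS and make it group-writable; this is a hedged sketch using the standard Hive setup commands:
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/bin/hdfs dfs -mkdir -p /user/hive/warehouse
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/bin/hdfs dfs -chmod g+w /user/hive/warehouse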
Step 2-4-7. Initiating the Metastore Database
Sometimes it is necessary to initialize the schema. You might be thinking, the schema of what? We know that Hive stores the metadata of tables in a relational database. For the time being, we are going to use the Derby database as the metastore database of Hive. Then, in coming recipes, we are going to connect our Hive to an external PostgreSQL database. On Ubuntu, Hive installation works without this command, but on CentOS I found it indispensable to run. Without the following command, Hive was throwing errors.
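The schema is initialized with Hive’s schematool utility; for the embedded Derby metastore the command looks like this (the path is an assumption based on our installation directory):
[pysparksqlbook@localhost binaries]$ /allBigData/hive/bin/schematool -dbType derby -initSchema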
Now Hive has been installed. We should check that it works. We can start the Hive shell using the following command.
[pysparksqlbook@localhost binaries]$ hive
After this command, we will find that the Hive shell has been opened as follows:
hive>
Recipe 2-5. Install PostgreSQL
Problem
You want to install PostgreSQL.
Solution
PostgreSQL is a relational database management system developed at the University of California, Berkeley. It is released under the PostgreSQL License, which grants permission to use, modify, and distribute it. PostgreSQL runs on macOS and UNIX-like systems such as Red Hat, Ubuntu, and others. We are going to install it on CentOS.
We are going to use PostgreSQL in two ways. First, we will use it as the metastore database for Hive; with an external database as the metastore, we will be able to read data from the existing Hive installation easily. Second, we will read data from PostgreSQL into PySpark and, after analysis, save our results back to PostgreSQL.
PostgreSQL can be installed from source code, but we are going to install it using the command-line yum installer.
How It Works
Follow these steps to complete the PostgreSQL installation.
Step 2-5-1. Installing PostgreSQL
PostgreSQL can be installed using the yum installer. The following code will install PostgreSQL.
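A sketch of the installation and first-time setup on CentOS 7 follows; the initdb and service commands are assumptions about a fresh system:
[pysparksqlbook@localhost binaries]$ sudo yum install postgresql-server postgresql-contrib
[pysparksqlbook@localhost binaries]$ sudo postgresql-setup initdb
[pysparksqlbook@localhost binaries]$ sudo systemctl start postgresql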
Recipe 2-6. Configure the Hive Metastore on PostgreSQL
Problem
You want to configure Hive metastore on PostgreSQL.
Solution
As we know, Hive puts the metadata of tables in a relational database. We have already installed Hive, which has an embedded metastore; by default, Hive uses the Derby relational database system for that metastore. In coming chapters, we have to read existing Hive tables from PySpark.
Configuration of a Hive metastore on PostgreSQL requires us to populate tables in the PostgreSQL database. These tables will hold metadata of Hive tables. After this, we have to configure the Hive property file.
How It Works
In the following steps, we are going to configure a Hive metastore on the PostgreSQL database. Then our Hive will have metadata in PostgreSQL.
Step 2-6-1. Downloading the PostgreSQL JDBC Connector
We need the JDBC connector so that the Hive process can connect to an external PostgreSQL. We can get the JDBC connector using the following command.
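The connector jar can be downloaded from the PostgreSQL JDBC site and copied into Hive’s lib directory; the version number below is only an example:
[pysparksqlbook@localhost binaries]$ wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
[pysparksqlbook@localhost binaries]$ cp postgresql-42.2.5.jar /allBigData/hive/lib/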
Step 2-6-4. Creating the Required User and Database
In the following lines of this step, we are going to create a PostgreSQL user named pysparksqlbookUser. Then we are going to create a database named pymetastore. This database is going to hold all the tables related to the Hive metastore.
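The postgres=# prompt below comes from the psql client, which we start as the postgres system user (a sketch):
[pysparksqlbook@localhost binaries]$ sudo -u postgres psql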
postgres=# CREATE USER pysparksqlbookUser WITH PASSWORD 'pbook';
Here is the output:
CREATE ROLE
postgres=# CREATE DATABASE pymetastore;
Here is the output:
CREATE DATABASE
The \c PostgreSQL command stands for connect. We created our database named pymetastore. Now we are going to connect to this database using the \c command.
postgres=# \c pymetastore
We are now connected to the pymetastore database. More PostgreSQL commands are described in the PostgreSQL documentation.
Step 2-6-5. Populating Data in the pymetastore Database
Hive possesses its own PostgreSQL scripts to populate tables for the metastore. The \i command reads commands from the given PostgreSQL script and executes them. In the following command, we are going to run the hive-txn-schema-2.3.0.postgres.sql script, which will create the transaction tables required for the Hive metastore.
pymetastore=# \i /allBigData/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.3.0.postgres.sql
Here is the output:
psql:/allBigData/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.3.0.postgres.sql:30: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "txns_pkey" for table "txns"
CREATE TABLE
CREATE TABLE
INSERT 0 1
psql:/allBigData/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.3.0.postgres.sql:69: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "hive_locks_pkey" for table "hive_locks"
CREATE TABLE
Step 2-6-6. Granting Permissions
The following commands will grant some permissions.
pymetastore=# grant select, insert,update,delete on public.txns to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.txn_components to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.completed_txn_components to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.next_txn_id to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.hive_locks to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.next_lock_id to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.compaction_queue to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.next_compaction_queue_id to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.completed_compactions to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.aux_table to pysparksqlbookUser;
Here is the output:
GRANT
Step 2-6-7. Changing the pg_hba.conf File
Remember that you must be the root user in order to update pg_hba.conf. So first switch to the root user and then open the pg_hba.conf file.
[root@localhost binaries]# vim /var/lib/pgsql/data/pg_hba.conf
Then change all the peer and ident settings to trust.
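After the change, lines that originally ended in peer or ident look like the following (a sketch; your file may list different addresses):
local   all             all                                     trust
host    all             all             127.0.0.1/32            trust
host    all             all             ::1/128                 trust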
We also have to point Hive’s metastore at PostgreSQL by setting the JDBC connection properties in hive-site.xml. The connection URL property, javax.jdo.option.ConnectionURL, should point to the pymetastore database (for example, jdbc:postgresql://localhost:5432/pymetastore). The driver class, user name, and password properties are as follows.
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
<description>Driver class of postgreSQL</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>pysparksqlbookuser</value>
<description>User name to connect to postgreSQL</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>pbook</value>
<description>password for connecting to PostgreSQL server</description>
</property>
Step 2-6-10. Starting Hive
We have connected Hive to an external relational database management system. So it is time to start Hive and ensure that everything is fine.
[pysparksqlbook@localhost binaries]$ hive
Our activities will be reflected in PostgreSQL. Let’s create a database and a table inside that database. The following commands create a database named apress and a table called apressBooks inside that database.
hive> create database apress;
Here is the output:
OK
Time taken: 1.397 seconds
hive> use apress;
Here is the output:
OK
Time taken: 0.07 seconds
hive> create table apressBooks (
> bookName String,
> bookWriter String
> )
> row format delimited
> fields terminated by ',';
Here is the output:
OK
Time taken: 0.581 seconds
Step 2-6-11. Testing if Metadata Is Created in PostgreSQL
The database and table we created will be reflected in PostgreSQL. We can see the updated data in the TBLS table, as follows.
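A hedged way to check is to connect with psql and query the table; Hive’s PostgreSQL schema creates it with the quoted uppercase name "TBLS":
[pysparksqlbook@localhost binaries]$ psql -U pysparksqlbookuser -d pymetastore
pymetastore=> select * from "TBLS";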
The work of connecting Hive to an external database is done. In the following recipe, we are going to connect PySpark to Hive.
Recipe 2-7. Connect PySpark to Hive
Problem
You want to connect PySpark to Hive.
Solution
PySpark needs the Hive property file to know the configuration parameters of Hive. The Hive property file, hive-site.xml, lives in the Hive conf directory. We simply copy the Hive property file to the Spark conf directory, and we are done. Then we can start PySpark.
How It Works
Connecting PySpark to Hive takes two steps.
Step 2-7-1. Copying the Hive Property File to the Spark conf Directory
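A sketch of the copy, assuming Hive and Spark are installed under /allBigData:
[pysparksqlbook@localhost binaries]$ cp /allBigData/hive/conf/hive-site.xml /allBigData/spark/conf/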
Recipe 2-8. Install MySQL
Problem
You want to install MySQL.
Solution
We can read data from MySQL using PySparkSQL. We can also save the output of our analysis into a MySQL database. We can install MySQL Server using the yum installer.
How It Works
Follow these steps to complete the MySQL installation.
Step 2-8-1. Installing the MySQL Server
The following command will install the MySQL Server.
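A hedged sketch follows; the exact package name depends on the repositories configured on your system (on a stock CentOS 7 installation, mysql-server is provided by the MySQL community Yum repository, while mariadb-server is the default alternative):
[pysparksqlbook@localhost binaries]$ sudo yum install mysql-server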