In the upcoming chapters, we are going to solve many problems using PySpark. PySpark also interacts with many other Big Data frameworks to provide end-to-end solutions. PySpark might read data from HDFS, NoSQL databases, or a relational database management system. After data analysis, we can save the results into HDFS or databases.
This chapter deals with all the software installations that are required to work through this book. We are going to install all the required Big Data frameworks on the CentOS operating system. CentOS is an enterprise-class operating system; it is free to use and easily available. We can download CentOS from https://www.centos.org/download/ and install it on a virtual machine.
In this chapter, we are going to discuss the following recipes:
Recipe 2-1. Install Hadoop on a single machine
Recipe 2-2. Install Spark on a single machine
Recipe 2-3. Use the PySpark shell
Recipe 2-4. Install Hive on a single machine
Recipe 2-5. Install PostgreSQL
Recipe 2-6. Configure the Hive metastore on PostgreSQL
Recipe 2-7. Connect PySpark to Hive
Recipe 2-8. Install MySQL
Recipe 2-9. Install MongoDB
Recipe 2-10. Install Cassandra
I suggest that you install every piece of software on your own. It is a good exercise and will give you a deeper understanding of the components of each software package.
Recipe 2-1. Install Hadoop on a Single Machine
Problem
You want to install Hadoop on a single machine.
Solution
You might be wondering, why are we installing Hadoop while we are learning PySpark? Are we going to use Hadoop MapReduce as the distributed framework for our problem solving? Not at all. We are going to use two components of Hadoop—HDFS and YARN: HDFS for data storage and YARN as the cluster manager. In order to install Hadoop, we need to download and configure it.
How It Works
Follow these steps to complete the Hadoop installation.
Step 2-1-1. Creating a New CentOS User
First, we create a new user. You might be thinking, why a new user? Why can’t we install Hadoop under an existing user? The reason is that we want a dedicated user for all the Big Data frameworks. In the following lines of code, we create a user named pysparksqlbook.
[root@localhost book]# adduser pysparksqlbook
[root@localhost book]# passwd pysparksqlbook
Here is the output:
Changing password for user pysparksqlbook.
New password:
passwd: all authentication tokens updated successfully.
In the above code, the adduser command creates (adds) the user, and the Linux passwd command assigns a password to our new user, pysparksqlbook.
After creating the user, we have to add it to sudoers. Sudo stands for “superuser do”: using sudo, we can run any command as the superuser. We will use sudo to install the software.
Step 2-1-2. Adding the CentOS User to sudo
Then, we have to add our new user pysparksqlbook to sudoers. The following command will do this.
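One common way to do this on CentOS, assuming the wheel group is enabled for sudo in /etc/sudoers (the default on CentOS 7), is to add the user to that group:
[root@localhost book]# usermod -aG wheel pysparksqlbook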
We will create two directories—the binaries directory under the home directory to download software and the allBigData directory under the root / directory to install the Big Data frameworks.
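A minimal sketch of creating these directories follows; the chown line is an assumption so that our pysparksqlbook user can write to /allBigData.
[pysparksqlbook@localhost ~]$ mkdir /home/pysparksqlbook/binaries
[pysparksqlbook@localhost ~]$ sudo mkdir /allBigData
[pysparksqlbook@localhost ~]$ sudo chown pysparksqlbook:pysparksqlbook /allBigData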
Hadoop, Hive, Spark, and many other Big Data frameworks run on the JVM. That is why we are first going to install Java. We are going to use OpenJDK 8 for our purposes. We can install Java on CentOS using the yum installer. The following command installs Java using the yum installer.
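A typical command, assuming the OpenJDK 8 packages from the standard CentOS repositories, looks like this (the -devel package is included so that JDK tools such as jrunscript are available):
[pysparksqlbook@localhost binaries]$ sudo yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel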
Java has been installed. After installing any software, it is a good idea to check the installation to confirm that everything is fine.
In order to check the Java installation, I prefer the java -version command, which will return the version of JVM installed.
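We can run it from any directory:
[pysparksqlbook@localhost binaries]$ java -version
The last line of the output looks like this: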
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
Java has been installed. We have to look for the environment variable JAVA_HOME, which is going to be used by all the distributed frameworks. After installing Java, we can find the Java home variable by using jrunscript, as follows.
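A hedged one-liner using jrunscript (the exact path printed will depend on your system) is:
[pysparksqlbook@localhost binaries]$ jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));'
Hadoop’s startup scripts use SSH to launch daemons on localhost, so we also make sure that we can connect to localhost over SSH. Connecting for the first time produces output like the following.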
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:md4M1J6VEYQm3gSynB0gqIYFpesp6I2cRvlEvJOIFFE.
ECDSA key fingerprint is MD5:78:cf:a7:71:2e:38:c2:62:01:65:c2:4c:71:7e:3c:90.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Last login: Thu July 13 17:02:56 2018
Finally:
[pysparksqlbook@localhost ~]$ exit
Here is the output:
logout
Connection to localhost closed.
Step 2-1-5. Downloading Hadoop
We are now going to download Hadoop from the Apache website. As mentioned, we will download all the required packages to the binaries directory. We are going to use the wget command to download Hadoop.
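Assuming the Apache archive mirror, the command looks like this:
[pysparksqlbook@localhost binaries]$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz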
Step 2-1-6. Moving Hadoop Binaries to the Installation Directory
Our installation directory is called allBigData. The downloaded software is in hadoop-2.7.7.tar.gz, which is a compressed archive, so we first have to decompress it. We can decompress it using the tar command as follows:
[pysparksqlbook@localhost binaries]$ tar xvzf hadoop-2.7.7.tar.gz
Now we move Hadoop under the allBigData directory.
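A sketch of the move; the target directory name hadoop is an assumption consistent with the configuration paths used below:
[pysparksqlbook@localhost binaries]$ sudo mv hadoop-2.7.7 /allBigData/hadoop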
We have to make some changes to the Hadoop environment file. The Hadoop environment file is found in the Hadoop configuration directory. In our case, the Hadoop configuration directory is /allBigData/hadoop/etc/hadoop/. In the following line of code, we add JAVA_HOME to the hadoop-env.sh file.
[pysparksqlbook@localhost binaries]$ vim /allBigData/hadoop/etc/hadoop/hadoop-env.sh
After opening the Hadoop environment file, add the following line.
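A hedged example, assuming OpenJDK 8 was installed by yum into its default location (adjust the path to whatever jrunscript reported on your machine):
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk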
Next, we have to configure the following property files:
core-site.xml: Core properties related to the cluster
hdfs-site.xml: Properties of HDFS (NameNode, DataNode, and replication)
mapred-site.xml: Properties for the MapReduce framework
These properties files will be found in the Hadoop configuration directory. In the previous chapter, we discussed HDFS. We found that HDFS has two components—NameNode and DataNode. We also discussed that HDFS does data replication for fault tolerance. In our hdfs-site.xml file, we are going to set the namenode directory using the dfs.name.dir parameter, the datanode directory using the dfs.data.dir parameter, and the replication factor using the dfs.replication parameter.
Let’s modify hdfs-site.xml.
[pysparksqlbook@localhost binaries]$ vim /allBigData/hadoop/etc/hadoop/hdfs-site.xml
After opening hdfs-site.xml, we have to add the following lines to that file.
<property>
<name>dfs.name.dir</name>
<value>file:/allBigData/hdfs/namenode</value>
<description>NameNode location</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:/allBigData/hdfs/datanode</value>
<description>DataNode location</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description> Number of block replication </description>
</property>
After updating hdfs-site.xml, we are going to update core-site.xml. In core-site.xml, we are going to update two properties—fs.default.name and hadoop.tmp.dir. The fs.default.name property is used to determine the host, port, etc. of the filesystem. The hadoop.tmp.dir property determines the temporary directories for Hadoop. We have to add the following lines to core-site.xml.
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9745</value>
<description>Host port of file system</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/application/hadoop/tmp</value>
<description>Temp directory for other working and tmp directories</description>
</property>
Finally, we are going to modify mapred-site.xml. We are going to modify mapreduce.framework.name, which will decide which runtime framework is to be used. The possible values are local, classic, or yarn. We have to add the following code to the mapred-site.xml file.
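In Hadoop 2.7.7 only mapred-site.xml.template ships by default, so copy it to mapred-site.xml first if needed. A minimal version of the property block, setting the framework to yarn because we use YARN as the cluster manager, looks like this:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>Runtime framework for MapReduce</description>
</property>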
We have updated the property files. Now we have to format the NameNode so that all the changes are reflected in our framework. The following command formats the NameNode.
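The standard formatting command, run from our Hadoop installation directory, is:
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/bin/hdfs namenode -format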
18/06/13 18:14:49 INFO namenode.FSImageFormatProtobuf: Image file /allBigData/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 331 bytes saved in 0 seconds.
18/06/13 18:14:50 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/06/13 18:14:50 INFO util.ExitUtil: Exiting with status 0
18/06/13 18:14:50 INFO namenode.NameNode: SHUTDOWN_MSG:
Hadoop has been installed. We have to start Hadoop now. We can find the Hadoop startup scripts in /allBigData/hadoop/sbin/. We have to run the start-dfs.sh and start-yarn.sh scripts, in sequence.
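Running them from the sbin directory looks like this; the output lines below are from start-yarn.sh:
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/start-dfs.sh
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/start-yarn.sh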
starting resourcemanager, logging to /allBigData/hadoop/logs/yarn-pysparksqlbook-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /allBigData/hadoop/logs/yarn-pysparksqlbook-nodemanager-localhost.localdomain.out
Step 2-1-12. Checking the Installation of Hadoop
We know that the jps command shows all the Java processes running on the machine. If everything is fine, we will see the following processes:
[pysparksqlbook@localhost binaries]$ jps
Here is the output:
13441 NodeManager
13250 ResourceManager
12054 DataNode
14054 Jps
12423 SecondaryNameNode
11898 NameNode
Congratulations to us. We have finally installed Hadoop on our system.
Step 2-1-13. Stopping the Hadoop Processes
As we started Hadoop using the two scripts start-dfs.sh and start-yarn.sh in sequence, in a similar fashion, we can stop the Hadoop process using the stop-dfs.sh and stop-yarn.sh shell scripts, in sequence.
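A sketch of the stop sequence:
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/stop-dfs.sh
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/sbin/stop-yarn.sh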
Recipe 2-2. Install Spark on a Single Machine
Problem
You want to install Spark on a single machine.
Solution
We are going to install the prebuilt Spark 2.3.0 package for Hadoop 2.7. We could build Spark from the source code, but we are going to use the prebuilt Apache Spark distribution.
How It Works
Follow these steps to complete the installation.
Step 2-2-1. Downloading Apache Spark
We are going to download Spark from its mirror. We are going to use the wget command for that purpose, as follows.
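Assuming the Apache archive mirror and the prebuilt package for Hadoop 2.7, the command looks like this:
[pysparksqlbook@localhost binaries]$ wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz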
The Spark environment file holds all the environment variables required to run Spark. We are going to set the following environment variables in it.
HADOOP_CONF_DIR: Configuration directory of Hadoop.
SPARK_LOG_DIR: Where log files are stored (Default: ${SPARK_HOME}/logs)
SPARK_WORKER_DIR: To set the working directory of any worker processes
HIVE_CONF_DIR: To read data from Hive
First we have to copy the spark-env.sh.template file to spark-env.sh. The Spark environment file, spark-env.sh, is found inside the spark/conf directory (the Spark configuration directory):
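A hedged sketch follows, assuming Spark has been extracted and moved to /allBigData/spark; the log, worker, and Hive paths are assumptions, while the Hadoop configuration directory matches the one used earlier.
[pysparksqlbook@localhost binaries]$ cp /allBigData/spark/conf/spark-env.sh.template /allBigData/spark/conf/spark-env.sh
Then add lines like the following to spark-env.sh:
export HADOOP_CONF_DIR=/allBigData/hadoop/etc/hadoop
export SPARK_LOG_DIR=/home/pysparksqlbook/sparkLogs
export SPARK_WORKER_DIR=/home/pysparksqlbook/sparkWorker
export HIVE_CONF_DIR=/allBigData/hive/conf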
We can start the PySpark shell using the pyspark script. Discussion about the pyspark script will continue in the next recipe.
[pysparksqlbook@localhost binaries]$ pyspark
We have one more successful installation under our belt, but more installations are required to work through this book. Before going further, it is better to concentrate on the PySpark shell.
Recipe 2-3. Use the PySpark Shell
Problem
You want to use the PySpark shell.
Solution
The PySpark shell is an interactive shell to interact with PySpark using Python. The PySpark shell can be started using the pyspark script. The pyspark script can be found at spark/bin.
How It Works
The PySpark shell can be started as follows.
[pysparksqlbook@localhost binaries]$ pyspark
After starting, it will show the screen in Figure 2-1.
We can observe that PySpark displays a lot of information after starting, including the Python and Spark versions it is using.
The >>> symbol is Python’s command prompt. Whenever we start the Python shell, we get this symbol. It tells us that we can now write our Python commands. Similarly in PySpark, it tells us that we can now write our Python or PySpark command and see the result.
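For example, we can type a simple Python expression at the prompt and see its result immediately:
>>> 2 + 3
5
The shell also creates a SparkSession for us, available through the spark variable.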
The PySpark shell works in a similar fashion on a single machine installation and a cluster installation of PySpark.
Recipe 2-4. Install Hive on a Single Machine
Problem
You want to install Hive on a single machine.
Solution
We discussed Hive in the first chapter. Now it is time to install Hive on our machine. We are going to read data from Hive into PySparkSQL in coming chapters.
How It Works
Follow these steps to complete the Hive installation.
Step 2-4-1. Downloading Hive
We can download Hive from the Apache Hive website. We can download the Hive tar.gz file using the wget command, as follows.
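Assuming the Apache archive mirror for Hive 2.3.3, the command looks like this:
[pysparksqlbook@localhost binaries]$ wget https://archive.apache.org/dist/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz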
We have downloaded the apache-hive-2.3.3-bin.tar.gz file. It is a .tar.gz file, so we have to extract it. We can extract it using the tar command as follows.
[pysparksqlbook@localhost binaries]$ tar xvzf apache-hive-2.3.3-bin.tar.gz
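As with Hadoop, we move the extracted directory under the installation directory. This is a sketch; the directory name hive matches the paths used later in this recipe:
[pysparksqlbook@localhost binaries]$ sudo mv apache-hive-2.3.3-bin /allBigData/hive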
Hive ships with an embedded Derby database for its metastore. By default, the embedded metastore is created in whatever directory Hive is started from, so it is better to provide a definite location for it. We provide that location in hive-site.xml. For that, we have to copy hive-default.xml.template to hive-site.xml.
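A sketch of the copy, assuming Hive now lives under /allBigData/hive:
[pysparksqlbook@localhost binaries]$ cp /allBigData/hive/conf/hive-default.xml.template /allBigData/hive/conf/hive-site.xml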
The /user/hive/warehouse directory is the Hive warehouse directory.
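We create the warehouse directory in HDFS and make it group-writable; this is a hedged sketch using the standard Hive setup commands:
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/bin/hdfs dfs -mkdir -p /user/hive/warehouse
[pysparksqlbook@localhost binaries]$ /allBigData/hadoop/bin/hdfs dfs -chmod g+w /user/hive/warehouse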
Step 2-4-7. Initiating the Metastore Database
Sometimes it is necessary to initialize the schema. You might be thinking, the schema of what? We know that Hive stores the metadata of tables in a relational database. For the time being, we are going to use the Derby database as the metastore database of Hive. Then, in coming recipes, we are going to connect our Hive to an external PostgreSQL database. On Ubuntu, Hive installation works without this command, but on CentOS I found it indispensable to run. Without the following command, Hive was throwing errors.
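The schema is initialized with Hive’s schematool utility; for the embedded Derby metastore the command looks like this (the path is an assumption based on our installation directory):
[pysparksqlbook@localhost binaries]$ /allBigData/hive/bin/schematool -dbType derby -initSchema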
Now Hive has been installed. We should check that it works. We can start the Hive shell using the following command.
[pysparksqlbook@localhost binaries]$ hive
After this command, we will find that the Hive shell has been opened as follows:
hive>
Recipe 2-5. Install PostgreSQL
Problem
You want to install PostgreSQL.
Solution
PostgreSQL is a relational database management system developed at the University of California, Berkeley. It is released under the PostgreSQL License, which grants permission to use, modify, and distribute it. PostgreSQL runs on macOS and UNIX-like systems such as Red Hat, Ubuntu, and others. We are going to install it on CentOS.
We are going to use PostgreSQL in two ways. First, we will use it as the metastore database for Hive; with an external database as the metastore, we will be able to read data from the existing Hive installation easily. Second, we will read data from PostgreSQL into PySpark and, after analysis, save our results back to PostgreSQL.
PostgreSQL can be installed from source code, but we are going to install it using the command-line yum installer.
How It Works
Follow these steps to complete the PostgreSQL installation.
Step 2-5-1. Installing PostgreSQL
PostgreSQL can be installed using the yum installer. The following code will install PostgreSQL.
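A sketch of the installation and first-time setup on CentOS 7 follows; the initdb and service commands are assumptions about a fresh system:
[pysparksqlbook@localhost binaries]$ sudo yum install postgresql-server postgresql-contrib
[pysparksqlbook@localhost binaries]$ sudo postgresql-setup initdb
[pysparksqlbook@localhost binaries]$ sudo systemctl start postgresql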
Recipe 2-6. Configure the Hive Metastore on PostgreSQL
Problem
You want to configure Hive metastore on PostgreSQL.
Solution
As we know, Hive puts the metadata of tables in a relational database. We have already installed Hive, which has an embedded metastore; by default, Hive uses the Derby relational database system for that metastore. In coming chapters, we have to read existing Hive tables from PySpark.
Configuration of a Hive metastore on PostgreSQL requires us to populate tables in the PostgreSQL database. These tables will hold metadata of Hive tables. After this, we have to configure the Hive property file.
How It Works
In the following steps, we are going to configure a Hive metastore on the PostgreSQL database. Then our Hive will have metadata in PostgreSQL.
Step 2-6-1. Downloading the PostgreSQL JDBC Connector
We need the JDBC connector so that the Hive process can connect to an external PostgreSQL. We can get the JDBC connector using the following command.
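The connector jar can be downloaded from the PostgreSQL JDBC site and copied into Hive’s lib directory; the version number below is only an example:
[pysparksqlbook@localhost binaries]$ wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
[pysparksqlbook@localhost binaries]$ cp postgresql-42.2.5.jar /allBigData/hive/lib/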
Step 2-6-4. Creating the Required User and Database
In the following lines of this step, we are going to create a PostgreSQL user named pysparksqlbookUser. Then we are going to create a database named pymetastore. This database is going to hold all the tables related to the Hive metastore.
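The postgres=# prompt below comes from the psql client, which we start as the postgres system user (a sketch):
[pysparksqlbook@localhost binaries]$ sudo -u postgres psql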
postgres=# CREATE USER pysparksqlbookUser WITH PASSWORD 'pbook';
Here is the output:
CREATE ROLE
postgres=# CREATE DATABASE pymetastore;
Here is the output:
CREATE DATABASE
The \c PostgreSQL command stands for connect. We created our database named pymetastore. Now we are going to connect to this database using the \c command.
postgres=# \c pymetastore
We are now connected to the pymetastore database. More PostgreSQL commands are described in the PostgreSQL documentation.
Step 2-6-5. Populating Data in the pymetastore Database
Hive possesses its own PostgreSQL scripts to populate tables for the metastore. The \i command reads commands from the given PostgreSQL script and executes them. In the following command, we are going to run the hive-txn-schema-2.3.0.postgres.sql script, which will create the transaction tables required for the Hive metastore.
pymetastore=# \i /allBigData/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.3.0.postgres.sql
Here is the output:
psql:/allBigData/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.3.0.postgres.sql:30: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "txns_pkey" for table "txns"
CREATE TABLE
CREATE TABLE
INSERT 0 1
psql:/allBigData/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.3.0.postgres.sql:69: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "hive_locks_pkey" for table "hive_locks"
CREATE TABLE
Step 2-6-6. Granting Permissions
The following commands will grant some permissions.
pymetastore=# grant select, insert,update,delete on public.txns to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.txn_components to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.completed_txn_components to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.next_txn_id to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.hive_locks to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.next_lock_id to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.compaction_queue to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.next_compaction_queue_id to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.completed_compactions to pysparksqlbookUser;
Here is the output:
GRANT
pymetastore=# grant select, insert,update,delete on public.aux_table to pysparksqlbookUser;
Here is the output:
GRANT
Step 2-6-7. Changing the pg_hba.conf File
Remember that you must be the root user in order to update pg_hba.conf. So first switch to the root user and then open the pg_hba.conf file.
[root@localhost binaries]# vim /var/lib/pgsql/data/pg_hba.conf
Then change all the peer and ident settings to trust.
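After the change, lines that originally ended in peer or ident look like the following (a sketch; your file may list different addresses):
local   all             all                                     trust
host    all             all             127.0.0.1/32            trust
host    all             all             ::1/128                 trust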
We also have to point Hive’s metastore at PostgreSQL by setting the JDBC connection properties in hive-site.xml. The connection URL property, javax.jdo.option.ConnectionURL, should point to the pymetastore database (for example, jdbc:postgresql://localhost:5432/pymetastore). The driver class, user name, and password properties are as follows.
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
<description>Driver class of postgreSQL</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>pysparksqlbookuser</value>
<description>User name to connect to postgreSQL</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>pbook</value>
<description>password for connecting to PostgreSQL server</description>
</property>
Step 2-6-10. Starting Hive
We have connected Hive to an external relational database management system. So it is time to start Hive and ensure that everything is fine.
[pysparksqlbook@localhost binaries]$ hive
Our activities will be reflected in PostgreSQL. Let’s create a database and a table inside that database. The following commands create a database named apress and a table called apressBooks inside that database.
hive> create database apress;
Here is the output:
OK
Time taken: 1.397 seconds
hive> use apress;
Here is the output:
OK
Time taken: 0.07 seconds
hive> create table apressBooks (
> bookName String,
> bookWriter String
> )
> row format delimited
> fields terminated by ',';
Here is the output:
OK
Time taken: 0.581 seconds
Step 2-6-11. Testing if Metadata Is Created in PostgreSQL
The database and table we created will be reflected in PostgreSQL. We can see the updated data in the TBLS table, as follows.
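A hedged way to check is to connect with psql and query the table; Hive’s PostgreSQL schema creates it with the quoted uppercase name "TBLS":
[pysparksqlbook@localhost binaries]$ psql -U pysparksqlbookuser -d pymetastore
pymetastore=> select * from "TBLS";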
The work of connecting Hive to an external database is done. In the following recipe, we are going to connect PySpark to Hive.
Recipe 2-7. Connect PySpark to Hive
Problem
You want to connect PySpark to Hive.
Solution
PySpark needs the Hive property file to know the configuration parameters of Hive. The Hive property file, hive-site.xml, lives in the Hive conf directory. We simply copy the Hive property file to the Spark conf directory, and we are done. Then we can start PySpark.
How It Works
Connecting PySpark to Hive takes two steps.
Step 2-7-1. Copying the Hive Property File to the Spark conf Directory
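A sketch of the copy, assuming Hive and Spark are installed under /allBigData:
[pysparksqlbook@localhost binaries]$ cp /allBigData/hive/conf/hive-site.xml /allBigData/spark/conf/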
Recipe 2-8. Install MySQL
Problem
You want to install MySQL.
Solution
We can read data from MySQL using PySparkSQL. We can also save the output of our analysis into a MySQL database. We can install MySQL Server using the yum installer.
How It Works
Follow these steps to complete the MySQL installation.
Step 2-8-1. Installing the MySQL Server
The following command will install the MySQL Server.
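A hedged sketch follows; the exact package name depends on the repositories configured on your system (on a stock CentOS 7 installation, mysql-server is provided by the MySQL community Yum repository, while mariadb-server is the default alternative):
[pysparksqlbook@localhost binaries]$ sudo yum install mysql-server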