The different steps to set up Hadoop are as follows:
- Download the JAR from the Maven repository at http://repo1.maven.org/maven2/org/mongodb/mongo-hadoop/mongo-hadoop-core/2.0.2/.
- Download mongo-java-driver from https://oss.sonatype.org/content/repositories/releases/org/mongodb/mongodb-driver/3.5.0/.
- Create a directory (in our case, named mongo_lib), copy these two JARs into it, and then add the directory to Hadoop's classpath with the following command:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:<path_to_directory>/mongo_lib/
Alternatively, we can copy these JARs under the share/hadoop/common/ directory. As these JARs need to be available on every node, for a clustered deployment it's easier to use Hadoop's DistributedCache to distribute them to all nodes.
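The classpath setup above can be sketched end to end as follows. This is a minimal sketch: the download location (~/Downloads) and the mongo_lib directory under the home directory are assumptions; adjust the paths to your environment:

```shell
# Create the library directory for the connector JARs.
LIB_DIR="$HOME/mongo_lib"
mkdir -p "$LIB_DIR"

# Copy the two downloaded JARs into it (source paths are hypothetical).
cp ~/Downloads/mongo-hadoop-core-2.0.2.jar "$LIB_DIR"/ 2>/dev/null || true
cp ~/Downloads/mongodb-driver-3.5.0.jar "$LIB_DIR"/ 2>/dev/null || true

# Make the JARs visible to Hadoop by appending the directory to its classpath.
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$LIB_DIR/"
echo "$HADOOP_CLASSPATH"
```

Appending to the existing HADOOP_CLASSPATH (rather than overwriting it) preserves any entries Hadoop already relies on.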
- The next step is to install Hive from https://hive.apache.org/downloads.html. For this example, we used a MySQL server for Hive's metastore data. This can be a local MySQL server for development, but it is recommended that you use a remote server for production environments.
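Pointing Hive's metastore at MySQL is done in hive-site.xml. The following is a hedged sketch of the relevant properties; the host, database name, and credentials are placeholders, not values from this setup:

```xml
<!-- Hypothetical hive-site.xml fragment for a MySQL-backed metastore.
     Replace host, database, user, and password with your own values. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```

For a production deployment, the ConnectionURL would point at the remote MySQL server instead of localhost.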
- Once we have Hive set up, we just run the following command:
> hive
- Then, we add three JARs: the mongo-hadoop-core and mongodb-driver JARs that we downloaded earlier, plus the mongo-hadoop-hive JAR, which is available from the same Maven repository:
hive> add jar /Users/dituser/code/hadoop-2.8.1/mongo-hadoop-core-2.0.2.jar;
Added [/Users/dituser/code/hadoop-2.8.1/mongo-hadoop-core-2.0.2.jar] to class path
Added resources: [/Users/dituser/code/hadoop-2.8.1/mongo-hadoop-core-2.0.2.jar]
hive> add jar /Users/dituser/code/hadoop-2.8.1/mongodb-driver-3.5.0.jar;
Added [/Users/dituser/code/hadoop-2.8.1/mongodb-driver-3.5.0.jar] to class path
Added resources: [/Users/dituser/code/hadoop-2.8.1/mongodb-driver-3.5.0.jar]
hive> add jar /Users/dituser/code/hadoop-2.8.1/mongo-hadoop-hive-2.0.2.jar;
Added [/Users/dituser/code/hadoop-2.8.1/mongo-hadoop-hive-2.0.2.jar] to class path
Added resources: [/Users/dituser/code/hadoop-2.8.1/mongo-hadoop-hive-2.0.2.jar]
hive>
And then, assuming our data is in the table exchanges:
| Column         | Type      |
|----------------|-----------|
| customerid     | int       |
| pair           | String    |
| time           | TIMESTAMP |
| recommendation | int       |
We can also use Gradle or Maven to download the JARs into our local project. If we only need MapReduce, we just download the mongo-hadoop-core JAR. For Pig, Hive, Streaming, and so on, we must download the appropriate JARs from http://repo1.maven.org/maven2/org/mongodb/mongo-hadoop/.
Some useful Hive commands include the following:

show databases;
create table exchanges(customerid int, pair String, time TIMESTAMP, recommendation int);
- Now that we are all set, we can create a MongoDB collection backed by our local Hive data:
hive> create external table exchanges_mongo (
          objectid STRING,
          customerid INT,
          pair STRING,
          time STRING,
          recommendation INT
      )
      STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
      WITH SERDEPROPERTIES ('mongo.columns.mapping'='{"objectid":"_id", "customerid":"customerid", "pair":"pair", "time":"Timestamp", "recommendation":"recommendation"}')
      tblproperties('mongo.uri'='mongodb://localhost:27017/exchange_data.xmr_btc');
- Finally, we can copy all data from the exchanges Hive table into MongoDB as follows:
hive> insert into table exchanges_mongo select * from exchanges;
This way, we have established a pipeline between Hadoop and MongoDB using Hive, without any external server.
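Once the insert completes, each Hive row is stored as a document in the exchange_data.xmr_btc collection. Given the column mapping defined above, a stored document would look roughly like the following sketch (all field values here are invented for illustration):

```
{
  "_id" : ObjectId("..."),
  "customerid" : 42,
  "pair" : "XMR/BTC",
  "Timestamp" : "2018-01-01 12:00:00",
  "recommendation" : 1
}
```

Note that the Hive time column is written to the Timestamp field, as specified in the mongo.columns.mapping property.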