Esri GIS tools for Hadoop

With your environment set up and some basic knowledge of Ambari, HDFS, and Hive, you will now learn how to add a spatial component to your queries. To do so, we will use the Esri GIS tools for Hadoop.

The first step is to download the files from the GitHub repository at: https://github.com/Esri/gis-tools-for-hadoop. You will be using Ambari to move the files to HDFS, not the container, so download these files to your local machine.

Esri has a tutorial for downloading the files by using ssh to connect to the container and then using git to clone the repository. You can follow these instructions here: https://github.com/Esri/gis-tools-for-hadoop/wiki/GIS-Tools-for-Hadoop-for-Beginners.
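If you would rather follow that route instead of downloading a ZIP as described next, a minimal sketch looks like the following; the ssh user and host are placeholders that depend on how your container is set up:

# connect to the running container (user, host, and port depend on your setup)
ssh <user>@<container-host>
# clone the repository inside the container
git clone https://github.com/Esri/gis-tools-for-hadoop.git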

You can download the files by using the GitHub Clone or download button on the right-hand side of the repository. To unzip the archive, use one of the following commands:

unzip gis-tools-for-hadoop-master.zip
unzip gis-tools-for-hadoop-master.zip -d /home/pcrickard

The first command will unzip the file in the current directory, which is most likely the Downloads folder of your home directory. The second command will unzip the file, but by passing -d and a path, it will unzip to that location. In this case, this is the root of my home directory.

Now that you have the files unzipped, you can open the Files View in Ambari by selecting it from the box icon drop-down menu. Select Upload and a modal will open, allowing you to drop a file. On your local machine, browse to the location of the Esri Java ARchive (JAR) files. If you moved the zip to your home directory, the path will be similar to /home/pcrickard/gis-tools-for-hadoop-master/samples/lib. You will have three JAR files:

  • esri-geometry-api-2.0.0.jar
  • spatial-sdk-hive-2.0.0.jar
  • spatial-sdk-json-2.0.0.jar

Move each of these three files to the root folder in Ambari. This is the / directory, which is the default location that opens when you launch Files View.
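If you prefer the command line to the Files View, a roughly equivalent sketch (assuming the repository is also available inside the container, for example via the git clone approach mentioned earlier, and that your user can write to / on HDFS) looks like this:

# change into the lib folder of the unzipped (or cloned) tools
cd gis-tools-for-hadoop-master/samples/lib
# copy the three JARs to the root of HDFS
hdfs dfs -put esri-geometry-api-2.0.0.jar /
hdfs dfs -put spatial-sdk-hive-2.0.0.jar /
hdfs dfs -put spatial-sdk-json-2.0.0.jar /
# confirm they arrived
hdfs dfs -ls /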

Next, you would normally move the data to HDFS as well; however, you already did that in the previous example. In this example, you will leave the data files on your local machine and learn how to load them into a Hive table without first putting them on HDFS.

Now you are ready to execute the spatial query in Hive. From the box icon drop-down, select Hive View 2.0. In the query pane, enter the following code:

add jar hdfs:///esri-geometry-api-2.0.0.jar;
add jar hdfs:///spatial-sdk-json-2.0.0.jar;
add jar hdfs:///spatial-sdk-hive-2.0.0.jar;

create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';

drop table earthquakes;
drop table counties;

CREATE TABLE earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE, magtype STRING, mbstations STRING, gap STRING, distance STRING, rms STRING, source STRING, eventid STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

LOAD DATA LOCAL INPATH '/gis-tools-for-hadoop-master/samples/data/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;

LOAD DATA LOCAL INPATH '/gis-tools-for-hadoop-master/samples/data/counties-data/california-counties.json' OVERWRITE INTO TABLE counties;

SELECT counties.name, count(*) cnt FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt desc;

Running the preceding code will take some time, depending on your machine. The end result is a list of California county names, each with a count of the earthquakes that fall within its boundary, sorted from most to least.

The previous code and results were presented without explanation so that you could get the example working and see the output. Now the code will be explained block by block.

The first block of code is shown as follows:

add jar hdfs:///esri-geometry-api-2.0.0.jar;
add jar hdfs:///spatial-sdk-json-2.0.0.jar;
add jar hdfs:///spatial-sdk-hive-2.0.0.jar;

create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';

This block adds the JAR files from their HDFS location, in this case the / folder. Once the JAR files are loaded, the code can create the ST_Point and ST_Contains functions by referencing the classes inside them; a JAR file may contain many Java classes. The order of the add jar statements matters.
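If you want to confirm that everything registered before moving on, you can ask Hive to describe the new functions; list jars shows what was added to the session (it works in the Hive CLI and Beeline, though support can vary by client):

-- list the JARs added to this session
list jars;
-- confirm the temporary functions exist
describe function ST_Point;
describe function ST_Contains;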

The following block drops the two tables, earthquakes and counties. If you have never run the example before, you can skip these lines:

drop table earthquakes;
drop table counties;
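A common alternative, if you want the intent to be explicit, is the IF EXISTS form, which never complains when a table is missing:

drop table if exists earthquakes;
drop table if exists counties;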

Next, the code creates the earthquakes and counties tables. The earthquakes table is created by passing each field name and type to CREATE TABLE. The row format is declared as delimited text with fields terminated by ',' (a CSV), and the table is stored as a plain text file:

CREATE TABLE earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE, magtype STRING, mbstations STRING, gap STRING, distance STRING, rms STRING, source STRING, eventid STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
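If you want to double-check the result, DESCRIBE lists the columns and types that Hive recorded for the table:

describe earthquakes;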

The counties table is created in a similar fashion by passing the field names and types to CREATE TABLE, but the data is in JSON format and will use the com.esri.hadoop.hive.serde.EsriJsonSerDe class from the spatial-sdk-json-2.0.0 JAR that you imported. STORED AS INPUTFORMAT and OUTPUTFORMAT are required for Hive to know how to parse and work with the JSON data:

CREATE TABLE counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
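Here, too, you can verify the definition; DESCRIBE FORMATTED also reports the SerDe, input format, and output format attached to the table:

describe formatted counties;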

The next two blocks load the data into the newly created tables. The data exists on your local machine and not on HDFS. To use the local data without first loading it into HDFS, you add the LOCAL keyword to LOAD DATA INPATH and specify the local path of the data:

LOAD DATA LOCAL INPATH '/gis-tools-for-hadoop-master/samples/data/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;

LOAD DATA LOCAL INPATH '/gis-tools-for-hadoop-master/samples/data/counties-data/california-counties.json' OVERWRITE INTO TABLE counties;
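As a quick sanity check (not part of the original example), you can count the rows in each table and look at a few earthquake records to confirm the loads worked:

select count(*) from earthquakes;
select count(*) from counties;
select * from earthquakes limit 5;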

With the JAR files loaded and the tables created and populated with data, you can now run a spatial query using the two functions you defined, ST_Point and ST_Contains. They are used just as in the examples from Chapter 3, Introduction to Geodatabases:

 SELECT counties.name, count(*) cnt FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape,
ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt desc;

The previous query selects the name of each county and a count of earthquakes by passing the county geometry and the location of each earthquake, as a point, to ST_Contains. The results are grouped by county name and sorted by the count in descending order.
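The same pattern extends to other questions. As an illustrative variation, not part of the Esri sample, the following counts only earthquakes with a magnitude greater than 4 in each county:

SELECT counties.name, count(*) cnt FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
AND earthquakes.magnitude > 4.0
GROUP BY counties.name
ORDER BY cnt desc;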
