Hive and Impala

One of the design considerations for a new framework is always its compatibility with the old frameworks. For better or worse, most data analysts still work with SQL. The roots of SQL go back to an influential relational-modeling paper (Codd, Edgar F. (June 1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–87). All modern databases implement one version of SQL or another.

While the relational model was influential and important for bringing database performance, particularly for Online Transaction Processing (OLTP), to competitive levels, normalization is less critical for analytic workloads, where one needs to perform aggregations, and for situations where the relations themselves change and are subject to analysis. This section will cover the extensions of the standard SQL language in the analysis engines traditionally used for big data analytics: Hive and Impala. Both of them are currently Apache-licensed projects. The following table summarizes their complex types:

| Type | Hive support since version | Impala support since version | Comments |
|------|---------------------------|------------------------------|----------|
| ARRAY | 0.1.0; non-constant index expressions only as of 0.14 | 2.3.0 (Parquet tables only) | This can be an array of any type, including complex types. The index is an int in Hive (bigint in Impala); access is via array notation, for example, element[1], in Hive only (the ITEM and POS pseudocolumns in Impala). |
| MAP | 0.1.0; non-constant index expressions only as of 0.14 | 2.3.0 (Parquet tables only) | The key should be of a primitive type; some libraries support keys of the string type only. Fields are accessed using array notation, for example, map["key"], in Hive only (the KEY and VALUE pseudocolumns in Impala). |
| STRUCT | 0.5.0 | 2.3.0 (Parquet tables only) | Access is via dot notation, for example, struct.element. |
| UNIONTYPE | 0.7.0 | Not supported | Support is incomplete: queries that reference UNIONTYPE fields in JOIN (HIVE-2508), WHERE, and GROUP BY clauses fail, and Hive defines no syntax to extract the tag or value of a UNIONTYPE. This means that UNIONTYPEs are effectively look-at-only. |
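
To make the access syntax concrete, the first three types can be declared and queried in Hive along these lines (a minimal sketch; the table and column names are hypothetical):

CREATE TABLE complex_demo (
  id    INT,
  tags  ARRAY<STRING>,              -- accessed as tags[0]
  props MAP<STRING,DOUBLE>,         -- accessed as props["weight"]
  point STRUCT<x:DOUBLE, y:DOUBLE>  -- accessed as point.x
);

SELECT tags[0], props["weight"], point.x FROM complex_demo;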

While Hive/Impala tables can be created on top of many underlying file formats (Text, Sequence, ORC, Avro, Parquet, and even custom formats) and multiple serializations, in most practical instances Hive is used to read lines of text in ASCII files. The underlying serialization/deserialization (SerDe) implementation is LazySimpleSerDe. The format defines several levels of separators, as follows:

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
    [COLLECTION ITEMS TERMINATED BY char]
    [MAP KEYS TERMINATED BY char]
    [LINES TERMINATED BY char]
    [NULL DEFINED AS char]
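
For instance, a sketch of a table that overrides the default separators (again with hypothetical names) could look like this:

CREATE TABLE delimited_demo (
  id    INT,
  tags  ARRAY<STRING>,
  props MAP<STRING,DOUBLE>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;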

The defaults for the separators are '\001' or ^A, '\002' or ^B, and '\003' or ^C. In other words, a new separator is used at each level of the hierarchy, as opposed to the definition/repetition indicators in the Dremel encoding. For example, to encode the LabeledPoint table that we used before, we need to create a file as follows (shown with cat -A, which renders control characters in caret notation and marks each line end with $):

$ cat -A data
0^A1^B1^D1.0$
2^A1^B1^D3.0$
1^A0^B0.0^C2.0^C0.0$
3^A0^B0.0^C4.0^C0.0$
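
One way to produce these bytes is with printf, which expands \xNN hexadecimal escapes into raw characters (a sketch assuming bash or GNU printf):

$ printf '0\x011\x021\x041.0\n'           > data  # label 0, tag 1 (map): key 1 -> 1.0
$ printf '2\x011\x021\x043.0\n'          >> data  # label 2, tag 1 (map): key 1 -> 3.0
$ printf '1\x010\x020.0\x032.0\x030.0\n' >> data  # label 1, tag 0 (array): [0.0,2.0,0.0]
$ printf '3\x010\x020.0\x034.0\x030.0\n' >> data  # label 3, tag 0 (array): [0.0,4.0,0.0]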

Download Hive from http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.5.0.tar.gz and perform the following:

$ tar xf hive-1.1.0-cdh5.5.0.tar.gz 
$ cd hive-1.1.0-cdh5.5.0
$ bin/hive

hive> CREATE TABLE LABELED_POINT ( LABEL INT, VECTOR UNIONTYPE<ARRAY<DOUBLE>, MAP<INT,DOUBLE>> ) STORED AS TEXTFILE;
OK
Time taken: 0.453 seconds
hive> LOAD DATA LOCAL INPATH './data' OVERWRITE INTO TABLE LABELED_POINT;
Loading data to table alexdb.labeled_point
Table labeled_point stats: [numFiles=1, numRows=0, totalSize=52, rawDataSize=0]
OK
Time taken: 0.808 seconds
hive> select * from labeled_point;
OK
0  {1:{1:1.0}}
2  {1:{1:3.0}}
1  {0:[0.0,2.0,0.0]}
3  {0:[0.0,4.0,0.0]}
Time taken: 0.569 seconds, Fetched: 4 row(s)
hive>

In Spark, selecting from a relational table is supported via the sqlContext.sql method, but unfortunately the Hive union types are not directly supported as of Spark 1.6.1; it does support maps and arrays, though. The limited support for complex objects in other BI and data analysis tools remains the biggest obstacle to their adoption. Representing everything as rich data structures in Scala is one option for converging on a nested data representation.
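
For maps and arrays, the round trip does work. Here is a sketch of querying the hypothetical complex_demo table from earlier in the Spark shell (assuming a Spark 1.6.x build with Hive support, where sqlContext is a HiveContext):

scala> // element access for arrays and maps works in Spark SQL 1.6
scala> val df = sqlContext.sql("SELECT tags[0] AS tag, props['weight'] AS weight FROM complex_demo")
scala> df.printSchema()
scala> df.show()
scala> // a similar query against labeled_point fails, since UNIONTYPE is unsupported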
