Hive supports many of the primitive data types you find in relational databases, as well as three collection data types that are rarely found in relational databases, for reasons we’ll discuss shortly.
A related concern is how these types are represented in text files, as well as alternatives to text storage that address various performance and other concerns. A unique feature of Hive, compared to most databases, is that it provides great flexibility in how data is encoded in files. Most databases take total control of the data, both how it is persisted to disk and its life cycle. By letting you control all these aspects, Hive makes it easier to manage and process data with a variety of tools.
Hive supports several sizes of integer and floating-point types, a Boolean type, and character strings of arbitrary length. Hive v0.8.0 added types for timestamps and binary fields.
Table 3-1 lists the primitive types supported by Hive.
Table 3-1. Primitive data types
As with other SQL dialects, the case of these names is ignored.
It’s useful to remember that each of these types is implemented in Java, so the particular behavior details will be exactly what you would expect from the corresponding Java types. For example, STRING is implemented by the Java String, FLOAT is implemented by the Java float, etc.
Note that Hive does not support “character arrays” (strings) with maximum-allowed lengths, as is common in other SQL dialects. Relational databases offer this feature as a performance optimization; fixed-length records are easier to index, scan, etc. In the “looser” world in which Hive lives, where it may not own the data files and has to be flexible on file format, Hive relies on the presence of delimiters to separate fields. Also, Hadoop and Hive emphasize optimizing disk reading and writing performance, where fixing the lengths of column values is relatively unimportant.
Values of the new TIMESTAMP type can be integers, which are interpreted as seconds since the Unix epoch time (midnight, January 1, 1970); floats, which are interpreted as seconds since the epoch time with nanosecond resolution (up to 9 decimal places); and strings, which are interpreted according to the JDBC date string format convention, YYYY-MM-DD hh:mm:ss.fffffffff.
TIMESTAMPS are interpreted as UTC times. Built-in functions for conversion to and from time zones are provided by Hive, to_utc_timestamp and from_utc_timestamp, respectively (see Chapter 13 for more details).
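To make these rules concrete, here is a brief sketch in HiveQL that assumes the interpretation rules just described; the table and column names (some_table, ts_column) are invented for illustration:

```sql
-- Casting the three literal forms to TIMESTAMP: an integer and a float
-- are interpreted as (fractional) seconds since the epoch, and a string
-- is interpreted as a JDBC-format date string:
SELECT cast(1330394400 AS TIMESTAMP),
       cast(1330394400.123456789 AS TIMESTAMP),
       cast('2012-02-28 02:00:00' AS TIMESTAMP)
FROM some_table;

-- Shifting a timestamp between UTC and a named time zone with the
-- built-in conversion functions:
SELECT to_utc_timestamp(ts_column, 'America/New_York'),
       from_utc_timestamp(ts_column, 'America/New_York')
FROM some_table;
```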
The BINARY type is similar to the VARBINARY type found in many relational databases. It’s not like a BLOB type, since BINARY columns are stored within the record, not separately like BLOBs. BINARY can be used as a way of including arbitrary bytes in a record and preventing Hive from attempting to parse them as numbers, strings, etc.
Note that you don’t need BINARY if your goal is to ignore the tail end of each record. If a table schema specifies three columns and the data files contain five values for each record, the last two will be ignored by Hive.
What if you run a query that wants to compare a float column to a double column, or compare a value of one integer type with a value of a different integer type? Hive will implicitly cast any integer to the larger of the two integer types, cast FLOAT to DOUBLE, and cast any integer value to DOUBLE, as needed, so it is comparing identical types.
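As a sketch of these promotion rules (the table some_numbers and its columns are hypothetical):

```sql
-- Assume columns: t TINYINT, i INT, f FLOAT, d DOUBLE.
-- t is promoted to INT before t = i is evaluated;
-- f is promoted to DOUBLE before f < d is evaluated;
-- i is promoted to DOUBLE before i < d is evaluated.
SELECT * FROM some_numbers
WHERE t = i AND f < d AND i < d;
```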
What if you run a query that wants to interpret a string column as a number? You can explicitly cast one type to another, as in the following example, where s is a string column that holds a value representing an integer:

... cast(s AS INT) ...;

(To be clear, AS and INT are keywords, so lowercase would be fine.)
We’ll discuss data conversions in more depth in Casting.
Hive supports columns that are structs, maps, and arrays. Note that the literal syntax examples in Table 3-2 are actually calls to built-in functions.
Table 3-2. Collection data types
As with the simple types, the case of the type name is ignored.
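For example, the literal syntax from Table 3-2 can appear directly in a query. A minimal sketch, with invented values (note that the struct() function assigns the positional field names col1, col2, etc.):

```sql
SELECT array('John Doe', 'Mary Smith'),             -- an ARRAY literal
       map('Federal Taxes', 0.2, 'Insurance', 0.1), -- a MAP literal
       struct('1 Main St', 'Chicago', 'IL', 60601)  -- a STRUCT literal
FROM some_table;
```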
Most relational databases don’t support such collection
types, because using them tends to break normal form.
For example, in traditional data models, structs might be captured in separate tables, with foreign key relations between the tables, as appropriate.
A practical problem with breaking normal form is the greater risk of data duplication, leading to unnecessary disk space consumption and potential data inconsistencies, as duplicate copies can grow out of sync as changes are made.
However, in Big Data systems, a benefit of sacrificing normal form is higher processing throughput. Scanning data off hard disks with minimal “head seeks” is essential when processing terabytes to petabytes of data. Embedding collections in records makes retrieval faster with minimal seeks. Navigating each foreign key relationship requires seeking across the disk, with significant performance overhead.
Hive doesn’t have the concept of keys. However, you can index tables, as we’ll see in Chapter 7.
Here is a table declaration that demonstrates how to use these types, an employees table in a fictitious Human Resources application:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);
The name is a simple string and, for most employees, a float is large enough for the salary. The list of subordinates is an array of string values, where we treat the name as a “primary key,” so each element in subordinates would reference another record in the table. Employees without subordinates would have an empty array. In a traditional model, the relationship would go the other way, from an employee to his or her manager. We’re not arguing that our model is better for Hive; it’s just a contrived example to illustrate the use of arrays.
The deductions is a map that holds a key-value pair for every deduction that will be subtracted from the employee’s salary when paychecks are produced. The key is the name of the deduction (e.g., “Federal Taxes”), and the value would either be a percentage or an absolute number. In a traditional data model, there might be separate tables for each deduction type (each key in our map), where the rows contain particular deduction values and a foreign key pointing back to the corresponding employee record.
Finally, the home address of each employee is represented as a struct, where each field is named and has a particular type.
Note that Java syntax conventions for generics are followed for the collection types. For example, MAP<STRING, FLOAT> means that every key in the map will be of type STRING and every value will be of type FLOAT. For an ARRAY<STRING>, every item in the array will be a STRING. STRUCTs can mix different types, but the locations are fixed to the declared position in the STRUCT.
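Although element access is covered in more depth later, a brief sketch against the employees table above shows how values of each collection type are read, using the standard HiveQL bracket and dot syntax:

```sql
SELECT name,
       subordinates[0],              -- first array element (0-based)
       deductions['Federal Taxes'],  -- map lookup by key
       address.city                  -- struct field by name
FROM employees;
```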
Let’s begin our exploration of file formats by looking at the simplest example, text files.
You are no doubt familiar with text files delimited with commas or tabs, the so-called comma-separated values (CSVs) or tab-separated values (TSVs), respectively. Hive can use those formats if you want, and we’ll show you how shortly. However, there is a drawback to both formats; you have to be careful about commas or tabs embedded in text that are not intended as field or column delimiters. For this reason, Hive uses various control characters by default, which are less likely to appear in value strings. Hive uses the term field when overriding these default delimiters, as we’ll see shortly. The default delimiters are listed in Table 3-3.
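As a sketch of overriding the defaults for a comma-delimited layout (the table name and columns here are invented; this uses the standard ROW FORMAT DELIMITED clauses):

```sql
CREATE TABLE csv_example (name STRING, salary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```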