After learning the many ways we can define and format tables, let’s learn how to run queries. Of course, we have assumed all along that you have some prior knowledge of SQL. We’ve used some queries already to illustrate several concepts, such as loading query data into other tables in Chapter 5. Now we’ll fill in most of the details. Some special topics will be covered in subsequent chapters.
We’ll move quickly through details that are familiar to users with prior SQL experience and focus on what’s unique to HiveQL, including syntax and feature differences, as well as performance implications.
SELECT is the projection operator in SQL. The FROM clause identifies from which table, view, or nested query we select records (see Chapter 7). For a given record, SELECT specifies the columns to keep, as well as the outputs of function calls on one or more columns (e.g., the aggregation functions like count(*)).
Recall again our partitioned employees table:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);
Let’s assume we have the same contents we showed in Text File Encoding of Data Values for four employees in the US state of Illinois (abbreviated IL). Here are queries of this table and the output they produce:
hive> SELECT name, salary FROM employees;
John Doe     100000.0
Mary Smith   80000.0
Todd Jones   70000.0
Bill King    60000.0
The following two queries are identical. The second version uses a table alias e, which is not very useful in this query, but becomes necessary in queries with JOINs (see JOIN Statements) where several different tables are used:
hive> SELECT name, salary FROM employees;
hive> SELECT e.name, e.salary FROM employees e;
When you select columns that are one of the collection types, Hive uses JSON (JavaScript Object Notation) syntax for the output. First, let's select the subordinates, an ARRAY, where a comma-separated list surrounded with [...] is used. Note that STRING elements of the collection are quoted, while the primitive STRING name column is not:
hive> SELECT name, subordinates FROM employees;
John Doe     ["Mary Smith","Todd Jones"]
Mary Smith   ["Bill King"]
Todd Jones   []
Bill King    []
The deductions is a MAP, where the JSON representation for maps is used, namely a comma-separated list of key:value pairs, surrounded with {...}:
hive> SELECT name, deductions FROM employees;
John Doe     {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith   {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones   {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King    {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Finally, the address is a STRUCT, which is also written using the JSON map format:
hive> SELECT name, address FROM employees;
John Doe     {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith   {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones   {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Bill King    {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}
Next, let's see how to reference elements of collections. First, ARRAY indexing is 0-based, as in Java. Here is a query that selects the first element of the subordinates array:
hive> SELECT name, subordinates[0] FROM employees;
John Doe     Mary Smith
Mary Smith   Bill King
Todd Jones   NULL
Bill King    NULL
Note that referencing a nonexistent element returns NULL. Also, the extracted STRING values are no longer quoted!
To reference a MAP element, you also use ARRAY[...] syntax, but with key values instead of integer indices:
hive> SELECT name, deductions["State Taxes"] FROM employees;
John Doe     0.05
Mary Smith   0.05
Todd Jones   0.03
Bill King    0.03
Finally, to reference an element in a STRUCT, you use "dot" notation, similar to the table_alias.column syntax mentioned above:
hive> SELECT name, address.city FROM employees;
John Doe     Chicago
Mary Smith   Chicago
Todd Jones   Oak Park
Bill King    Obscuria
These same referencing techniques are also used in WHERE clauses, which we discuss in WHERE Clauses.
We can even use regular expressions to select the columns we want. The following query selects the symbol column and all columns from stocks whose names start with the prefix price:[17]
hive> SELECT symbol, `price.*` FROM stocks;
AAPL   195.69   197.88   194.0    194.12   194.12
AAPL   192.63   196.0    190.85   195.46   195.46
AAPL   196.73   198.37   191.57   192.05   192.05
AAPL   195.17   200.2    194.42   199.23   199.23
AAPL   195.91   196.32   193.38   195.86   195.86
...
We’ll talk more about Hive’s use of regular expressions in the section LIKE and RLIKE.
Not only can you select columns in a table, but you can manipulate column values using function calls and arithmetic expressions.
For example, let's select the employees' names converted to uppercase, their salaries, federal taxes percentage, and the value that results if we subtract the federal taxes portion from their salaries and round to the nearest integer. We could call a built-in function map_values to extract all the values from the deductions map and then add them up with the built-in sum function.
The following query is long enough that we'll split it over two lines. Note the secondary prompt that Hive uses, an indented greater-than sign (>):
hive> SELECT upper(name), salary, deductions["Federal Taxes"],
    > round(salary * (1 - deductions["Federal Taxes"])) FROM employees;
JOHN DOE     100000.0   0.2    80000
MARY SMITH   80000.0    0.2    64000
TODD JONES   70000.0    0.15   59500
BILL KING    60000.0    0.15   51000
Let's discuss arithmetic operators first and then turn to the use of functions in expressions.
All the typical arithmetic operators are supported. Table 6-1 describes the specific details.
Table 6-1. Arithmetic operators
Operator | Types | Description |
---|---|---|
A + B | Numbers | Add A and B |
A - B | Numbers | Subtract B from A |
A * B | Numbers | Multiply A and B |
A / B | Numbers | Divide A by B |
A % B | Numbers | The remainder of dividing A by B |
A & B | Numbers | Bitwise AND of A and B |
A \| B | Numbers | Bitwise OR of A and B |
A ^ B | Numbers | Bitwise XOR of A and B |
~A | Numbers | Bitwise NOT of A |
Arithmetic operators take any numeric type. No type coercion is performed if the two operands are of the same numeric type. Otherwise, if the types differ, then the value of the smaller of the two types is promoted to the wider type of the other value. (Wider in the sense that a type with more bytes can hold a wider range of values.) For example, for INT and BIGINT operands, the INT is promoted to BIGINT. For INT and FLOAT operands, the INT is promoted to FLOAT. Note that our query contained (1 - deductions[…]). Since the deductions are FLOATs, the 1 was promoted to FLOAT.
You have to be careful about data overflow or underflow when doing arithmetic. Hive follows the rules for the underlying Java types: no attempt is made to automatically convert a result to a wider type, even if one exists, when overflow or underflow would occur. Multiplication and division are the operations most likely to trigger this problem.
It pays to be aware of the ranges of your numeric data values, whether those values approach the upper or lower limits of the types used in the corresponding schema, and what kinds of calculations people might do with the data.
If you are concerned about overflow or underflow, consider using wider types in the schema. The drawback is the extra memory each data value will occupy.
You can also convert values to wider types in specific expressions, called casting. See Table 6-2 below and Casting for details.
Finally, it is sometimes useful to scale data values, such as dividing by powers of 10, using log values, and so on. Scaling can also improve the accuracy and numerical stability of algorithms used in certain machine learning calculations, for example.
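For instance, a hedged sketch of such scaling against the stocks table might look like the following; log10() is a standard Hive built-in, but the choice of columns and the one-million divisor are only illustrative assumptions:
hive> SELECT symbol, volume / 1000000.0, log10(price_close)
    > FROM stocks;   -- volume rescaled to millions of shares; log of the closing price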
Our tax-deduction example also uses a built-in mathematical function, round(), for finding the nearest integer for a DOUBLE value.
Table 6-2 describes the built-in mathematical functions, as of Hive v0.8.0, for working with single columns of data.
Table 6-2. Mathematical functions
Note the functions floor, round, and ceil ("ceiling") for converting DOUBLE to BIGINT, that is, for converting floating-point numbers to integers. These functions are the preferred technique, rather than using the cast operator we mentioned above.
Also, there are functions for converting integers to strings in different bases (e.g., hexadecimal).
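For instance, hex() and conv() are two such built-ins. This minimal sketch only shows their call shape; the literal values and the use of the employees table as a row source are arbitrary:
hive> SELECT hex(255), conv(255, 10, 16), conv('FF', 16, 10) FROM employees LIMIT 1;
FF    FF    255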
A special kind of function is the aggregate function that returns a single value resulting from some computation over many rows. More precisely, this is the User Defined Aggregate Function, as we'll see in Aggregate Functions. Perhaps the two best known examples are count, which counts the number of rows (or values for a specific column), and avg, which returns the average value of the specified column values.
Here is a query that counts the number of our example employees and averages their salaries:
hive> SELECT count(*), avg(salary) FROM employees;
4   77500.0
We'll see other examples when we discuss GROUP BY in the section GROUP BY Clauses.
Table 6-3 lists Hive’s built-in aggregate functions.
Table 6-3. Aggregate functions
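To give a flavor of a few more of them, this hedged sketch applies sum(), min(), and max() (all standard Hive aggregates) to the salary column; the output assumes the four-employee sample data shown earlier:
hive> SELECT sum(salary), min(salary), max(salary) FROM employees;
310000.0   60000.0   100000.0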
You can usually improve the performance of aggregation by setting the property hive.map.aggr to true, as shown here:
hive> SET hive.map.aggr=true;
hive> SELECT count(*), avg(salary) FROM employees;
This setting will attempt to do "top-level" aggregation in the map phase, as in this example. (An aggregation that isn't top-level would be aggregation after performing a GROUP BY.) However, this setting will require more memory.
As Table 6-3 shows, several functions accept DISTINCT … expressions. For example, we could count the unique stock symbols this way:
hive> SELECT count(DISTINCT symbol) FROM stocks;
0
Wait, zero?? There is a bug when trying to use count(DISTINCT col) when col is a partition column. The answer should be 743 for NASDAQ and NYSE, at least as of early 2010 in the infochimps.org data set we used.
Note that the Hive wiki currently claims that you can't use more than one function(DISTINCT …) expression in a query. For example, the following is supposed to be disallowed, but it actually works:
hive> SELECT count(DISTINCT ymd), count(DISTINCT volume) FROM stocks;
12110   26144
So, there are 12,110 trading days of data, over 40 years worth.
The “inverse” of aggregate functions are so-called table generating functions, which take single columns and expand them to multiple columns or rows. We will discuss them extensively in Table Generating Functions, but to complete the contents of this section, we will discuss them briefly now and list the few built-in table generating functions available in Hive.
To explain by way of an example, the following query converts the subordinates array in each employees record into zero or more new records. If an employee record has an empty subordinates array, then no new records are generated. Otherwise, one new record per subordinate is generated:
hive> SELECT explode(subordinates) AS sub FROM employees;
Mary Smith
Todd Jones
Bill King
We used a column alias, sub, defined using the AS sub clause. When using table generating functions, column aliases are required by Hive.
There are many other particular details that you must understand to use these functions correctly. We'll wait until Table Generating Functions to discuss the details.
Table 6-4 lists the built-in table generating functions.
Table 6-4. Table generating functions
Here is an example that uses parse_url_tuple, where we assume a url_table exists that contains a column of URLs called url:
SELECT parse_url_tuple(url, 'HOST', 'PATH', 'QUERY') as (host, path, query)
FROM url_table;
Compare parse_url_tuple with parse_url in Table 6-5 below.
Table 6-5 describes the rest of the built-in functions for working with strings, maps, arrays, JSON, and timestamps, with or without the recently introduced TIMESTAMP type (see Primitive Data Types).
Table 6-5. Other built-in functions
Return type | Signature | Description |
---|---|---|
| | |
| | |
| | Return a reverse copy of the string. |
| | Return the string resulting from |
|
| Like |
| | Return the substring of |
| | Return the substring of |
| | Return the string that results from converting
all characters of |
| | A synonym for |
| | Return the string that results from converting
all characters of |
| | A synonym for |
| | Return the string that results from removing
whitespace from both ends of |
| | Return the string resulting from trimming spaces
from the beginning (lefthand side) of |
| | Return the string resulting from trimming spaces
from the end (righthand side) of |
| | Return the string resulting from replacing all
substrings in |
| | Returns the substring for the index’s match using
the |
| | Extracts the specified part from a URL. It takes
a URL and the |
| | |
| | Return the number of elements in the
|
value of type | | Convert (“cast”) the result of the expression
|
| | Convert the number of seconds from the Unix epoch
(1970-01-01 00:00:00 UTC) to a string representing the
timestamp of that moment in the current system time zone in
the format of |
| | Return the date part of a timestamp string, e.g.,
|
| | Return the year part as an |
| | Return the month part as an |
| | Return the day part as an |
| | Extract the JSON object from a JSON string based
on the given JSON path, and return the JSON string of the
extracted object.
|
|
| Returns |
| | Repeats |
| | Returns the integer value for the first ASCII
character in the string |
| | Returns |
| | Returns |
| | Returns an array of substrings of |
| | Returns the index of the comma-separated string
where |
| | Returns the index of |
| | Returns the index of |
| | Creates a |
| | Splits |
| | Estimates the top-K n-grams in the text. |
| | Like |
BOOLEAN | | |
[a] See http://docs.oracle.com/javase/tutorial/essential/regex/ for more on Java regular expression syntax. |
Note that the time-related functions (near the end of the table) take integer or string arguments. As of Hive v0.8.0, these functions also take TIMESTAMP arguments, but they will continue to take integer or string arguments for backwards compatibility.
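As a hedged sketch of the string-argument form, the following applies year(), month(), and day() to the ymd column of the stocks table; the particular row shown is only an assumption:
hive> SELECT ymd, year(ymd), month(ymd), day(ymd) FROM stocks LIMIT 1;
2010-01-04   2010   1   4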
The results of a typical query can return a large number of rows. The LIMIT clause puts an upper limit on the number of rows returned:
hive> SELECT upper(name), salary, deductions["Federal Taxes"],
    > round(salary * (1 - deductions["Federal Taxes"])) FROM employees
    > LIMIT 2;
JOHN DOE     100000.0   0.2   80000
MARY SMITH   80000.0    0.2   64000
You can think of the previous example query as returning a new relation with new columns, some of which are anonymous results of manipulating columns in employees. It's sometimes useful to give those anonymous columns a name, called a column alias. Here is the previous query with column aliases for the third and fourth columns returned by the query, fed_taxes and salary_minus_fed_taxes, respectively:
hive> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
    > round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
    > FROM employees LIMIT 2;
JOHN DOE     100000.0   0.2   80000
MARY SMITH   80000.0    0.2   64000
The column alias feature is especially useful in nested select statements. Let’s use the previous example as a nested query:
hive> FROM (
    >   SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
    >   round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
    >   FROM employees
    > ) e
    > SELECT e.name, e.salary_minus_fed_taxes
    > WHERE e.salary_minus_fed_taxes > 70000;
JOHN DOE     100000.0   0.2   80000
The previous result set is aliased as e, from which we perform a second query to select the name and the salary_minus_fed_taxes, where the latter is greater than 70,000. (We'll cover WHERE clauses in WHERE Clauses below.)
The CASE … WHEN … THEN clauses are like if statements for individual columns in query results. For example:
hive> SELECT name, salary,
    > CASE
    >   WHEN salary < 50000.0 THEN 'low'
    >   WHEN salary >= 50000.0 AND salary < 70000.0  THEN 'middle'
    >   WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
    >   ELSE 'very high'
    > END AS bracket FROM employees;
John Doe           100000.0   very high
Mary Smith         80000.0    high
Todd Jones         70000.0    high
Bill King          60000.0    middle
Boss Man           200000.0   very high
Fred Finance       150000.0   very high
Stacy Accountant   60000.0    middle
...
If you have been running the queries in this book so far, you have probably noticed that a MapReduce job is started in most cases. Hive implements some kinds of queries without using MapReduce, in so-called local mode, for example:
SELECT * FROM employees;
In this case, Hive can simply read the records from employees and dump the formatted output to the console.
This even works for WHERE clauses that only filter on partition keys, with or without LIMIT clauses:
SELECT * FROM employees
WHERE country = 'US' AND state = 'CA'
LIMIT 100;
Furthermore, Hive will attempt to run other operations in local mode if the hive.exec.mode.local.auto property is set to true:
set hive.exec.mode.local.auto=true;
Otherwise, Hive uses MapReduce to run all other queries.
Trust us, you want to add set hive.exec.mode.local.auto=true; to your $HOME/.hiverc file.
While SELECT clauses select columns, WHERE clauses are filters; they select which records to return. As with SELECT clauses, we have already used many simple examples of WHERE clauses before defining the clause, on the assumption that you have seen them before. Now we'll explore them in a bit more detail.
WHERE clauses use predicate expressions, applying predicate operators, which we'll describe in a moment, to columns. Several predicate expressions can be joined with AND and OR clauses. When the predicate expressions evaluate to true, the corresponding rows are retained in the output.
We just used the following example that restricts the results to employees in the state of California:
SELECT * FROM employees
WHERE country = 'US' AND state = 'CA';
The predicates can reference the same variety of computations over column values that can be used in SELECT clauses. Here we adapt our previously used query involving Federal Taxes, filtering for those rows where the salary minus the federal taxes is greater than 70,000:
hive> SELECT name, salary, deductions["Federal Taxes"],
    > salary * (1 - deductions["Federal Taxes"])
    > FROM employees
    > WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;
John Doe     100000.0   0.2   80000.0
This query is a bit ugly, because the complex expression on the second line is duplicated in the WHERE clause. The following variation eliminates the duplication, using a column alias, but unfortunately it's not valid:
hive> SELECT name, salary, deductions["Federal Taxes"],
    > salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
    > FROM employees
    > WHERE round(salary_minus_fed_taxes) > 70000;
FAILED: Error in semantic analysis: Line 4:13 Invalid table alias or
column reference 'salary_minus_fed_taxes': (possible column names are:
name, salary, subordinates, deductions, address)
As the error message says, we can't reference column aliases in the WHERE clause. However, we can use a nested SELECT statement:
hive> SELECT e.* FROM
    > (SELECT name, salary, deductions["Federal Taxes"] as ded,
    > salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
    > FROM employees) e
    > WHERE round(e.salary_minus_fed_taxes) > 70000;
John Doe       100000.0   0.2   80000.0
Boss Man       200000.0   0.3   140000.0
Fred Finance   150000.0   0.3   105000.0
Table 6-6 describes the predicate operators, which are also used in JOIN … ON and HAVING clauses.
Table 6-6. Predicate operators
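Since the operators themselves are the familiar comparison predicates (=, <>, <, <=, >, >=, IS NULL, IS NOT NULL, and so on), here is a small hedged sketch that exercises a few of them against the employees table; the particular thresholds are arbitrary:
hive> SELECT name, salary FROM employees
    > WHERE salary >= 60000.0 AND salary <> 70000.0
    > AND address.city IS NOT NULL;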
We'll discuss LIKE and RLIKE in detail below (LIKE and RLIKE). First, let's point out an issue with comparing floating-point numbers that you should understand.
A common gotcha arises when you compare floating-point numbers of different types (i.e., FLOAT versus DOUBLE). Consider the following query of the employees table, which is designed to return the employee's name, salary, and federal taxes deduction, but only if that tax deduction exceeds 0.2 (20%) of his or her salary:
hive> SELECT name, salary, deductions['Federal Taxes']
    > FROM employees WHERE deductions['Federal Taxes'] > 0.2;
John Doe       100000.0   0.2
Mary Smith     80000.0    0.2
Boss Man       200000.0   0.3
Fred Finance   150000.0   0.3
Wait! Why are records with deductions['Federal Taxes'] = 0.2 being returned?
Is it a Hive bug? There is a bug filed against Hive for this issue, but it actually reflects the behavior of the internal representation of floating-point numbers when they are compared and it affects almost all software written in most languages on all modern digital computers (see https://issues.apache.org/jira/browse/HIVE-2586).
When you write a floating-point literal value like 0.2, Hive uses a DOUBLE to hold the value. We defined the deductions map values to be FLOAT, which means that Hive will implicitly convert the tax deduction value to DOUBLE to do the comparison. This should work, right?
Actually, it doesn't work. Here's why. The number 0.2 can't be represented exactly in a FLOAT or DOUBLE. (See http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html for an in-depth discussion of floating-point number issues.) In this particular case, the closest exact value is just slightly greater than 0.2, with a few nonzero bits at the least significant end of the number.
To simplify things a bit, let's say that 0.2 is actually 0.2000001 for FLOAT and 0.200000000001 for DOUBLE, because an 8-byte DOUBLE has more significant digits (after the decimal point). When the FLOAT value from the table is converted to DOUBLE by Hive, it produces the DOUBLE value 0.200000100000, which is greater than 0.200000000001. That's why the query results appear to use >= not >!
This issue is not unique to Hive nor Java, in which Hive is implemented. Rather, it’s a general problem for all systems that use the IEEE standard for encoding floating-point numbers!
However, there are two workarounds we can use in Hive.
First, if we read the data from a TEXTFILE (see Chapter 15), which is what we have been assuming so far, then Hive reads the string "0.2" from the data file and converts it to a real number. We could use DOUBLE instead of FLOAT in our schema. Then we would be comparing a DOUBLE for the deductions['Federal Taxes'] with a DOUBLE for the literal 0.2. However, this change will increase the memory footprint of our queries. Also, we can't simply change the schema like this if the data file is a binary file format like SEQUENCEFILE (discussed in Chapter 15).
The second workaround is to explicitly cast the 0.2 literal value to FLOAT. Java has a nice way of doing this: you append the letter f or F to the end of the number (e.g., 0.2f). Unfortunately, Hive doesn't support this syntax; we have to use the cast operator.
Here is a modified query that casts the 0.2 literal value to FLOAT. With this change, the expected results are returned by the query:
hive> SELECT name, salary, deductions['Federal Taxes'] FROM employees
    > WHERE deductions['Federal Taxes'] > cast(0.2 AS FLOAT);
Boss Man       200000.0   0.3
Fred Finance   150000.0   0.3
Note the syntax inside the cast operator: number AS FLOAT.
Actually, there is also a third solution: avoid floating-point numbers for anything involving money.
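One hedged way to do that is to store monetary and percentage values as integers; the employees_cents table and its columns in this sketch are purely hypothetical, not part of the schema used in this chapter:
-- Hypothetical schema: salary_cents INT, deductions_pct MAP<STRING, INT> (whole percent)
hive> SELECT name, salary_cents FROM employees_cents
    > WHERE deductions_pct['Federal Taxes'] > 20;   -- exact integer comparison, no rounding surprises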
Use extreme caution when comparing floating-point numbers. Avoid all implicit casts from smaller to wider types.
Table 6-6 describes the LIKE and RLIKE predicate operators. You have probably seen LIKE before, a standard SQL operator. It lets us match on strings that begin with or end with a particular substring, or when the substring appears anywhere within the string.
For example, the following three queries select the employee names and addresses where the street ends with Ave., the city begins with O, and the street contains Chicago:
hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Ave.';
John Doe     1 Michigan Ave.
Todd Jones   200 Chicago Ave.

hive> SELECT name, address.city FROM employees WHERE address.city LIKE 'O%';
Todd Jones   Oak Park
Bill King    Obscuria

hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Chi%';
Todd Jones   200 Chicago Ave.
A Hive extension is the RLIKE clause, which lets us use Java regular expressions, a more powerful minilanguage for specifying matches. The rich details of regular expression syntax and features are beyond the scope of this book. The entry for RLIKE in Table 6-6 provides links to resources with more details on regular expressions. Here, we demonstrate their use with an example, which finds all the employees whose street contains the word Chicago or Ontario:
hive> SELECT name, address.street
    > FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';
Mary Smith   100 Ontario St.
Todd Jones   200 Chicago Ave.
The string after the RLIKE keyword has the following interpretation. A period (.) matches any character and a star (*) means repeat the "thing to the left" (period, in the two cases shown) zero to many times. The expression (x|y) means match either x or y.
Hence, there might be no characters before "Chicago" or "Ontario" and there might be no characters after them. Of course, we could have written this particular example with two LIKE clauses:
SELECT name, address FROM employees
WHERE address.street LIKE '%Chicago%' OR address.street LIKE '%Ontario%';
General regular expression matches will let us express much richer matching criteria that would become very unwieldy with joined LIKE clauses such as these.
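As a hedged illustration of that richer matching, the following pattern accepts streets that begin with a house number and end in an abbreviated "Ave." or "St."; the regular expression is only a sketch, and note that the backslash must be doubled inside a Hive string literal:
hive> SELECT name, address.street
    > FROM employees
    > WHERE address.street RLIKE '^[0-9]+ .*(Ave|St)\\.$';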
For more details about regular expressions as implemented by Hive using Java, see the documentation for the Java regular expression syntax at http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html or see Regular Expression Pocket Reference by Tony Stubblebine (O’Reilly), Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan (O’Reilly), or Mastering Regular Expressions, 3rd Edition, by Jeffrey E.F. Friedl (O’Reilly).
The GROUP BY statement is often used in conjunction with aggregate functions to group the result set by one or more columns and then perform an aggregation over each group.
Let's return to the stocks table we defined in External Tables. The following query groups stock records for Apple by year, then averages the closing price for each year:
hive> SELECT year(ymd), avg(price_close) FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd);
1984   25.578625440597534
1985   20.193676221040867
1986   32.46102808021274
1987   53.88968399108163
1988   41.540079275138766
1989   41.65976212516664
1990   37.56268799823263
1991   52.49553383386182
1992   54.80338610251119
1993   41.02671956450572
1994   34.0813495847914
...
The HAVING clause lets you constrain the groups produced by GROUP BY in a way that could be expressed with a subquery, using a syntax that's easier to express. Here's the previous query with an additional HAVING clause that limits the results to years where the average closing price was greater than $50.0:
hive> SELECT year(ymd), avg(price_close) FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd)
    > HAVING avg(price_close) > 50.0;
1987   53.88968399108163
1991   52.49553383386182
1992   54.80338610251119
1999   57.77071460844979
2000   71.74892876261757
2005   52.401745992993554
...
Without the HAVING clause, this query would require a nested SELECT statement:
hive> SELECT s2.year, s2.avg FROM
    > (SELECT year(ymd) AS year, avg(price_close) AS avg FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd)) s2
    > WHERE s2.avg > 50.0;
1987   53.88968399108163
...
Hive supports the classic SQL JOIN statement, but only equi-joins are supported.
In an inner JOIN, records are discarded unless the join criteria find matching records in every table being joined. For example, the following query compares Apple (symbol AAPL) and IBM (symbol IBM). The stocks table is joined against itself, a self-join, where the dates, ymd (year-month-day) values, must be equal in both tables. We say that the ymd columns are the join keys in this query:
hive> SELECT a.ymd, a.price_close, b.price_close
    > FROM stocks a JOIN stocks b ON a.ymd = b.ymd
    > WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';
2010-01-04   214.01   132.45
2010-01-05   214.38   130.85
2010-01-06   210.97   130.0
2010-01-07   210.58   129.55
2010-01-08   211.98   130.85
2010-01-11   210.11   129.48
...
The ON clause specifies the conditions for joining records between the two tables. The WHERE clause limits the lefthand table to AAPL records and the righthand table to IBM records. You can also see that using table aliases for the two occurrences of stocks is essential in this query.
As you may know, IBM is an older company than Apple. It has been a publicly traded stock for much longer than Apple. However, since this is an inner JOIN, no IBM records will be returned older than September 7, 1984, which was the first day that Apple was publicly traded!
Standard SQL allows a non-equi-join on the join keys, such as the following example that shows Apple versus IBM, but with all older records for Apple paired up with each day of IBM data. It would be a lot of data (Example 6-1)!
Example 6-1. Query that will not work in Hive
SELECT a.ymd, a.price_close, b.price_close
FROM stocks a JOIN stocks b
ON a.ymd <= b.ymd
WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';
This is not valid in Hive, primarily because it is difficult to implement these kinds of joins in MapReduce. It turns out that Pig offers a cross product feature that makes it possible to implement this join, even though Pig’s native join feature doesn’t support it, either.
Also, Hive does not currently support using OR between predicates in ON clauses.
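For example, an ON clause like the one in this hedged sketch would be rejected; a typical workaround is to rewrite the query, for instance as a UNION ALL of two separate joins:
SELECT s.ymd, s.symbol, d.dividend
FROM stocks s JOIN dividends d
ON s.ymd = d.ymd OR s.symbol = d.symbol;   -- OR between ON predicates is not supported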
To see a nonself join, let's introduce the corresponding dividends data, also available from infochimps.org, as described in External Tables:
CREATE EXTERNAL TABLE IF NOT EXISTS dividends (
  ymd       STRING,
  dividend  FLOAT
)
PARTITIONED BY (exchange STRING, symbol STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Here is an inner JOIN between stocks and dividends for Apple, where we use the ymd and symbol columns as join keys:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL';
1987-05-11   AAPL   77.0    0.015
1987-08-10   AAPL   48.25   0.015
1987-11-17   AAPL   35.0    0.02
...
1995-02-13   AAPL   43.75   0.03
1995-05-26   AAPL   42.69   0.03
1995-08-16   AAPL   44.5    0.03
1995-11-21   AAPL   38.63   0.03
Yes, Apple paid a dividend years ago and only recently announced it would start doing so again! Note that because we have an inner JOIN, we only see records approximately every three months, the typical schedule of dividend payments, which are announced when reporting quarterly results.
You can join more than two tables together. Let’s compare Apple, IBM, and GE side by side:
hive> SELECT a.ymd, a.price_close, b.price_close, c.price_close
    > FROM stocks a JOIN stocks b ON a.ymd = b.ymd
    >               JOIN stocks c ON a.ymd = c.ymd
    > WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
2010-01-04   214.01   132.45   15.45
2010-01-05   214.38   130.85   15.53
2010-01-06   210.97   130.0    15.45
2010-01-07   210.58   129.55   16.25
2010-01-08   211.98   130.85   16.6
2010-01-11   210.11   129.48   16.76
...
Most of the time, Hive will use a separate MapReduce job for each pair of things to join. In this example, it would use one job for tables a and b, then a second job to join the output of the first join with c.
Why not join b and c first? Hive goes from left to right.
However, this example actually benefits from an optimization we’ll discuss next.
In the previous example, every ON clause uses a.ymd as one of the join keys. In this case, Hive can apply an optimization where it joins all three tables in a single MapReduce job. The optimization would also be used if b.ymd were used in both ON clauses.
When joining three or more tables, if every ON clause uses the same join key, a single MapReduce job will be used.
Hive also assumes that the last table in the query is the largest. It attempts to buffer the other tables and then stream the last table through, while performing joins on individual records. Therefore, you should structure your join queries so the largest table is last.
Recall our previous join between stocks and dividends. We actually made the mistake of using the smaller dividends table last:
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
We should switch the positions of stocks and dividends:
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM dividends d JOIN stocks s ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
It turns out that these data sets are too small to see a noticeable performance difference, but for larger data sets, you’ll want to exploit this optimization.
Fortunately, you don’t have to put the largest table last in the query. Hive also provides a “hint” mechanism to tell the query optimizer which table should be streamed:
SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
Now Hive will attempt to stream the stocks table, even though it's not the last table in the query.
There is another important optimization called map-side joins that we’ll return to in Map-side Joins.
The left-outer join is indicated by adding the LEFT OUTER keywords:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL';
...
1987-05-01   AAPL   80.0    NULL
1987-05-04   AAPL   79.75   NULL
1987-05-05   AAPL   80.25   NULL
1987-05-06   AAPL   80.0    NULL
1987-05-07   AAPL   80.25   NULL
1987-05-08   AAPL   79.0    NULL
1987-05-11   AAPL   77.0    0.015
1987-05-12   AAPL   75.5    NULL
1987-05-13   AAPL   78.5    NULL
1987-05-14   AAPL   79.25   NULL
1987-05-15   AAPL   78.25   NULL
1987-05-18   AAPL   75.75   NULL
1987-05-19   AAPL   73.25   NULL
1987-05-20   AAPL   74.5    NULL
...
In this join, all the records from the lefthand table that match the WHERE clause are returned. If the righthand table doesn't have a record that matches the ON criteria, NULL is used for each column selected from the righthand table.
Hence, in this result set, we see that every Apple stock record is returned and the d.dividend value is usually NULL, except on days when a dividend was paid (May 11th, 1987, in this output).
Before we discuss the other outer joins, let’s discuss a gotcha you should understand.
Recall what we said previously about speeding up queries by adding partition filters in the WHERE clause. To speed up our previous query, we might choose to add predicates that select on the exchange in both tables:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL'
    > AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
1987-05-11   AAPL   77.0    0.015
1987-08-10   AAPL   48.25   0.015
1987-11-17   AAPL   35.0    0.02
1988-02-12   AAPL   41.0    0.02
1988-05-16   AAPL   41.25   0.02
...
However, the output has changed, even though we thought we were just adding an optimization! We're back to having approximately four stock records per year and we have non-NULL entries for all the dividend values. In other words, we are back to the original inner join!
This is actually common behavior for all outer joins in most SQL implementations. It occurs because the JOIN clause is evaluated first, then the results are passed through the WHERE clause. By the time the WHERE clause is reached, d.exchange is NULL most of the time, so the "optimization" actually filters out all records except those on the day of dividend payments.
One solution is straightforward: remove the clauses in the WHERE clause that reference the dividends table:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL' AND s.exchange = 'NASDAQ';
...
1987-05-07   AAPL   80.25   NULL
1987-05-08   AAPL   79.0    NULL
1987-05-11   AAPL   77.0    0.015
1987-05-12   AAPL   75.5    NULL
1987-05-13   AAPL   78.5    NULL
...
This isn't very satisfactory. You might wonder if you can move the predicates from the WHERE clause into the ON clause, at least the partition filters. This does not work for outer joins, despite documentation on the Hive Wiki that claims it should work (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins).
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d
    > ON s.ymd = d.ymd AND s.symbol = d.symbol
    > AND s.symbol = 'AAPL' AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
...
1962-01-02   GE    74.75    NULL
1962-01-02   IBM   572.0    NULL
1962-01-03   GE    74.0     NULL
1962-01-03   IBM   577.0    NULL
1962-01-04   GE    73.12    NULL
1962-01-04   IBM   571.25   NULL
1962-01-05   GE    71.25    NULL
1962-01-05   IBM   560.0    NULL
...
The partition filters are ignored for OUTER JOINs. However, using such filter predicates in ON clauses for inner joins does work!
Fortunately, there is a solution that works for all joins: use nested SELECT statements:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM
    > (SELECT * FROM stocks WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') s
    > LEFT OUTER JOIN
    > (SELECT * FROM dividends WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') d
    > ON s.ymd = d.ymd;
...
1988-02-10   AAPL   41.0    NULL
1988-02-11   AAPL   40.63   NULL
1988-02-12   AAPL   41.0    0.02
1988-02-16   AAPL   41.25   NULL
1988-02-17   AAPL   41.88   NULL
...
The nested SELECT statement performs the required "push down" to apply the partition filters before data is joined.
Right-outer joins return all records in the righthand table that match the WHERE clause. NULL is used for fields of missing records in the lefthand table.
Here we switch the places of stocks and dividends and perform a righthand join, but leave the SELECT statement unchanged:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
    > WHERE s.symbol = 'AAPL';
...
1987-05-07   AAPL   80.25   NULL
1987-05-08   AAPL   79.0    NULL
1987-05-11   AAPL   77.0    0.015
1987-05-12   AAPL   75.5    NULL
1987-05-13   AAPL   78.5    NULL
...
Finally, a full-outer join returns all records from all tables that match the WHERE clause. NULL is used for fields in missing records in either table.
If we convert the previous query to a full-outer join, we’ll actually get the same results, since there is never a case where a dividend record exists without a matching stock record:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM dividends d FULL OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
    > WHERE s.symbol = 'AAPL';
...
1987-05-07   AAPL   80.25   NULL
1987-05-08   AAPL   79.0    NULL
1987-05-11   AAPL   77.0    0.015
1987-05-12   AAPL   75.5    NULL
1987-05-13   AAPL   78.5    NULL
...
A left semi-join returns records from the lefthand table if records are found in the righthand table that satisfy the ON predicates. It's a special, optimized case of the more general inner join. Most SQL dialects support an IN ... EXISTS construct to do the same thing. For instance, the following query in Example 6-2 attempts to return stock records only on the days of dividend payments, but it doesn't work in Hive.
Example 6-2. Query that will not work in Hive
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);
Instead, you use the following LEFT SEMI JOIN syntax:
hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;
...
1962-11-05   IBM   361.5
1962-08-07   IBM   373.25
1962-05-08   IBM   459.5
1962-02-06   IBM   551.5
Note that the SELECT and WHERE clauses can't reference columns from the righthand table.
Right semi-joins are not supported in Hive.
The reason semi-joins are more efficient than the more general inner join is as follows. For a given record in the lefthand table, Hive can stop looking for matching records in the righthand table as soon as any match is found. At that point, the selected columns from the lefthand table record can be projected.
A Cartesian product is a join where all the tuples in the left side of the join are paired with all the tuples of the right table. If the left table has 5 rows and the right table has 6 rows, 30 rows of output will be produced:
SELECT * FROM stocks JOIN dividends;
Using the table of stocks and dividends, it is hard to find a reason for a join of this type, as the dividend of one stock is not usually paired with another. Additionally, Cartesian products create a lot of data. Unlike other join types, Cartesian products are not executed in parallel, and they are not optimized in any way using MapReduce.
It is critical to point out that using the wrong join syntax will cause a long, slow-running Cartesian product query. For example, the following query will be optimized to an inner join in many databases, but not in Hive:
hive> SELECT * FROM stocks JOIN dividends
    > WHERE stocks.symbol = dividends.symbol and stocks.symbol = 'AAPL';
In Hive, this query computes the full Cartesian product before applying the WHERE clause. It could take a very long time to finish. When the property hive.mapred.mode is set to strict, Hive prevents users from inadvertently issuing a Cartesian product query. We'll discuss the features of strict mode more extensively in Chapter 10.
If all but one table is small, the largest table can be streamed through the mappers while the small tables are cached in memory. Hive can do all the joining map-side, since it can look up every possible match against the small tables in memory, thereby eliminating the reduce step required in the more common join scenarios. Even on smaller data sets, this optimization is noticeably faster than the normal join. Not only does it eliminate reduce steps, it sometimes reduces the number of map steps, too.
The joins between stocks and dividends can exploit this optimization, as the dividends data set is small enough to be cached.
Before Hive v0.7, it was necessary to add a hint to the query to enable this optimization. Returning to our inner join example:
SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
Running this query versus the original on a fast MacBook Pro laptop yielded times of approximately 23 seconds versus 33 seconds for the original unoptimized query, which is roughly 30% faster using our sample stock data.
The hint still works, but it’s now deprecated as of Hive v0.7.
However, you still have to set a property, hive.auto.convert.join, to true before Hive will attempt the optimization. It's false by default:
hive> set hive.auto.convert.join=true;
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL';
Note that you can also configure the threshold size for table files considered small enough to use this optimization. Here is the default definition of the property (in bytes):
hive.mapjoin.smalltable.filesize=25000000
If you always want Hive to attempt this optimization, set one or both of these properties in your $HOME/.hiverc file.
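For example, a $HOME/.hiverc that always enables the map-side join attempt, with the size threshold stated explicitly at its default value, might look like this sketch:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;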
Hive does not support the optimization for right- and full-outer joins.
This optimization can also be used for larger tables under certain conditions when the data for every table is bucketed, as discussed in Bucketing Table Data Storage. Briefly, the data must be bucketed on the keys used in the ON clause and the number of buckets for one table must be a multiple of the number of buckets for the other table. When these conditions are met, Hive can join individual buckets between tables in the map phase, because it does not need to fetch the entire contents of one table to match against each bucket in the other table.
However, this optimization is not turned on by default. It must be enabled by setting the property hive.optimize.bucketmapjoin:
set hive.optimize.bucketmapjoin=true;
If the bucketed tables actually have the same number of buckets and the data is sorted by the join/bucket keys, then Hive can perform an even faster sort-merge join. Once again, properties must be set to enable the optimization:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
The ORDER BY clause is familiar from other SQL dialects. It performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.
Hive adds an alternative, SORT BY, that orders the data only within each reducer, thereby performing a local ordering, where each reducer's output will be sorted. Better performance is traded for total ordering.
In both cases, the syntax differs only by the use of the ORDER or SORT keyword. You can specify any columns you wish and specify whether the columns are ascending using the ASC keyword (the default) or descending using the DESC keyword.
Here is an example using ORDER BY:
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
ORDER BY s.ymd ASC, s.symbol DESC;
Here is the same example using SORT BY instead:
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
SORT BY s.ymd ASC, s.symbol DESC;
The two queries look almost identical, but if more than one reducer is invoked, the output will be sorted differently. While each reducer’s output files will be sorted, the data will probably overlap with the output of other reducers.
Because ORDER BY can result in excessively long run times, Hive will require a LIMIT clause with ORDER BY if the property hive.mapred.mode is set to strict. By default, it is set to nonstrict.
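Under that strict setting, a hedged sketch of a query Hive will accept looks like this; the same ORDER BY without the LIMIT would be rejected:
hive> set hive.mapred.mode=strict;
hive> SELECT s.ymd, s.symbol, s.price_close FROM stocks s
    > ORDER BY s.ymd ASC, s.symbol DESC
    > LIMIT 100;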
DISTRIBUTE BY controls how map output is divided among reducers. All data that flows through a MapReduce job is organized into key-value pairs. Hive must use this feature internally when it converts your queries to MapReduce jobs.
Usually, you won’t need to worry about this feature. The exceptions are queries that use the Streaming feature (see Chapter 14) and some stateful UDAFs (User-Defined Aggregate Functions; see Aggregate Functions). There is one other scenario where these clauses are useful.
By default, MapReduce computes a hash on the keys output by mappers and tries to evenly distribute the key-value pairs among the available reducers using the hash values. Unfortunately, this means that when we use SORT BY, the contents of one reducer's output will overlap significantly with the output of the other reducers, as far as sorted order is concerned, even though the data is sorted within each reducer's output.
Say we want the data for each stock symbol to be captured together. We can use DISTRIBUTE BY to ensure that the records for each stock symbol go to the same reducer, then use SORT BY to order the data the way we want. The following query demonstrates this technique:
hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s
    > DISTRIBUTE BY s.symbol
    > SORT BY s.symbol ASC, s.ymd ASC;
1984-09-07   AAPL   26.5
1984-09-10   AAPL   26.37
1984-09-11   AAPL   26.87
1984-09-12   AAPL   26.12
1984-09-13   AAPL   27.5
1984-09-14   AAPL   27.87
1984-09-17   AAPL   28.62
1984-09-18   AAPL   27.62
1984-09-19   AAPL   27.0
1984-09-20   AAPL   27.12
...
Of course, the ASC keywords could have been omitted as they are the defaults. The ASC keyword is placed here for reasons that will be described shortly.
DISTRIBUTE BY works similarly to GROUP BY in the sense that it controls how reducers receive rows for processing, while SORT BY controls the sorting of data inside the reducer.
Note that Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause.
In the previous example, the s.symbol column was used in the DISTRIBUTE BY clause, and the s.symbol and s.ymd columns in the SORT BY clause. Suppose that the same columns are used in both clauses and all columns are sorted in ascending order (the default). In this case, the CLUSTER BY clause is a shorthand way of expressing the same query.
For example, let's modify the previous query to drop sorting by s.ymd and use CLUSTER BY on s.symbol:
hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s
    > CLUSTER BY s.symbol;
2010-02-08   AAPL   194.12
2010-02-05   AAPL   195.46
2010-02-04   AAPL   192.05
2010-02-03   AAPL   199.23
2010-02-02   AAPL   195.86
2010-02-01   AAPL   194.73
2010-01-29   AAPL   192.06
2010-01-28   AAPL   199.29
2010-01-27   AAPL   207.88
...
Because the sort requirement on s.ymd is removed, the output reflects the original order of the stock data, which is sorted descending.
Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY clauses is a way to exploit the parallelism of SORT BY, yet achieve a total ordering across the output files.
We briefly mentioned in Primitive Data Types that Hive will perform some implicit conversions, called casts, of numeric data types as needed, for example when comparing two numbers of different types. This topic is discussed more fully in Predicate Operators and Gotchas with Floating-Point Comparisons.
Here we discuss the cast() function that allows you to explicitly convert a value of one type to another.
Recall that our employees table uses a FLOAT for the salary column. Now, imagine for a moment that STRING was used for that column instead. How could we work with the values as FLOATs?
The following example casts the values to FLOAT before performing a comparison:
SELECT name, salary FROM employees
WHERE cast(salary AS FLOAT) < 100000.0;
The syntax of the cast function is cast(value AS TYPE). What would happen in the example if a salary value was not a valid string for a floating-point number? In this case, Hive returns NULL.
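As a hedged illustration, casting a string that isn't a valid number yields NULL rather than an error; the literal and the use of employees as a row source are arbitrary:
hive> SELECT cast('not a number' AS FLOAT) FROM employees LIMIT 1;
NULL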
Note that the preferred way to convert floating-point numbers to integers is to use the round() or floor() functions listed in Table 6-2, rather than to use the cast operator.
The new BINARY type introduced in Hive v0.8.0 only supports casting BINARY to STRING. However, if you know the value is a number, you can nest cast() invocations, as in this example where column b is a BINARY column:
SELECT (2.0 * cast(cast(b as string) as double)) from src;
You can also cast STRING to BINARY.
For very large data sets, sometimes you want to work with a representative sample of a query result, not the whole thing. Hive supports this goal with queries that sample tables organized into buckets.
In the following example, assume the numbers table has one number column with values 1−10.
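If you want to follow along, a minimal sketch of creating and loading such a table is shown here; the numbers.txt file (the integers 1 through 10, one per line) and its local path are assumptions:
hive> CREATE TABLE numbers (number INT);
hive> LOAD DATA LOCAL INPATH '/path/to/numbers.txt' INTO TABLE numbers;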
We can sample using the rand() function, which returns a random number. In the first two queries, two distinct numbers are returned for each query. In the third query, no results are returned:
hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
2
4
hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
7
10
hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
If we bucket on a column instead of rand(), then identical results are returned on multiple runs:
hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
2
hive> SELECT * from numbers TABLESAMPLE(BUCKET 5 OUT OF 10 ON number) s;
4
hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
2
The denominator in the bucket clause represents the number of buckets into which data will be hashed. The numerator is the bucket number selected:
hive> SELECT * from numbers TABLESAMPLE(BUCKET 1 OUT OF 2 ON number) s;
2
4
6
8
10
hive> SELECT * from numbers TABLESAMPLE(BUCKET 2 OUT OF 2 ON number) s;
1
3
5
7
9
Hive offers another syntax for sampling a percentage of blocks of an input path as an alternative to sampling based on rows:
hive> SELECT * FROM numbersflat TABLESAMPLE(0.1 PERCENT) s;
This sampling is not known to work with all file formats. Also, the smallest unit of sampling is a single HDFS block. Hence, for tables less than the typical block size of 128 MB, all rows will be returned.
Percentage-based sampling offers a variable to control the seed information for block-based tuning. Different seeds produce different samples:
<property>
  <name>hive.sample.seednumber</name>
  <value>0</value>
  <description>A number used for percentage sampling. By changing this number,
  user will change the subsets of data sampled.</description>
</property>
From a first look at the TABLESAMPLE syntax, an astute user might come to the conclusion that the following query would be equivalent to the TABLESAMPLE operation:
hive> SELECT * FROM numbersflat WHERE number % 2 = 0;
2
4
6
8
10
It is true that for most table types, sampling scans through the entire table and selects every Nth row. However, if the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE queries only scan the required hash partitions of the table:
hive> CREATE TABLE numbers_bucketed (number int) CLUSTERED BY (number) INTO 3 BUCKETS;
hive> SET hive.enforce.bucketing=true;
hive> INSERT OVERWRITE TABLE numbers_bucketed SELECT number FROM numbers;
hive> dfs -ls /user/hive/warehouse/mydb.db/numbers_bucketed;
/user/hive/warehouse/mydb.db/numbers_bucketed/000000_0
/user/hive/warehouse/mydb.db/numbers_bucketed/000001_0
/user/hive/warehouse/mydb.db/numbers_bucketed/000002_0
hive> dfs -cat /user/hive/warehouse/mydb.db/numbers_bucketed/000001_0;
1
7
10
4
Because this table is clustered into three buckets, the following query can be used to sample only one of the buckets efficiently:
hive> SELECT * FROM numbers_bucketed TABLESAMPLE(BUCKET 2 OUT OF 3 ON NUMBER) s;
1
7
10
4
UNION ALL combines two or more tables. Each subquery of the union query must produce the same number of columns, and for each column, its type must match all the column types in the same position. For example, if the second column is a FLOAT, then the second column of all the other query results must be a FLOAT.
Here is an example that merges log data:
SELECT log.ymd, log.level, log.message
FROM (
  SELECT l1.ymd, l1.level, l1.message, 'Log1' AS source FROM log1 l1
  UNION ALL
  SELECT l2.ymd, l2.level, l2.message, 'Log2' AS source FROM log2 l2
) log
SORT BY log.ymd ASC;
UNION may be used when a clause selects from the same source table. Logically, the same results could be achieved with a single SELECT and WHERE clause. This technique increases readability by breaking up a long complex WHERE clause into two or more UNION queries. However, unless the source table is indexed, the query will have to make multiple passes over the same source data. For example:
FROM (
  FROM src SELECT src.key, src.value WHERE src.key < 100
  UNION ALL
  FROM src SELECT src.* WHERE src.key > 110
) unioninput
INSERT OVERWRITE DIRECTORY '/tmp/union.out' SELECT unioninput.*;
[17] At the time of this writing, the Hive Wiki shows an incorrect syntax for specifying columns using regular expressions.