Chapter 6. HiveQL: Queries

After learning the many ways we can define and format tables, let’s learn how to run queries. Of course, we have assumed all along that you have some prior knowledge of SQL. We’ve used some queries already to illustrate several concepts, such as loading query data into other tables in Chapter 5. Now we’ll fill in most of the details. Some special topics will be covered in subsequent chapters.

We’ll move quickly through details that are familiar to users with prior SQL experience and focus on what’s unique to HiveQL, including syntax and feature differences, as well as performance implications.

SELECT … FROM Clauses

SELECT is the projection operator in SQL. The FROM clause identifies from which table, view, or nested query we select records (see Chapter 7).

For a given record, SELECT specifies the columns to keep, as well as the outputs of function calls on one or more columns (e.g., the aggregation functions like count(*)).

Recall again our partitioned employees table:

CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

Let’s assume we have the same contents we showed in Text File Encoding of Data Values for four employees in the US state of Illinois (abbreviated IL). Here are queries of this table and the output they produce:

hive> SELECT name, salary FROM employees;
John Doe    100000.0
Mary Smith   80000.0
Todd Jones   70000.0
Bill King    60000.0

The following two queries are identical. The second version uses a table alias e, which is not very useful in this query, but becomes necessary in queries with JOINs (see JOIN Statements) where several different tables are used:

hive> SELECT   name,   salary FROM employees;
hive> SELECT e.name, e.salary FROM employees e;

When you select columns that are one of the collection types, Hive uses JSON (JavaScript Object Notation) syntax for the output. First, let’s select the subordinates, an ARRAY, where a comma-separated list surrounded with […] is used. Note that STRING elements of the collection are quoted, while the primitive STRING name column is not:

hive> SELECT name, subordinates FROM employees;
John Doe    ["Mary Smith","Todd Jones"]
Mary Smith  ["Bill King"]
Todd Jones  []
Bill King   []

The deductions column is a MAP, where the JSON representation for maps is used, namely a comma-separated list of key:value pairs, surrounded with {...}:

hive> SELECT name, deductions FROM employees;
John Doe    {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith  {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones  {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King   {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}

Finally, the address is a STRUCT, which is also written using the JSON map format:

hive> SELECT name, address FROM employees;
John Doe    {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith  {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones  {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Bill King   {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}

Next, let’s see how to reference elements of collections.

First, ARRAY indexing is 0-based, as in Java. Here is a query that selects the first element of the subordinates array:

hive> SELECT name, subordinates[0] FROM employees;
John Doe    Mary Smith
Mary Smith  Bill King
Todd Jones  NULL
Bill King   NULL

Note that referencing a nonexistent element returns NULL. Also, the extracted STRING values are no longer quoted!

To reference a MAP element, you also use ARRAY[...] syntax, but with key values instead of integer indices:

hive> SELECT name, deductions["State Taxes"] FROM employees;
John Doe    0.05
Mary Smith  0.05
Todd Jones  0.03
Bill King   0.03

Finally, to reference an element in a STRUCT, you use “dot” notation, similar to the table_alias.column mentioned above:

hive> SELECT name, address.city FROM employees;
John Doe    Chicago
Mary Smith  Chicago
Todd Jones  Oak Park
Bill King   Obscuria

These same referencing techniques are also used in WHERE clauses, which we discuss in WHERE Clauses.

Specify Columns with Regular Expressions

We can even use regular expressions to select the columns we want. The following query selects the symbol column and all columns from stocks whose names start with the prefix price:

hive> SELECT symbol, `price.*` FROM stocks;
AAPL    195.69  197.88  194.0   194.12  194.12
AAPL    192.63  196.0   190.85  195.46  195.46
AAPL    196.73  198.37  191.57  192.05  192.05
AAPL    195.17  200.2   194.42  199.23  199.23
AAPL    195.91  196.32  193.38  195.86  195.86
...

We’ll talk more about Hive’s use of regular expressions in the section LIKE and RLIKE.

Computing with Column Values

Not only can you select columns in a table, but you can manipulate column values using function calls and arithmetic expressions.

For example, let’s select the employees’ names converted to uppercase, their salaries, their federal taxes percentage, and the value that results if we subtract the federal taxes portion from their salaries and round to the nearest integer. (We could also call the built-in function map_values to extract all the values from the deductions map and then add them up with the built-in sum function, but we only need the Federal Taxes entry here.)

The following query is long enough that we’ll split it over two lines. Note the secondary prompt that Hive uses, an indented greater-than sign (>):

hive> SELECT upper(name), salary, deductions["Federal Taxes"],
    > round(salary * (1 - deductions["Federal Taxes"])) FROM employees;
JOHN DOE    100000.0  0.2   80000
MARY SMITH   80000.0  0.2   64000
TODD JONES   70000.0  0.15  59500
BILL KING    60000.0  0.15  51000

Let’s discuss arithmetic operators and then discuss the use of functions in expressions.

Arithmetic Operators

All the typical arithmetic operators are supported. Table 6-1 describes the specific details.

Table 6-1. Arithmetic operators

Operator  Types    Description

A + B     Numbers  Add A and B.
A - B     Numbers  Subtract B from A.
A * B     Numbers  Multiply A and B.
A / B     Numbers  Divide A by B. If the operands are integer types, the quotient of the division is returned.
A % B     Numbers  The remainder of dividing A by B.
A & B     Numbers  Bitwise AND of A and B.
A | B     Numbers  Bitwise OR of A and B.
A ^ B     Numbers  Bitwise XOR of A and B.
~A        Numbers  Bitwise NOT of A.

Arithmetic operators take any numeric type. No type coercion is performed if the two operands are of the same numeric type. Otherwise, if the types differ, then the value of the smaller of the two types is promoted to the wider type of the other value. (Wider in the sense that a type with more bytes can hold a wider range of values.) For example, for INT and BIGINT operands, the INT is promoted to BIGINT. For INT and FLOAT operands, the INT is promoted to FLOAT. Note that our query contained (1 - deductions[…]). Since the deductions are FLOATs, the 1 was promoted to FLOAT.
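
Here is a minimal sketch of these promotion rules in action, run against our employees table (the casts are only there to force specific operand types):

SELECT salary + 1,                          -- FLOAT + INT: the 1 is promoted to FLOAT
       cast(1 AS BIGINT) + cast(1 AS INT)   -- BIGINT + INT: the INT is promoted to BIGINT
FROM employees LIMIT 1;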

You have to be careful about data overflow or underflow when doing arithmetic. Hive follows the rules for the underlying Java types: when overflow or underflow occurs, no attempt is made to automatically convert the result to a wider type, even if one exists. Multiplication and division are most likely to trigger this problem.

It pays to be aware of the ranges of your numeric data values, whether or not those values approach the upper or lower range limits of the types you are using in the corresponding schema, and what kinds of calculations people might do with the data.

If you are concerned about overflow or underflow, consider using wider types in the schema. The drawback is the extra memory each data value will occupy.

You can also convert values to wider types in specific expressions, called casting. See Table 6-2 below and Casting for details.

Finally, it is sometimes useful to scale data values, such as dividing by powers of 10, using log values, and so on. Scaling can also improve the accuracy and numerical stability of algorithms used in certain machine learning calculations, for example.
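
For example, here is a sketch that scales the salary column two ways, dividing by a power of 10 and taking a base-10 logarithm (both functions appear in Table 6-2 below):

SELECT name,
       salary / 1000,   -- salary expressed in thousands
       log10(salary)    -- order of magnitude of the salary
FROM employees;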

Using Functions

Our tax-deduction example also uses a built-in mathematical function, round(), for finding the nearest integer for a DOUBLE value.

Mathematical functions

Table 6-2 describes the built-in mathematical functions, as of Hive v0.8.0, for working with single columns of data.

Table 6-2. Mathematical functions

Return type    Signature                      Description

BIGINT         round(d)                       Return the BIGINT for the rounded value of DOUBLE d.
DOUBLE         round(d, N)                    Return the DOUBLE for the value of d, a DOUBLE, rounded to N decimal places.
BIGINT         floor(d)                       Return the largest BIGINT that is <= d, a DOUBLE.
BIGINT         ceil(d), ceiling(d)            Return the smallest BIGINT that is >= d.
DOUBLE         rand(), rand(seed)             Return a pseudorandom DOUBLE that changes for each row. Passing in an integer seed makes the return value deterministic.
DOUBLE         exp(d)                         Return e raised to the power of d, a DOUBLE.
DOUBLE         ln(d)                          Return the natural logarithm of d, a DOUBLE.
DOUBLE         log10(d)                       Return the base-10 logarithm of d, a DOUBLE.
DOUBLE         log2(d)                        Return the base-2 logarithm of d, a DOUBLE.
DOUBLE         log(base, d)                   Return the base-base logarithm of d, where base and d are DOUBLEs.
DOUBLE         pow(d, p), power(d, p)         Return d raised to the power p, where d and p are DOUBLEs.
DOUBLE         sqrt(d)                        Return the square root of d, a DOUBLE.
STRING         bin(i)                         Return the STRING representing the binary value of i, a BIGINT.
STRING         hex(i)                         Return the STRING representing the hexadecimal value of i, a BIGINT.
STRING         hex(str)                       Return the STRING representing the hexadecimal value of str, where each character of str is converted to its two-digit hexadecimal representation.
STRING         unhex(str)                     The inverse of hex(str).
STRING         conv(i, from_base, to_base)    Return the STRING in base to_base, an INT, representing the value of i, a BIGINT, in base from_base, an INT.
STRING         conv(str, from_base, to_base)  Return the STRING in base to_base, an INT, representing the value of str, a STRING, in base from_base, an INT.
DOUBLE         abs(d)                         Return the DOUBLE that is the absolute value of d, a DOUBLE.
INT            pmod(i1, i2)                   Return the positive modulo INT for two INTs, i1 mod i2.
DOUBLE         pmod(d1, d2)                   Return the positive modulo DOUBLE for two DOUBLEs, d1 mod d2.
DOUBLE         sin(d)                         Return the DOUBLE that is the sine of d, a DOUBLE, in radians.
DOUBLE         asin(d)                        Return the DOUBLE that is the arcsine of d, a DOUBLE, in radians.
DOUBLE         cos(d)                         Return the DOUBLE that is the cosine of d, a DOUBLE, in radians.
DOUBLE         acos(d)                        Return the DOUBLE that is the arccosine of d, a DOUBLE, in radians.
DOUBLE         tan(d)                         Return the DOUBLE that is the tangent of d, a DOUBLE, in radians.
DOUBLE         atan(d)                        Return the DOUBLE that is the arctangent of d, a DOUBLE, in radians.
DOUBLE         degrees(d)                     Return the DOUBLE that is the value of d, a DOUBLE, converted from radians to degrees.
DOUBLE         radians(d)                     Return the DOUBLE that is the value of d, a DOUBLE, converted from degrees to radians.
INT            positive(i)                    Return the INT value of i (i.e., it’s effectively the expression +i).
DOUBLE         positive(d)                    Return the DOUBLE value of d (i.e., it’s effectively the expression +d).
INT            negative(i)                    Return the negative of the INT value of i (i.e., it’s effectively the expression -i).
DOUBLE         negative(d)                    Return the negative of the DOUBLE value of d (i.e., it’s effectively the expression -d).
FLOAT          sign(d)                        Return the FLOAT value 1.0 if d, a DOUBLE, is positive; return the FLOAT value -1.0 if d is negative; otherwise return 0.0.
DOUBLE         e()                            Return the DOUBLE that is the value of the constant e, 2.718281828459045.
DOUBLE         pi()                           Return the DOUBLE that is the value of the constant pi, 3.141592653589793.

Note the functions floor, round, and ceil (“ceiling”) for converting DOUBLE to BIGINT, that is, converting floating-point numbers to integers. These functions are the preferred technique, rather than using the cast operator we mentioned above.

Also, there are functions for converting integers to strings in different bases (e.g., hexadecimal).
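
For instance, a quick sketch of the base-conversion functions from Table 6-2 (the literal values are arbitrary):

SELECT hex(255),             -- 'FF'
       bin(8),               -- '1000'
       conv('FF', 16, 10)    -- '255'
FROM employees LIMIT 1;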

Aggregate functions

A special kind of function is the aggregate function that returns a single value resulting from some computation over many rows. More precisely, this is the User Defined Aggregate Function, as we’ll see in Aggregate Functions. Perhaps the two best known examples are count, which counts the number of rows (or values for a specific column), and avg, which returns the average value of the specified column values.

Here is a query that counts the number of our example employees and averages their salaries:

hive> SELECT count(*), avg(salary) FROM employees;
4  77500.0

We’ll see other examples when we discuss GROUP BY in the section GROUP BY Clauses.

Table 6-3 lists Hive’s built-in aggregate functions.

Table 6-3. Aggregate functions

Return type              Signature                                    Description

BIGINT                   count(*)                                     Return the total number of retrieved rows, including rows containing NULL values.
BIGINT                   count(expr)                                  Return the number of rows for which the supplied expression is not NULL.
BIGINT                   count(DISTINCT expr[, expr...])              Return the number of rows for which the supplied expression(s) are unique and not NULL.
DOUBLE                   sum(col)                                     Return the sum of the values.
DOUBLE                   sum(DISTINCT col)                            Return the sum of the distinct values.
DOUBLE                   avg(col)                                     Return the average of the values.
DOUBLE                   avg(DISTINCT col)                            Return the average of the distinct values.
DOUBLE                   min(col)                                     Return the minimum of the values.
DOUBLE                   max(col)                                     Return the maximum of the values.
DOUBLE                   variance(col), var_pop(col)                  Return the variance of the set of numbers in the column col.
DOUBLE                   var_samp(col)                                Return the sample variance of a set of numbers.
DOUBLE                   stddev_pop(col)                              Return the standard deviation of a set of numbers.
DOUBLE                   stddev_samp(col)                             Return the sample standard deviation of a set of numbers.
DOUBLE                   covar_pop(col1, col2)                        Return the covariance of a pair of numeric columns.
DOUBLE                   covar_samp(col1, col2)                       Return the sample covariance of a pair of numeric columns.
DOUBLE                   corr(col1, col2)                             Return the correlation of two sets of numbers.
DOUBLE                   percentile(int_expr, p)                      Return the percentile of int_expr at p (range: [0,1]), where p is a DOUBLE.
ARRAY<DOUBLE>            percentile(int_expr, [p1, ...])              Return the percentiles of int_expr at each p in the given DOUBLE array (range: [0,1]).
DOUBLE                   percentile_approx(int_expr, p, NB)           Return the approximate percentile of int_expr at p (range: [0,1]), where p is a DOUBLE and NB is the number of histogram bins used for the estimate (default: 10,000 if not specified).
ARRAY<DOUBLE>            percentile_approx(int_expr, [p1, ...], NB)   Return the approximate percentiles of int_expr at each p in the given DOUBLE array (range: [0,1]), where NB is the number of histogram bins used for the estimate (default: 10,000 if not specified).
ARRAY<STRUCT {'x','y'}>  histogram_numeric(col, NB)                   Return an array of NB histogram bins, where the x value is the center and the y value is the height of the bin.
ARRAY                    collect_set(col)                             Return a set of the elements of col with duplicates removed.

You can usually improve the performance of aggregation by setting the property hive.map.aggr to true, as shown here:

hive> SET hive.map.aggr=true;

hive> SELECT count(*), avg(salary) FROM employees;

This setting will attempt to do “top-level” aggregation in the map phase, as in this example. (An aggregation that isn’t top-level would be aggregation after performing a GROUP BY.) However, this setting will require more memory.
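
The same property also enables map-side partial aggregation for GROUP BY queries; here is a sketch against the stocks table:

hive> SET hive.map.aggr=true;
hive> SELECT symbol, count(*) FROM stocks GROUP BY symbol;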

As Table 6-3 shows, several functions accept DISTINCT … expressions. For example, we could count the unique stock symbols this way:

hive> SELECT count(DISTINCT symbol) FROM stocks;
0

Warning

Wait, zero?? There is a bug when trying to use count(DISTINCT col) when col is a partition column. The answer should be 743 for NASDAQ and NYSE, at least as of early 2010 in the infochimps.org data set we used.

Note that the Hive wiki currently claims that you can’t use more than one function(DISTINCT …) expression in a query. For example, the following is supposed to be disallowed, but it actually works:

hive> SELECT count(DISTINCT ymd), count(DISTINCT volume) FROM stocks;
12110   26144

So, there are 12,110 trading days of data, over 40 years worth.

Table generating functions

The “inverse” of aggregate functions are so-called table generating functions, which take single columns and expand them to multiple columns or rows. We will discuss them extensively in Table Generating Functions, but to complete the contents of this section, we will discuss them briefly now and list the few built-in table generating functions available in Hive.

To explain by way of an example, the following query converts the subordinates array in each employee record into zero or more new records. If an employee record has an empty subordinates array, then no new records are generated. Otherwise, one new record per subordinate is generated:

hive> SELECT explode(subordinates) AS sub FROM employees;
Mary Smith
Todd Jones
Bill King

We used a column alias, sub, defined using the AS sub clause. When using table generating functions, column aliases are required by Hive. There are many other particular details that you must understand to use these functions correctly. We’ll wait until Table Generating Functions to discuss the details.

Table 6-4 lists the built-in table generating functions.

Table 6-4. Table generating functions

Return type  Signature                                                    Description

N rows       explode(array)                                               Return 0 to many rows, one row for each element from the input array.
N rows       explode(map)                                                 (v0.8.0 and later) Return 0 to many rows, one row for each map key-value pair, with a field for each map key and a field for the map value.
tuple        json_tuple(jsonStr, p1, p2, ..., pn)                         Like get_json_object, but it takes multiple names and returns a tuple. All the input parameters and output column types are STRING.
tuple        parse_url_tuple(url, partname1, partname2, ..., partnameN)   Extract N parts (N >= 1) from a URL. It takes a URL and the partnames to extract, returning a tuple. All the input parameters and output column types are STRING. The valid partnames are case-sensitive and should only contain a minimum of white space: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY_NAME>.
N rows       stack(n, col1, ..., colM)                                    Convert M columns into n rows of size M/n each.

Here is an example that uses parse_url_tuple where we assume a url_table exists that contains a column of URLs called url:

SELECT parse_url_tuple(url, 'HOST', 'PATH', 'QUERY') as (host, path, query)
FROM url_table;

Compare parse_url_tuple with parse_url in Table 6-5 below.
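
For comparison, here is a sketch of parse_url itself, which extracts one part per call (again assuming the hypothetical url_table; the 'name' query parameter is invented for illustration):

SELECT parse_url(url, 'HOST'),
       parse_url(url, 'QUERY', 'name')   -- value of the 'name' query parameter
FROM url_table;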

Other built-in functions

Table 6-5 describes the rest of the built-in functions for working with strings, maps, arrays, JSON, and timestamps, with or without the recently introduced TIMESTAMP type (see Primitive Data Types).

Table 6-5. Other built-in functions

Return type           Signature                                      Description

BOOLEAN               test in(val1, val2, ...)                       Return true if test equals one of the values in the list.
INT                   length(s)                                      Return the length of the string.
STRING                reverse(s)                                     Return a reversed copy of the string.
STRING                concat(s1, s2, ...)                            Return the string resulting from s1 joined with s2, etc. For example, concat('ab', 'cd') results in 'abcd'. You can pass an arbitrary number of string arguments and the result will contain all of them joined together.
STRING                concat_ws(separator, s1, s2, ...)              Like concat, but using the specified separator.
STRING                substr(s, start_index)                         Return the substring of s starting from the start_index position, where 1 is the index of the first character, until the end of s. For example, substr('abcd', 3) results in 'cd'.
STRING                substr(s, start, length)                       Return the substring of s starting from the start position with the given length, e.g., substr('abcdefgh', 3, 2) results in 'cd'.
STRING                upper(s)                                       Return the string that results from converting all characters of s to uppercase, e.g., upper('hIvE') results in 'HIVE'.
STRING                ucase(s)                                       A synonym for upper().
STRING                lower(s)                                       Return the string that results from converting all characters of s to lowercase, e.g., lower('hIvE') results in 'hive'.
STRING                lcase(s)                                       A synonym for lower().
STRING                trim(s)                                        Return the string that results from removing whitespace from both ends of s, e.g., trim(' hive ') results in 'hive'.
STRING                ltrim(s)                                       Return the string resulting from trimming spaces from the beginning (lefthand side) of s, e.g., ltrim(' hive ') results in 'hive '.
STRING                rtrim(s)                                       Return the string resulting from trimming spaces from the end (righthand side) of s, e.g., rtrim(' hive ') results in ' hive'.
STRING                regexp_replace(s, regex, replacement)          Return the string resulting from replacing all substrings in s that match the Java regular expression regex with replacement,[a] e.g., regexp_replace('hive', '[ie]', 'z') returns 'hzvz'. If replacement is blank, the matches are effectively deleted.
STRING                regexp_extract(subject, regex_pattern, index)  Return the substring of subject matched by the group at position index in regex_pattern.
STRING                parse_url(url, partname, key)                  Extract the specified part from a URL. It takes a URL and the partname to extract. The valid partnames are case-sensitive: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<key>. The optional key argument is used only with the last form, QUERY:<key>. Compare with parse_url_tuple described in Table 6-4.
INT                   size(map<K,V>)                                 Return the number of elements in the map.
INT                   size(array<T>)                                 Return the number of elements in the array.
value of type         cast(<expr> as <type>)                         Convert (“cast”) the result of the expression expr to type, e.g., cast('1' as BIGINT) will convert the string '1' to its integral representation. A NULL is returned if the conversion does not succeed.
STRING                from_unixtime(int unixtime)                    Convert the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format '1970-01-01 00:00:00'.
STRING                to_date(timestamp)                             Return the date part of a timestamp string, e.g., to_date("1970-01-01 00:00:00") returns '1970-01-01'.
INT                   year(timestamp)                                Return the year part of a timestamp string as an INT, e.g., year("1970-11-01 00:00:00") returns 1970.
INT                   month(timestamp)                               Return the month part of a timestamp string as an INT, e.g., month("1970-11-01 00:00:00") returns 11.
INT                   day(timestamp)                                 Return the day part of a timestamp string as an INT, e.g., day("1970-11-01 00:00:00") returns 1.
STRING                get_json_object(json_string, path)             Extract the JSON object from a JSON string based on the given JSON path, and return the JSON string of the extracted object. NULL is returned if the input JSON string is invalid.
STRING                space(n)                                       Return a string of n spaces.
STRING                repeat(s, n)                                   Return s repeated n times.
INT                   ascii(s)                                       Return the integer value of the first ASCII character in the string s.
STRING                lpad(s, len, pad)                              Return s padded to exactly len characters, prepending instances of the string pad on the left, if necessary. If s is longer than len, it is truncated.
STRING                rpad(s, len, pad)                              Return s padded to exactly len characters, appending instances of the string pad on the right, if necessary. If s is longer than len, it is truncated.
ARRAY<STRING>         split(s, pattern)                              Return an array of the substrings of s, split on occurrences of pattern.
INT                   find_in_set(s, commaSeparatedString)           Return the index of s in the comma-separated string, or NULL if it is not found.
INT                   locate(substr, str, pos)                       Return the position of the first occurrence of substr in str after position pos.
INT                   instr(str, substr)                             Return the position of the first occurrence of substr in str.
MAP<STRING,STRING>    str_to_map(s, delim1, delim2)                  Create a map by parsing s, using delim1 as the separator between key-value pairs and delim2 as the key-value separator.
ARRAY<ARRAY<STRING>>  sentences(s, lang, locale)                     Split s into arrays of sentences, where each sentence is an array of words. The lang and locale arguments are optional; if omitted, the default locale is used.
ARRAY<STRUCT<STRING,DOUBLE>>  ngrams(array<array<string>>, N, K, pf)  Estimate the top-K n-grams in the text. pf is the precision factor.
ARRAY<STRUCT<STRING,DOUBLE>>  context_ngrams(array<array<string>>, array<string>, int K, int pf)  Like ngrams, but looks for n-grams that begin with the second array of words in each outer array.
BOOLEAN               in_file(s, filename)                           Return true if s appears in the file named filename.

[a] See http://docs.oracle.com/javase/tutorial/essential/regex/ for more on Java regular expression syntax.

Note that the time-related functions (near the end of the table) take integer or string arguments. As of Hive v0.8.0, these functions also take TIMESTAMP arguments, but they will continue to take integer or string arguments for backwards compatibility.
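
Here is a brief sketch of a few of the time-related functions; the values in the comments follow the examples in Table 6-5:

SELECT to_date('1970-11-01 00:00:00'),   -- '1970-11-01'
       year('1970-11-01 00:00:00'),      -- 1970
       month('1970-11-01 00:00:00'),     -- 11
       day('1970-11-01 00:00:00')        -- 1
FROM employees LIMIT 1;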

LIMIT Clause

The results of a typical query can return a large number of rows. The LIMIT clause puts an upper limit on the number of rows returned:

hive> SELECT upper(name), salary, deductions["Federal Taxes"],
    > round(salary * (1 - deductions["Federal Taxes"])) FROM employees
    > LIMIT 2;
JOHN DOE    100000.0  0.2   80000
MARY SMITH   80000.0  0.2   64000

Column Aliases

You can think of the previous example query as returning a new relation with new columns, some of which are anonymous results of manipulating columns in employees. It’s sometimes useful to give those anonymous columns a name, called a column alias. Here is the previous query with column aliases for the third and fourth columns returned by the query, fed_taxes and salary_minus_fed_taxes, respectively:

hive> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
    > round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
    > FROM employees LIMIT 2;
JOHN DOE    100000.0  0.2   80000
MARY SMITH   80000.0  0.2   64000

Nested SELECT Statements

The column alias feature is especially useful in nested select statements. Let’s use the previous example as a nested query:

hive> FROM (
    >   SELECT upper(name) as name, salary, deductions["Federal Taxes"] as fed_taxes,
    >   round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
    >   FROM employees
    > ) e
    > SELECT e.name, e.salary_minus_fed_taxes
    > WHERE e.salary_minus_fed_taxes > 70000;
JOHN DOE    80000

The previous result set is aliased as e, from which we perform a second query to select the name and the salary_minus_fed_taxes, where the latter is greater than 70,000. (We’ll cover WHERE clauses in WHERE Clauses below.)

CASE … WHEN … THEN Statements

The CASE … WHEN … THEN clauses are like if statements for individual columns in query results. For example:

hive> SELECT name, salary,
    >   CASE
    >     WHEN salary <  50000.0 THEN 'low'
    >     WHEN salary >= 50000.0 AND salary <  70000.0 THEN 'middle'
    >     WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
    >     ELSE 'very high'
    >   END AS bracket FROM employees;
John Doe         100000.0   very high
Mary Smith        80000.0   high
Todd Jones        70000.0   high
Bill King         60000.0   middle
Boss Man         200000.0   very high
Fred Finance     150000.0   very high
Stacy Accountant  60000.0   middle
...

When Hive Can Avoid MapReduce

If you have been running the queries in this book so far, you have probably noticed that a MapReduce job is started in most cases. Hive implements some kinds of queries without using MapReduce, in so-called local mode, for example:

SELECT * FROM employees;

In this case, Hive can simply read the records from employees and dump the formatted output to the console.

This even works for WHERE clauses that only filter on partition keys, with or without LIMIT clauses:

SELECT * FROM employees
WHERE country = 'US' AND state = 'CA'
LIMIT 100;

Furthermore, Hive will attempt to run other operations in local mode if the hive.exec.mode.local.auto property is set to true:

set hive.exec.mode.local.auto=true;

Otherwise, Hive uses MapReduce to run all other queries.

Note

Trust us, you want to add set hive.exec.mode.local.auto=true; to your $HOME/.hiverc file.

WHERE Clauses

While SELECT clauses select columns, WHERE clauses are filters: they specify which records to return. As with SELECT clauses, we have already used many simple examples of WHERE clauses before defining the clause, on the assumption that you have seen them before. Now we’ll explore them in a bit more detail.

WHERE clauses use predicate expressions, applying predicate operators, which we’ll describe in a moment, to columns. Several predicate expressions can be joined with AND and OR clauses. When the predicate expressions evaluate to true, the corresponding rows are retained in the output.

We just used the following example that restricts the results to employees in the state of California:

SELECT * FROM employees
WHERE country = 'US' AND state = 'CA';

The predicates can reference the same variety of computations over column values that can be used in SELECT clauses. Here we adapt our previously used query involving Federal Taxes, filtering for those rows where the salary minus the federal taxes is greater than 70,000:

hive> SELECT name, salary, deductions["Federal Taxes"],
    >   salary * (1 - deductions["Federal Taxes"])
    > FROM employees
    > WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;
John Doe    100000.0  0.2   80000.0

This query is a bit ugly, because the complex expression on the second line is duplicated in the WHERE clause. The following variation eliminates the duplication, using a column alias, but unfortunately it’s not valid:

hive>  SELECT name, salary, deductions["Federal Taxes"],
    >    salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
    >  FROM employees
    >  WHERE round(salary_minus_fed_taxes) > 70000;
FAILED: Error in semantic analysis: Line 4:13 Invalid table alias or
column reference 'salary_minus_fed_taxes': (possible column names are:
name, salary, subordinates, deductions, address)

As the error message says, we can’t reference column aliases in the WHERE clause. However, we can use a nested SELECT statement:

hive> SELECT e.* FROM
    > (SELECT name, salary, deductions["Federal Taxes"] as ded,
    >    salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
    >  FROM employees) e
    > WHERE round(e.salary_minus_fed_taxes) > 70000;
John Doe        100000.0        0.2     80000.0
Boss Man        200000.0        0.3     140000.0
Fred Finance    150000.0        0.3     105000.0

Predicate Operators

Table 6-6 describes the predicate operators, which are also used in JOIN … ON and HAVING clauses.

Table 6-6. Predicate operators

Operator               Types            Description

A = B                  Primitive types  True if A equals B; false otherwise.
A <> B, A != B         Primitive types  NULL if A or B is NULL; true if A is not equal to B; false otherwise.
A < B                  Primitive types  NULL if A or B is NULL; true if A is less than B; false otherwise.
A <= B                 Primitive types  NULL if A or B is NULL; true if A is less than or equal to B; false otherwise.
A > B                  Primitive types  NULL if A or B is NULL; true if A is greater than B; false otherwise.
A >= B                 Primitive types  NULL if A or B is NULL; true if A is greater than or equal to B; false otherwise.
A IS NULL              All types        True if A evaluates to NULL; false otherwise.
A IS NOT NULL          All types        False if A evaluates to NULL; true otherwise.
A LIKE B               String           True if A matches the SQL simplified regular expression specification given by B; false otherwise. B is interpreted as follows: 'x%' means A must begin with the prefix 'x', '%x' means A must end with the suffix 'x', and '%x%' means A must contain the substring 'x' anywhere within it. Similarly, the underscore '_' matches a single character. B must match the whole string A.
A RLIKE B, A REGEXP B  String           True if A matches the regular expression given by B; false otherwise. Matching is done by the JDK regular expression library, so it follows the rules of that library. For example, the regular expression must match the entire string A, not just a subset. See below for more information about regular expressions.

We’ll discuss LIKE and RLIKE in detail below (LIKE and RLIKE). First, let’s point out an issue with comparing floating-point numbers that you should understand.

Gotchas with Floating-Point Comparisons

A common gotcha arises when you compare floating-point numbers of different types (i.e., FLOAT versus DOUBLE). Consider the following query of the employees table, which is designed to return the employee’s name, salary, and federal taxes deduction, but only if that tax deduction exceeds 0.2 (20%) of his or her salary:

hive> SELECT name, salary, deductions['Federal Taxes']
    > FROM employees WHERE deductions['Federal Taxes'] > 0.2;
John Doe        100000.0        0.2
Mary Smith      80000.0         0.2
Boss Man        200000.0        0.3
Fred Finance    150000.0        0.3

Wait! Why are records with deductions['Federal Taxes'] = 0.2 being returned?

Is it a Hive bug? There is a bug filed against Hive for this issue, but it actually reflects the behavior of the internal representation of floating-point numbers when they are compared and it affects almost all software written in most languages on all modern digital computers (see https://issues.apache.org/jira/browse/HIVE-2586).

When you write a floating-point literal value like 0.2, Hive uses a DOUBLE to hold the value. We defined the deductions map values to be FLOAT, which means that Hive will implicitly convert the tax deduction value to DOUBLE to do the comparison. This should work, right?

Actually, it doesn’t work. Here’s why. The number 0.2 can’t be represented exactly in a FLOAT or DOUBLE. (See http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html for an in-depth discussion of floating-point number issues.) In this particular case, the closest exact value is just slightly greater than 0.2, with a few nonzero bits at the least significant end of the number.

To simplify things a bit, let’s say that 0.2 is actually 0.2000001 for FLOAT and 0.200000000001 for DOUBLE, because an 8-byte DOUBLE has more significant digits (after the decimal point). When the FLOAT value from the table is converted to DOUBLE by Hive, it produces the DOUBLE value 0.200000100000, which is greater than 0.200000000001. That’s why the query results appear to use >= not >!

This issue is not unique to Hive, nor to Java, in which Hive is implemented. Rather, it’s a general problem for all systems that use the IEEE standard for encoding floating-point numbers!

However, there are two workarounds we can use in Hive.

First, if we read the data from a TEXTFILE (see Chapter 15), which is what we have been assuming so far, then Hive reads the string “0.2” from the data file and converts it to a real number. We could use DOUBLE instead of FLOAT in our schema. Then we would be comparing a DOUBLE for the deductions['Federal Taxes'] with a DOUBLE for the literal 0.2. However, this change will increase the memory footprint of our queries. Also, we can’t simply change the schema like this if the data file is a binary file format like SEQUENCEFILE (discussed in Chapter 15).

The second workaround is to explicitly cast the 0.2 literal value to FLOAT. Java has a nice way of doing this: you append the letter f or F to the end of the number (e.g., 0.2f). Unfortunately, Hive doesn’t support this syntax; we have to use the cast operator.

Here is a modified query that casts the 0.2 literal value to FLOAT. With this change, the expected results are returned by the query:

hive> SELECT name, salary, deductions['Federal Taxes'] FROM employees
    > WHERE deductions['Federal Taxes'] > cast(0.2 AS FLOAT);
Boss Man        200000.0        0.3
Fred Finance    150000.0        0.3

Note the syntax inside the cast operator: number AS FLOAT.

Actually, there is also a third solution: avoid floating-point numbers for anything involving money.
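
For example, a hypothetical variant of our schema could store monetary amounts as integral cents; the employees_exact table and salary_cents column here are invented for illustration:

CREATE TABLE employees_exact (
  name          STRING,
  salary_cents  BIGINT   -- $100,000.00 is stored as the exact integer 10000000
);

SELECT name FROM employees_exact
WHERE salary_cents > 7000000;   -- exact integer comparison; no rounding surprises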

Warning

Use extreme caution when comparing floating-point numbers. Avoid all implicit casts from smaller to wider types.

LIKE and RLIKE

Table 6-6 describes the LIKE and RLIKE predicate operators. You have probably seen LIKE before, a standard SQL operator. It lets us match on strings that begin with or end with a particular substring, or when the substring appears anywhere within the string.

For example, the following three queries select the employee names and addresses where the street ends with Ave., the city begins with O, and the street contains Chicago:

hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Ave.';
John Doe        1 Michigan Ave.
Todd Jones      200 Chicago Ave.

hive> SELECT name, address.city FROM employees WHERE address.city LIKE 'O%';
Todd Jones      Oak Park
Bill King       Obscuria

hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Chi%';
Todd Jones      200 Chicago Ave.

A Hive extension is the RLIKE clause, which lets us use Java regular expressions, a more powerful minilanguage for specifying matches. The rich details of regular expression syntax and features are beyond the scope of this book. The entry for RLIKE in Table 6-6 provides links to resources with more details on regular expressions. Here, we demonstrate their use with an example, which finds all the employees whose street contains the word Chicago or Ontario:

hive> SELECT name, address.street
    > FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';
Mary Smith      100 Ontario St.
Todd Jones      200 Chicago Ave.

The string after the RLIKE keyword has the following interpretation. A period (.) matches any character and a star (*) means repeat the “thing to the left” (period, in the two cases shown) zero to many times. The expression (x|y) means match either x or y.

Hence, there might be no characters before “Chicago” or “Ontario” and there might be no characters after them. Of course, we could have written this particular example with two LIKE clauses:

SELECT name, address FROM employees
WHERE address.street LIKE '%Chicago%' OR address.street LIKE '%Ontario%';

General regular expression matches will let us express much richer matching criteria that would become very unwieldy with joined LIKE clauses such as these.
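
As a sketch of richer criteria, this query matches streets that begin with a house number followed by a word starting with O or M (the pattern is illustrative, not drawn from our sample data):

SELECT name, address.street
FROM employees WHERE address.street RLIKE '^[0-9]+ (O|M).*';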

For more details about regular expressions as implemented by Hive using Java, see the documentation for the Java regular expression syntax at http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html or see Regular Expression Pocket Reference by Tony Stubblebine (O’Reilly), Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan (O’Reilly), or Mastering Regular Expressions, 3rd Edition, by Jeffrey E.F. Friedl (O’Reilly).

GROUP BY Clauses

The GROUP BY statement is often used in conjunction with aggregate functions to group the result set by one or more columns and then perform an aggregation over each group.

Let’s return to the stocks table we defined in External Tables. The following query groups stock records for Apple by year, then averages the closing price for each year:

hive> SELECT year(ymd), avg(price_close) FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd);
1984    25.578625440597534
1985    20.193676221040867
1986    32.46102808021274
1987    53.88968399108163
1988    41.540079275138766
1989    41.65976212516664
1990    37.56268799823263
1991    52.49553383386182
1992    54.80338610251119
1993    41.02671956450572
1994    34.0813495847914
...

HAVING Clauses

The HAVING clause lets you constrain the groups produced by GROUP BY in a way that could otherwise require a subquery, using a more concise syntax. Here’s the previous query with an additional HAVING clause that limits the results to years where the average closing price was greater than $50.0:

hive> SELECT year(ymd), avg(price_close) FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd)
    > HAVING avg(price_close) > 50.0;
1987    53.88968399108163
1991    52.49553383386182
1992    54.80338610251119
1999    57.77071460844979
2000    71.74892876261757
2005    52.401745992993554
...

Without the HAVING clause, this query would require a nested SELECT statement:

hive> SELECT s2.year, s2.avg FROM
    > (SELECT year(ymd) AS year, avg(price_close) AS avg FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd)) s2
    > WHERE s2.avg > 50.0;
1987    53.88968399108163
...

JOIN Statements

Hive supports the classic SQL JOIN statement, but only equi-joins are supported.

Inner JOIN

In an inner JOIN, records are discarded unless the join criteria find matching records in every table being joined. For example, the following query compares Apple (symbol AAPL) and IBM (symbol IBM). The stocks table is joined against itself, a self-join, where the ymd (year-month-day) values must be equal in both tables. We say that the ymd columns are the join keys in this query:

hive> SELECT a.ymd, a.price_close, b.price_close
    > FROM stocks a JOIN stocks b ON a.ymd = b.ymd
    > WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';
2010-01-04      214.01  132.45
2010-01-05      214.38  130.85
2010-01-06      210.97  130.0
2010-01-07      210.58  129.55
2010-01-08      211.98  130.85
2010-01-11      210.11  129.48
...

The ON clause specifies the conditions for joining records between the two tables. The WHERE clause limits the lefthand table to AAPL records and the righthand table to IBM records. You can also see that using table aliases for the two occurrences of stocks is essential in this query.

As you may know, IBM is an older company than Apple. It has been a publicly traded stock for much longer than Apple. However, since this is an inner JOIN, no IBM records will be returned older than September 7, 1984, which was the first day that Apple was publicly traded!

Standard SQL allows a non-equi-join on the join keys, such as the following example that shows Apple versus IBM, but with all older records for Apple paired up with each day of IBM data. It would be a lot of data (Example 6-1)!

Example 6-1. Query that will not work in Hive

SELECT a.ymd, a.price_close, b.price_close
FROM stocks a JOIN stocks b
ON a.ymd <= b.ymd
WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';

This is not valid in Hive, primarily because it is difficult to implement these kinds of joins in MapReduce. It turns out that Pig offers a cross product feature that makes it possible to implement this join, even though Pig’s native join feature doesn’t support it, either.

Also, Hive does not currently support using OR between predicates in ON clauses.

To see a nonself join, let’s introduce the corresponding dividends data, also available from infochimps.org, as described in External Tables:

CREATE EXTERNAL TABLE IF NOT EXISTS dividends (
  ymd             STRING,
  dividend        FLOAT
)
PARTITIONED BY (exchange STRING, symbol STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Here is an inner JOIN between stocks and dividends for Apple, where we use the ymd and symbol columns as join keys:

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL';
1987-05-11      AAPL    77.0    0.015
1987-08-10      AAPL    48.25   0.015
1987-11-17      AAPL    35.0    0.02
...
1995-02-13      AAPL    43.75   0.03
1995-05-26      AAPL    42.69   0.03
1995-08-16      AAPL    44.5    0.03
1995-11-21      AAPL    38.63   0.03

Yes, Apple paid a dividend years ago and only recently announced it would start doing so again! Note that because we have an inner JOIN, we only see records approximately every three months, the typical schedule of dividend payments, which are announced when reporting quarterly results.

You can join more than two tables together. Let’s compare Apple, IBM, and GE side by side:

hive> SELECT a.ymd, a.price_close, b.price_close , c.price_close
    > FROM stocks a JOIN stocks b ON a.ymd = b.ymd
    >               JOIN stocks c ON a.ymd = c.ymd
    > WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
2010-01-04      214.01  132.45  15.45
2010-01-05      214.38  130.85  15.53
2010-01-06      210.97  130.0   15.45
2010-01-07      210.58  129.55  16.25
2010-01-08      211.98  130.85  16.6
2010-01-11      210.11  129.48  16.76
...

Most of the time, Hive will use a separate MapReduce job for each pair of things to join. In this example, it would use one job for tables a and b, then a second job to join the output of the first join with c.

Note

Why not join b and c first? Hive goes from left to right.

However, this example actually benefits from an optimization we’ll discuss next.

Join Optimizations

In the previous example, every ON clause uses a.ymd as one of the join keys. In this case, Hive can apply an optimization where it joins all three tables in a single MapReduce job. The optimization would also be used if b.ymd were used in both ON clauses.

Note

When joining three or more tables, if every ON clause uses the same join key, a single MapReduce job will be used.

Hive also assumes that the last table in the query is the largest. It attempts to buffer the other tables and then stream the last table through, while performing joins on individual records. Therefore, you should structure your join queries so the largest table is last.

Recall our previous join between stocks and dividends. We actually made the mistake of using the smaller dividends table last:

SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

We should switch the positions of stocks and dividends:

SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM dividends d JOIN stocks s ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

It turns out that these data sets are too small to see a noticeable performance difference, but for larger data sets, you’ll want to exploit this optimization.

Fortunately, you don’t have to put the largest table last in the query. Hive also provides a “hint” mechanism to tell the query optimizer which table should be streamed:

SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

Now Hive will attempt to stream the stocks table, even though it’s not the last table in the query.

There is another important optimization called map-side joins that we’ll return to in Map-side Joins.

LEFT OUTER JOIN

The left-outer join is indicated by adding the LEFT OUTER keywords:

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL';
...
1987-05-01      AAPL    80.0    NULL
1987-05-04      AAPL    79.75   NULL
1987-05-05      AAPL    80.25   NULL
1987-05-06      AAPL    80.0    NULL
1987-05-07      AAPL    80.25   NULL
1987-05-08      AAPL    79.0    NULL
1987-05-11      AAPL    77.0    0.015
1987-05-12      AAPL    75.5    NULL
1987-05-13      AAPL    78.5    NULL
1987-05-14      AAPL    79.25   NULL
1987-05-15      AAPL    78.25   NULL
1987-05-18      AAPL    75.75   NULL
1987-05-19      AAPL    73.25   NULL
1987-05-20      AAPL    74.5    NULL
...

In this join, all the records from the lefthand table that match the WHERE clause are returned. If the righthand table doesn’t have a record that matches the ON criteria, NULL is used for each column selected from the righthand table.

Hence, in this result set, we see that every Apple stock record is returned and the d.dividend value is usually NULL, except on days when a dividend was paid (May 11th, 1987, in this output).

OUTER JOIN Gotcha

Before we discuss the other outer joins, let’s discuss a gotcha you should understand.

Recall what we said previously about speeding up queries by adding partition filters in the WHERE clause. To speed up our previous query, we might choose to add predicates that select on the exchange in both tables:

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL'
    > AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
1987-05-11      AAPL    77.0    0.015
1987-08-10      AAPL    48.25   0.015
1987-11-17      AAPL    35.0    0.02
1988-02-12      AAPL    41.0    0.02
1988-05-16      AAPL    41.25   0.02
...

However, the output has changed, even though we thought we were just adding an optimization! We’re back to having approximately four stock records per year and we have non-NULL entries for all the dividend values. In other words, we are back to the original inner join!

This is actually common behavior for all outer joins in most SQL implementations. It occurs because the JOIN clause is evaluated first, then the results are passed through the WHERE clause. By the time the WHERE clause is reached, d.exchange is NULL most of the time, so the “optimization” actually filters out all records except those on the day of dividend payments.

One solution is straightforward: remove the clauses in the WHERE clause that reference the dividends table:

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL' AND s.exchange = 'NASDAQ';
...
1987-05-07      AAPL    80.25   NULL
1987-05-08      AAPL    79.0    NULL
1987-05-11      AAPL    77.0    0.015
1987-05-12      AAPL    75.5    NULL
1987-05-13      AAPL    78.5    NULL
...

This isn’t very satisfactory. You might wonder if you can move the predicates from the WHERE clause into the ON clause, at least the partition filters. This does not work for outer joins, despite documentation on the Hive Wiki that claims it should work (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins).

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d
    > ON s.ymd = d.ymd AND s.symbol = d.symbol
    > AND s.symbol = 'AAPL' AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
...
1962-01-02      GE      74.75   NULL
1962-01-02      IBM     572.0   NULL
1962-01-03      GE      74.0    NULL
1962-01-03      IBM     577.0   NULL
1962-01-04      GE      73.12   NULL
1962-01-04      IBM     571.25  NULL
1962-01-05      GE      71.25   NULL
1962-01-05      IBM     560.0   NULL
...

The partition filters are ignored for OUTER JOINs. However, using such filter predicates in ON clauses for INNER JOINs does work!

Fortunately, there is a solution that works for all joins: use nested SELECT statements:

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM
    > (SELECT * FROM stocks WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') s
    > LEFT OUTER JOIN
    > (SELECT * FROM dividends WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') d
    > ON s.ymd = d.ymd;
...
1988-02-10      AAPL    41.0    NULL
1988-02-11      AAPL    40.63   NULL
1988-02-12      AAPL    41.0    0.02
1988-02-16      AAPL    41.25   NULL
1988-02-17      AAPL    41.88   NULL
...

The nested SELECT statement performs the required “push down” to apply the partition filters before data is joined.

Warning

WHERE clauses are evaluated after joins are performed, so WHERE clauses should use predicates that only filter on column values that won’t be NULL. Also, contrary to Hive documentation, partition filters don’t work in ON clauses for OUTER JOINS, although they do work for INNER JOINS!

RIGHT OUTER JOIN

Right-outer joins return all records in the righthand table that match the WHERE clause. NULL is used for fields of missing records in the lefthand table.

Here we switch the places of stocks and dividends and perform a righthand join, but leave the SELECT statement unchanged:

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
    > WHERE s.symbol = 'AAPL';
...
1987-05-07      AAPL    80.25   NULL
1987-05-08      AAPL    79.0    NULL
1987-05-11      AAPL    77.0    0.015
1987-05-12      AAPL    75.5    NULL
1987-05-13      AAPL    78.5    NULL
...

FULL OUTER JOIN

Finally, a full-outer join returns all records from all tables that match the WHERE clause. NULL is used for fields in missing records in either table.

If we convert the previous query to a full-outer join, we’ll actually get the same results, since there is never a case where a dividend record exists without a matching stock record:

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM dividends d FULL OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
    > WHERE s.symbol = 'AAPL';
...
1987-05-07      AAPL    80.25   NULL
1987-05-08      AAPL    79.0    NULL
1987-05-11      AAPL    77.0    0.015
1987-05-12      AAPL    75.5    NULL
1987-05-13      AAPL    78.5    NULL
...

LEFT SEMI-JOIN

A left semi-join returns records from the lefthand table if records are found in the righthand table that satisfy the ON predicates. It’s a special, optimized case of the more general inner join. Most SQL dialects support an IN ... EXISTS construct to do the same thing. For instance, the following query in Example 6-2 attempts to return stock records only on the days of dividend payments, but it doesn’t work in Hive.

Example 6-2. Query that will not work in Hive

SELECT s.ymd, s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);

Instead, you use the following LEFT SEMI JOIN syntax:

hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;
...
1962-11-05      IBM     361.5
1962-08-07      IBM     373.25
1962-05-08      IBM     459.5
1962-02-06      IBM     551.5

Note that the SELECT and WHERE clauses can’t reference columns from the righthand table.

Caution

Right semi-joins are not supported in Hive.

The reason semi-joins are more efficient than the more general inner join is as follows. For a given record in the lefthand table, Hive can stop looking for matching records in the righthand table as soon as any match is found. At that point, the selected columns from the lefthand table record can be projected.

Cartesian Product JOINs

A Cartesian product is a join where all the tuples in the left side of the join are paired with all the tuples of the right table. If the left table has 5 rows and the right table has 6 rows, 30 rows of output will be produced:

SELECT * FROM stocks JOIN dividends;

Using the tables of stocks and dividends, it is hard to find a reason for a join of this type, since the dividends of one stock are not meaningfully paired with the prices of another. Additionally, Cartesian products create a lot of data. Unlike other join types, Cartesian products are not executed in parallel, and they are not optimized in any way using MapReduce.

It is critical to point out that using the wrong join syntax will cause a long, slow-running Cartesian product query. For example, the following query will be optimized to an inner join in many databases, but not in Hive:

hive> SELECT * FROM stocks JOIN dividends
    > WHERE stocks.symbol = dividends.symbol AND stocks.symbol = 'AAPL';

In Hive, this query computes the full Cartesian product before applying the WHERE clause. It could take a very long time to finish. When the property hive.mapred.mode is set to strict, Hive prevents users from inadvertently issuing a Cartesian product query. We’ll discuss the features of strict mode more extensively in Chapter 10.
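
To get the intended inner join, move the join predicate into an ON clause, which Hive executes as an ordinary (parallelizable) inner join; a sketch:

SELECT * FROM stocks JOIN dividends
ON stocks.symbol = dividends.symbol
WHERE stocks.symbol = 'AAPL';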

Note

Cartesian product queries can be useful. For example, suppose there is a table of user preferences, a table of news articles, and an algorithm that predicts which articles a user would like to read. A Cartesian product is required to generate the set of all users and all pages.

Map-side Joins

If all but one table is small, the largest table can be streamed through the mappers while the small tables are cached in memory. Hive can do all the joining map-side, since it can look up every possible match against the small tables in memory, thereby eliminating the reduce step required in the more common join scenarios. Even on smaller data sets, this optimization is noticeably faster than the normal join. Not only does it eliminate reduce steps, it sometimes reduces the number of map steps, too.

The joins between stocks and dividends can exploit this optimization, as the dividends data set is small enough to be cached.

Before Hive v0.7, it was necessary to add a hint to the query to enable this optimization. Returning to our inner join example:

SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

On a fast MacBook Pro laptop, this query ran in approximately 23 seconds, versus 33 seconds for the original unoptimized query, making it roughly 30% faster on our sample stock data.

The hint still works, but it is deprecated as of Hive v0.7. However, you still have to set the property hive.auto.convert.join to true before Hive will attempt the optimization. It's false by default:

hive> set hive.auto.convert.join=true;

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL';

Note that you can also configure the threshold size for table files considered small enough to use this optimization. Here is the default definition of the property (in bytes):

hive.mapjoin.smalltable.filesize=25000000

If you always want Hive to attempt this optimization, set one or both of these properties in your $HOME/.hiverc file.
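For example, a sketch of the relevant $HOME/.hiverc entries; the file size value shown is just the default discussed above (about 25 MB):

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;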

Hive does not support the optimization for right- and full-outer joins.

This optimization can also be used for larger tables under certain conditions when the data for every table is bucketed, as discussed in Bucketing Table Data Storage. Briefly, the data must be bucketed on the keys used in the ON clause and the number of buckets for one table must be a multiple of the number of buckets for the other table. When these conditions are met, Hive can join individual buckets between tables in the map phase, because it does not need to fetch the entire contents of one table to match against each bucket in the other table.
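For example, here is a sketch using hypothetical bucketed versions of our tables, suitable for a join whose ON clause uses only the symbol column; 96 is a multiple of 32, so the conditions above are met:

CREATE TABLE stocks_bucketed (ymd STRING, symbol STRING, price_close FLOAT)
CLUSTERED BY (symbol) INTO 96 BUCKETS;

CREATE TABLE dividends_bucketed (ymd STRING, symbol STRING, dividend FLOAT)
CLUSTERED BY (symbol) INTO 32 BUCKETS;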

However, this optimization is not turned on by default. It must be enabled by setting the property hive.optimize.bucketmapjoin:

set hive.optimize.bucketmapjoin=true;

If the bucketed tables actually have the same number of buckets and the data is sorted by the join/bucket keys, then Hive can perform an even faster sort-merge join. Once again, properties must be set to enable the optimization:

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;

ORDER BY and SORT BY

The ORDER BY clause is familiar from other SQL dialects. It performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.

Hive adds an alternative, SORT BY, that orders the data only within each reducer, thereby performing a local ordering in which each reducer's output is sorted. Total ordering is traded away in exchange for better performance.

In both cases, the syntax differs only by the use of the ORDER or SORT keyword. You can specify any columns you wish, and for each column specify ascending order with the ASC keyword (the default) or descending order with the DESC keyword.

Here is an example using ORDER BY:

SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
ORDER BY s.ymd ASC, s.symbol DESC;

Here is the same example using SORT BY instead:

SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
SORT BY s.ymd ASC, s.symbol DESC;

The two queries look almost identical, but if more than one reducer is invoked, the output will be sorted differently. While each reducer’s output files will be sorted, the data will probably overlap with the output of other reducers.

Because ORDER BY can result in excessively long run times, Hive will require a LIMIT clause with ORDER BY if the property hive.mapred.mode is set to strict. By default, it is set to nonstrict.
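For example, a sketch of the earlier ORDER BY query written to satisfy strict mode, with a LIMIT clause appended:

hive> set hive.mapred.mode=strict;

hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s
    > ORDER BY s.ymd ASC, s.symbol DESC
    > LIMIT 100;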

DISTRIBUTE BY with SORT BY

DISTRIBUTE BY controls how map output is divided among reducers. All data that flows through a MapReduce job is organized into key-value pairs. Hive must use this feature internally when it converts your queries to MapReduce jobs.

Usually, you won’t need to worry about this feature. The exceptions are queries that use the Streaming feature (see Chapter 14) and some stateful UDAFs (User-Defined Aggregate Functions; see Aggregate Functions). There is one other scenario where these clauses are useful.

By default, MapReduce computes a hash on the keys output by mappers and tries to evenly distribute the key-value pairs among the available reducers using the hash values. Unfortunately, this means that when we use SORT BY, the contents of one reducer’s output will overlap significantly with the output of the other reducers, as far as sorted order is concerned, even though the data is sorted within each reducer’s output.

Say we want the data for each stock symbol to be captured together. We can use DISTRIBUTE BY to ensure that the records for each stock symbol go to the same reducer, then use SORT BY to order the data the way we want. The following query demonstrates this technique:

hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s
    > DISTRIBUTE BY s.symbol
    > SORT BY s.symbol ASC, s.ymd ASC;
1984-09-07  AAPL  26.5
1984-09-10  AAPL  26.37
1984-09-11  AAPL  26.87
1984-09-12  AAPL  26.12
1984-09-13  AAPL  27.5
1984-09-14  AAPL  27.87
1984-09-17  AAPL  28.62
1984-09-18  AAPL  27.62
1984-09-19  AAPL  27.0
1984-09-20  AAPL  27.12
...

Of course, the ASC keywords could have been omitted as they are the defaults. The ASC keyword is placed here for reasons that will be described shortly.

DISTRIBUTE BY works similarly to GROUP BY in the sense that it controls how reducers receive rows for processing, while SORT BY controls the sorting of data inside each reducer.

Note that Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause.

CLUSTER BY

In the previous example, the s.symbol column was used in the DISTRIBUTE BY clause, and the s.symbol and s.ymd columns in the SORT BY clause. Suppose the same columns are used in both clauses and all columns are sorted in ascending order (the default). In this case, the CLUSTER BY clause is a shorthand way of expressing the same query.

For example, let’s modify the previous query to drop sorting by s.ymd and use CLUSTER BY on s.symbol:

hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s
    > CLUSTER BY s.symbol;
2010-02-08  AAPL  194.12
2010-02-05  AAPL  195.46
2010-02-04  AAPL  192.05
2010-02-03  AAPL  199.23
2010-02-02  AAPL  195.86
2010-02-01  AAPL  194.73
2010-01-29  AAPL  192.06
2010-01-28  AAPL  199.29
2010-01-27  AAPL  207.88
...

Because the sort requirement on s.ymd has been removed, the output reflects the original order of the stock data, which happens to be sorted in descending order by date.

Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY clauses is a way to exploit the parallelism of SORT BY, yet achieve a total ordering across the output files.
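To make the equivalence concrete, the following two queries produce the same result (a sketch):

SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC;

SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
CLUSTER BY s.symbol;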

Casting

We briefly mentioned in Primitive Data Types that Hive will perform some implicit conversions, called casts, of numeric data types as needed, for example when comparing two numbers of different types. This topic is discussed more fully in Predicate Operators and Gotchas with Floating-Point Comparisons.

Here we discuss the cast() function that allows you to explicitly convert a value of one type to another.

Recall that our employees table uses a FLOAT for the salary column. Now imagine for a moment that STRING was used for that column instead. How could we work with the values as FLOATs?

The following example casts the values to FLOAT before performing a comparison:

SELECT name, salary FROM employees
WHERE cast(salary AS FLOAT) < 100000.0;

The syntax of the cast function is cast(value AS TYPE). What would happen in the example if a salary value was not a valid string for a floating-point number? In this case, Hive returns NULL.
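Here is a quick sketch of that NULL behavior, using literal strings in place of a real column:

hive> SELECT cast('100000.0' AS FLOAT), cast('not a number' AS FLOAT)
    > FROM employees LIMIT 1;
100000.0    NULL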

Note that the preferred way to convert floating-point numbers to integers is to use the round() or floor() functions listed in Table 6-2, rather than to use the cast operator.
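For instance, a sketch of truncating the FLOAT salaries to whole dollars with floor():

SELECT name, floor(salary) FROM employees;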

Casting BINARY Values

The new BINARY type introduced in Hive v0.8.0 only supports casting BINARY to STRING. However, if you know the value is a number, you can nest cast() invocations, as in this example where column b is a BINARY column:

SELECT (2.0 * cast(cast(b AS STRING) AS DOUBLE)) FROM src;

You can also cast STRING to BINARY.

Queries that Sample Data

For very large data sets, sometimes you want to work with a representative sample of a query result, not the whole thing. Hive supports this goal with queries that sample tables organized into buckets.

In the following example, assume the numbers table has a single number column holding the values 1 through 10.

We can sample using the rand() function, which returns a random number. In the first two queries, two distinct numbers are returned for each query. In the third query, no results are returned:

hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
2
4

hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
7
10

hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;

If we bucket on a column instead of rand(), then identical results are returned on multiple runs:

hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
2

hive> SELECT * from numbers TABLESAMPLE(BUCKET 5 OUT OF 10 ON number) s;
4

hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
2

The denominator in the bucket clause represents the number of buckets into which the data will be hashed. The numerator is the bucket number selected; roughly speaking, a row lands in the sample when the hash of the bucketing expression, modulo the denominator, equals the numerator minus one:

hive> SELECT * from numbers TABLESAMPLE(BUCKET 1 OUT OF 2 ON number) s;
2
4
6
8
10

hive> SELECT * from numbers TABLESAMPLE(BUCKET 2 OUT OF 2 ON number) s;
1
3
5
7
9

Block Sampling

Hive offers another syntax for sampling a percentage of blocks of an input path as an alternative to sampling based on rows:

hive> SELECT * FROM numbersflat TABLESAMPLE(0.1 PERCENT) s;

Warning

This sampling is not known to work with all file formats. Also, the smallest unit of sampling is a single HDFS block. Hence, for tables smaller than the typical block size of 128 MB, all rows will be returned.

Percentage-based sampling offers a property to control the seed used for block-based sampling. Different seeds produce different samples:

<property>
  <name>hive.sample.seednumber</name>
  <value>0</value>
  <description>A number used for percentage sampling. By changing this
  number, user will change the subsets of data sampled.</description>
</property>
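For example, a sketch of drawing a different sample by changing the seed (the value 7 is arbitrary):

hive> set hive.sample.seednumber=7;

hive> SELECT * FROM numbersflat TABLESAMPLE(0.1 PERCENT) s;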

Input Pruning for Bucket Tables

From a first look at the TABLESAMPLE syntax, an astute user might come to the conclusion that the following query would be equivalent to the TABLESAMPLE operation:

hive> SELECT * FROM numbersflat WHERE number % 2 = 0;
2
4
6
8
10

It is true that for most table types, sampling scans through the entire table and selects every Nth row. However, if the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE queries only scan the required hash partitions of the table:

hive> CREATE TABLE numbers_bucketed (number int) CLUSTERED BY (number) INTO 3 BUCKETS;

hive> SET hive.enforce.bucketing=true;

hive> INSERT OVERWRITE TABLE numbers_bucketed SELECT number FROM numbers;

hive> dfs -ls /user/hive/warehouse/mydb.db/numbers_bucketed;
/user/hive/warehouse/mydb.db/numbers_bucketed/000000_0
/user/hive/warehouse/mydb.db/numbers_bucketed/000001_0
/user/hive/warehouse/mydb.db/numbers_bucketed/000002_0

hive> dfs -cat /user/hive/warehouse/mydb.db/numbers_bucketed/000001_0;
1
7
10
4

Because this table is clustered into three buckets, the following query can be used to sample only one of the buckets efficiently:

hive> SELECT * FROM numbers_bucketed TABLESAMPLE (BUCKET 2 OUT OF 3 ON number) s;
1
7
10
4

UNION ALL

UNION ALL combines two or more tables. Each subquery of the union query must produce the same number of columns, and for each column, its type must match all the column types in the same position. For example, if the second column is a FLOAT, then the second column of all the other query results must be a FLOAT.

Here is an example that merges log data:

SELECT log.ymd, log.level, log.message
  FROM (
    SELECT l1.ymd, l1.level,
      l1.message, 'Log1' AS source
    FROM log1 l1
  UNION ALL
    SELECT l2.ymd, l2.level,
      l2.message, 'Log2' AS source
    FROM log2 l2
  ) log
SORT BY log.ymd ASC;

UNION ALL may also be used when the subqueries select from the same source table. Logically, the same results could often be achieved with a single SELECT and WHERE clause, but splitting a long, complex WHERE clause into two or more UNION ALL subqueries can improve readability. However, unless the source table is indexed, the query will have to make multiple passes over the same source data. For example:

FROM (
  FROM src SELECT src.key, src.value WHERE src.key < 100
  UNION ALL
  FROM src SELECT src.* WHERE src.key > 110
) unioninput
INSERT OVERWRITE DIRECTORY '/tmp/union.out' SELECT unioninput.*;


