Chapter 3. Data Types and File Formats

Hive supports many of the primitive data types you find in relational databases, as well as three collection data types that are rarely found in relational databases, for reasons we’ll discuss shortly.

A related concern is how these types are represented in text files, as well as alternatives to text storage that address various performance and other concerns. A unique feature of Hive, compared to most databases, is that it provides great flexibility in how data is encoded in files. Most databases take total control of the data, both how it is persisted to disk and its life cycle. By letting you control all these aspects, Hive makes it easier to manage and process data with a variety of tools.

Primitive Data Types

Hive supports several sizes of integer and floating-point types, a Boolean type, and character strings of arbitrary length. Hive v0.8.0 added types for timestamps and binary fields.

Table 3-1 lists the primitive types supported by Hive.

Table 3-1. Primitive data types

Type                  Size                               Literal syntax examples

TINYINT               1 byte signed integer.             20

SMALLINT              2 byte signed integer.             20

INT                   4 byte signed integer.             20

BIGINT                8 byte signed integer.             20

BOOLEAN               Boolean true or false.             TRUE

FLOAT                 Single precision floating point.   3.14159

DOUBLE                Double precision floating point.   3.14159

STRING                Sequence of characters. The        'Now is the time',
                      character set can be specified.    "for all good men"
                      Single or double quotes can be
                      used.

TIMESTAMP (v0.8.0+)   Integer, float, or string.         1327882394 (Unix epoch seconds),
                                                         1327882394.123456789 (Unix epoch
                                                         seconds plus nanoseconds), and
                                                         '2012-02-03 12:34:56.123456789'
                                                         (JDBC-compliant java.sql.Timestamp
                                                         format)

BINARY (v0.8.0+)      Array of bytes.                    See discussion below

As for other SQL dialects, the case of these names is ignored.

It’s useful to remember that each of these types is implemented in Java, so the particular behavior details will be exactly what you would expect from the corresponding Java types. For example, STRING is implemented by the Java String, FLOAT is implemented by Java float, etc.

Note that Hive does not support “character arrays” (strings) with maximum-allowed lengths, as is common in other SQL dialects. Relational databases offer this feature as a performance optimization; fixed-length records are easier to index, scan, etc. In the “looser” world in which Hive lives, where it may not own the data files and has to be flexible on file format, Hive relies on the presence of delimiters to separate fields. Also, Hadoop and Hive emphasize optimizing disk reading and writing performance, where fixing the lengths of column values is relatively unimportant.

Values of the new TIMESTAMP type can be integers, which are interpreted as seconds since the Unix epoch (midnight, January 1, 1970); floats, which are interpreted as seconds since the epoch with nanosecond resolution (up to 9 decimal places); or strings, which are interpreted according to the JDBC date string format convention, YYYY-MM-DD hh:mm:ss.fffffffff.

TIMESTAMP values are interpreted as UTC times. Hive provides built-in functions for converting timestamps to and from UTC, to_utc_timestamp and from_utc_timestamp, respectively (see Chapter 13 for more details).
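For example, here is a minimal sketch of the two functions (the timezone ID 'America/Chicago' and the sample values are our own illustrative choices, and we assume a Hive version that accepts SELECT without a FROM clause):

-- Interpret the timestamp as America/Chicago local time and convert to UTC
-- (Chicago is UTC-6 in February):
SELECT to_utc_timestamp('2012-02-03 12:34:56', 'America/Chicago');
-- 2012-02-03 18:34:56

-- Interpret the timestamp as UTC and convert to America/Chicago local time:
SELECT from_utc_timestamp('2012-02-03 18:34:56', 'America/Chicago');
-- 2012-02-03 12:34:56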

The BINARY type is similar to the VARBINARY type found in many relational databases. It’s not like a BLOB type, since BINARY columns are stored within the record, not separately like BLOBs. BINARY can be used as a way of including arbitrary bytes in a record and preventing Hive from attempting to parse them as numbers, strings, etc.

Note that you don’t need BINARY if your goal is to ignore the tail end of each record. If a table schema specifies three columns and the data files contain five values for each record, the last two will be ignored by Hive.
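As a brief sketch of declaring such a column (the table and column names here are hypothetical):

CREATE TABLE raw_records (
  id      INT,
  payload BINARY);  -- opaque bytes that Hive will not attempt to parse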

What if you run a query that wants to compare a float column to a double column or compare a value of one integer type with a value of a different integer type? Hive will implicitly cast any integer to the larger of the two integer types, cast FLOAT to DOUBLE, and cast any integer value to DOUBLE, as needed, so it is comparing identical types.
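For instance, given a hypothetical table t with a SMALLINT column s, a BIGINT column b, a FLOAT column f, and a DOUBLE column d, these comparisons trigger the implicit casts just described:

SELECT * FROM t WHERE s = b;  -- s is promoted from SMALLINT to BIGINT
SELECT * FROM t WHERE f < d;  -- f is promoted from FLOAT to DOUBLE
SELECT * FROM t WHERE b < d;  -- b, an integer, is promoted to DOUBLE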

What if you run a query that wants to interpret a string column as a number? You can explicitly cast one type to another as in the following example, where s is a string column that holds a value representing an integer:

... cast(s AS INT) ...;

(To be clear, AS and INT are keywords, so lowercase would be fine.)
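In the context of a complete query (the messages table and its use here are hypothetical), the cast lets us do arithmetic on the string column:

-- s holds text such as '42'; cast it so it can be treated as a number:
SELECT cast(s AS INT) * 2 FROM messages;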

We’ll discuss data conversions in more depth in Casting.

Collection Data Types

Hive supports columns that are structs, maps, and arrays. Note that the literal syntax examples in Table 3-2 are actually calls to built-in functions.

Table 3-2. Collection data types

Type     Description                                      Literal syntax examples

STRUCT   Analogous to a C struct or an "object."          struct('John', 'Doe')
         Fields can be accessed using the "dot"
         notation. For example, if a column name is
         of type STRUCT {first STRING; last STRING},
         then the first name field can be referenced
         using name.first.

MAP      A collection of key-value tuples, where the      map('first', 'John',
         fields are accessed using array notation              'last', 'Doe')
         (e.g., ['key']). For example, if a column
         name is of type MAP with key-value pairs
         'first'→'John' and 'last'→'Doe', then the
         last name can be referenced using
         name['last'].

ARRAY    Ordered sequences of the same type that are      array('John', 'Doe')
         indexable using zero-based integers. For
         example, if a column name is of type ARRAY
         of strings with the value ['John', 'Doe'],
         then the second element can be referenced
         using name[1].

As for simple types, the case of the type name is ignored.

Most relational databases don’t support such collection types, because using them tends to break normal form. For example, in traditional data models, structs might be captured in separate tables, with foreign key relations between the tables, as appropriate.

A practical problem with breaking normal form is the greater risk of data duplication, leading to unnecessary disk space consumption and potential data inconsistencies, as duplicate copies can grow out of sync as changes are made.

However, in Big Data systems, a benefit of sacrificing normal form is higher processing throughput. Scanning data off hard disks with minimal “head seeks” is essential when processing terabytes to petabytes of data. Embedding collections in records makes retrieval faster with minimal seeks. Navigating each foreign key relationship requires seeking across the disk, with significant performance overhead.

Note

Hive doesn’t have the concept of keys. However, you can index tables, as we’ll see in Chapter 7.

Here is a table declaration that demonstrates how to use these types, an employees table in a fictitious Human Resources application:

CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>);

The name is a simple string and for most employees, a float is large enough for the salary. The list of subordinates is an array of string values, where we treat the name as a “primary key,” so each element in subordinates would reference another record in the table. Employees without subordinates would have an empty array. In a traditional model, the relationship would go the other way, from an employee to his or her manager. We’re not arguing that our model is better for Hive; it’s just a contrived example to illustrate the use of arrays.

The deductions is a map that holds a key-value pair for every deduction that will be subtracted from the employee’s salary when paychecks are produced. The key is the name of the deduction (e.g., “Federal Taxes”), and the value would be either a percentage or an absolute number. In a traditional data model, there might be separate tables for each deduction type (each key in our map), where the rows contain particular deduction values and a foreign key pointing back to the corresponding employee record.

Finally, the home address of each employee is represented as a struct, where each field is named and has a particular type.

Note that Java syntax conventions for generics are followed for the collection types. For example, MAP<STRING, FLOAT> means that every key in the map will be of type STRING and every value will be of type FLOAT. For an ARRAY<STRING>, every item in the array will be a STRING. STRUCTs can mix different types, but the locations are fixed to the declared position in the STRUCT.
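To connect the access syntax from Table 3-2 to this schema, here is a short illustrative query against the employees table:

SELECT name,
       subordinates[0],              -- first element of the ARRAY
       deductions['Federal Taxes'],  -- MAP lookup by key
       address.city                  -- STRUCT field via "dot" notation
FROM employees;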

Text File Encoding of Data Values

Let’s begin our exploration of file formats by looking at the simplest example, text files.

You are no doubt familiar with text files delimited with commas or tabs, the so-called comma-separated values (CSVs) or tab-separated values (TSVs), respectively. Hive can use those formats if you want and we’ll show you how shortly. However, there is a drawback to both formats; you have to be careful about commas or tabs embedded in text and not intended as field or column delimiters. For this reason, Hive uses various control characters by default, which are less likely to appear in value strings. Hive uses the term field when overriding the default delimiter, as we’ll see shortly. They are listed in Table 3-3.

Table 3-3. Hive’s default record and field delimiters

Delimiter          Description

\n                 For text files, each line is a record, so the line feed
                   character separates records.

^A ("control" A)   Separates all fields (columns). Written using the octal
                   code \001 when explicitly specified in CREATE TABLE
                   statements.

^B                 Separates the elements in an ARRAY or STRUCT, or the
                   key-value pairs in a MAP. Written using the octal code
                   \002 when explicitly specified in CREATE TABLE
                   statements.

^C                 Separates the key from the corresponding value in MAP
                   key-value pairs. Written using the octal code \003 when
                   explicitly specified in CREATE TABLE statements.

Records for the employees table declared in the previous section would look like the following example, where we use ^A, etc., to represent the field delimiters. A text editor like Emacs will show the delimiters this way. Note that the lines have been wrapped in the example because they are too long for the printed page. To clearly indicate the division between records, we have added blank lines between them that would not appear in the file:

John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600

Mary Smith^A80000.0^ABill King^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A100 Ontario St.^BChicago^BIL^B60601

Todd Jones^A70000.0^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.1^A200 Chicago Ave.^BOak Park^BIL^B60700

Bill King^A60000.0^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.1^A300 Obscure Dr.^BObscuria^BIL^B60100

This is a little hard to read, but you would normally let Hive do that for you, of course. Let’s walk through the first line to understand the structure. First, here is what it would look like in JavaScript Object Notation (JSON), where we have also inserted the names from the table schema:

{
  "name":  "John Doe",
  "salary": 100000.0,
  "subordinates": ["Mary Smith", "Todd Jones"],
  "deductions": {
    "Federal Taxes": .2,
    "State Taxes":   .05,
    "Insurance":     .1
  },
  "address": {
    "street": "1 Michigan Ave.",
    "city":   "Chicago",
    "state":  "IL",
    "zip":    60600
  }
}

You’ll note that maps and structs are effectively the same thing in JSON.

Now, here’s how the first line of the text file breaks down:

  • John Doe is the name.

  • 100000.0 is the salary.

  • Mary Smith^BTodd Jones are the subordinates “Mary Smith” and “Todd Jones.”

  • Federal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1 are the deductions, where 20% is deducted for “Federal Taxes,” 5% is deducted for “State Taxes,” and 10% is deducted for “Insurance.”

  • 1 Michigan Ave.^BChicago^BIL^B60600 is the address, “1 Michigan Ave., Chicago, IL, 60600.”

You can override these default delimiters. This might be necessary if another application writes the data using a different convention. Here is the same table declaration again, this time with all the format defaults explicitly specified:

CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

The ROW FORMAT DELIMITED sequence of keywords must appear before any of the other clauses, with the exception of the STORED AS … clause.

The character \001 is the octal code for ^A. The clause ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' means that Hive will use the ^A character to separate fields.

Similarly, the character \002 is the octal code for ^B. The clause ROW FORMAT DELIMITED COLLECTION ITEMS TERMINATED BY '\002' means that Hive will use the ^B character to separate collection items.

Finally, the character \003 is the octal code for ^C. The clause ROW FORMAT DELIMITED MAP KEYS TERMINATED BY '\003' means that Hive will use the ^C character to separate map keys from values.

The clauses LINES TERMINATED BY '…' and STORED AS … do not require the ROW FORMAT DELIMITED keywords.

Actually, it turns out that Hive does not currently support any character for LINES TERMINATED BY … other than '\n'. So this clause has limited utility today.

You can override the field, collection, and key-value separators and still use the default text file format, so the clause STORED AS TEXTFILE is rarely used. For most of this book, we will use the default TEXTFILE file format.

There are other file format options, but we’ll defer discussing them until Chapter 15. A related issue is compression of files, which we’ll discuss in Chapter 11.

So, while you can specify all of these clauses explicitly, most of the time you use the default separators and provide clauses only for the settings you want to override.

Warning

These specifications only affect what Hive expects to see when it reads files. Except in a few limited cases, it’s up to you to write the data files in the correct format.

For example, here is a table definition where the data will contain comma-delimited fields:

CREATE TABLE some_data (
  first   FLOAT,
  second  FLOAT,
  third   FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Use '\t' for tab-delimited fields.

Note

This example does not properly handle the general case of files in CSV (comma-separated values) and TSV (tab-separated values) formats. Such files can include a header row with column names, and string values may be quoted and may contain embedded commas or tabs, respectively. See Chapter 15 for details on handling these file types more generally.

This powerful customization feature makes it much easier to use Hive with files created by other tools and various ETL (extract, transform, and load) processes.

Schema on Read

When you write data to a traditional database, either through loading external data, writing the output of a query, doing UPDATE statements, etc., the database has total control over the storage. The database is the “gatekeeper.” An important implication of this control is that the database can enforce the schema as data is written. This is called schema on write.

Hive has no such control over the underlying storage. There are many ways to create, modify, and even damage the data that Hive will query. Therefore, Hive can only check the data against the schema when queries read it. This is called schema on read.

So what if the schema doesn’t match the file contents? Hive does the best that it can to read the data. You will get lots of null values if there aren’t enough fields in each record to match the schema. If some fields are numbers and Hive encounters nonnumeric strings, it will return nulls for those fields. Above all else, Hive tries to recover from all errors as best it can.
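As a concrete, hypothetical illustration, suppose a data file for the comma-delimited some_data table defined earlier contains the malformed record 1.0,banana:

SELECT * FROM some_data;
-- Returns: 1.0   NULL   NULL
-- second is NULL because 'banana' is not numeric;
-- third is NULL because the record has no third field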
