Chapter 14 – Creating Databases and Tables

“Create a definite plan for carrying out your desire and begin at once, whether you are ready or not, to put this plan into action.”

- Napoleon Hill

Creating a Database

image

A database in Hive is considered a namespace: a directory designed to separate tables and data structures. There will always be a default database directory, defined by a parameter named hive.metastore.warehouse.dir when the system is first set up.
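
As a quick sketch of the syntax (the database name sales_db and its comment are hypothetical), a database can be created, listed, and switched to like this:

CREATE DATABASE IF NOT EXISTS sales_db
COMMENT 'Namespace for the sales tables' ;

SHOW DATABASES ;
USE sales_db ;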

The Basics of Creating a Table

image

At the bare minimum, you must include the table name, along with the column names and their data types, to create a table. A column that does not allow NULLs has a NOT NULL constraint, and it does not accept rows with no value for that column. Data stored in HDFS is schema-on-read, meaning that Hive does not examine data for integrity when it is written to disk. Hive merely assigns a table name to the directory in the Hadoop Distributed File System (HDFS).
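
Here is a minimal sketch showing just those required pieces (the table and column names are hypothetical):

CREATE TABLE Employee_Table (
 Emp_No INT
,Emp_Name STRING
,Salary DOUBLE
) ;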

The ROW FORMAT will be Delimited or Serde

image

When you create a table, you will give it a row format of either Delimited or SerDe.

Hive Data Type Fundamentals

image

When you create a table, you will give each column a data type.

An Example of a Table Using All Basic Data Types

CREATE TABLE Web_Pages (
 State_Code TINYINT
,Web_Id SMALLINT
,User_No INT
,Clicks BIGINT
,Return_User BOOLEAN
,Life_Clicks FLOAT
,Group_Clicks DOUBLE
,Contract STRING
,Saves BINARY
,Enroll_Time TIMESTAMP
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

When you create a table, you will give each column a data type. The example above lists all of the basic data types.

Settings so Hive Can Automatically Partition a Table

image

Hive supports partitioning a table. In order for Hive to automatically partition a table, you must configure a few settings. The first setting above allows Hive to perform dynamic partitioning. The second setting tells Hive not to be strict about the partitioning. The third setting informs Hive not to put more than 1000 partitions of a table on a single node.
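
As a sketch of what those three settings look like (these are the standard Hive dynamic-partitioning properties; the exact values in the figure may differ):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;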

Creating a Partitioned Table

image
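
A minimal sketch of the partitioned create (the non-partition columns are hypothetical; the partition column dept_no matches the folder names shown below):

CREATE TABLE emp_partitioned (
 Emp_No INT
,Emp_Name STRING
,Salary DOUBLE
)
PARTITIONED BY (dept_no INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;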

This will result in each Dept_No having its own subfolder in the underlying warehouse folder /apps/hive/warehouse/emp_partitioned:

/dept_no=100/
/dept_no=200/
/dept_no=300/

A partitioned table is organized in a particular way so that range queries will not have to read the entire table. For instance, you don't need to read an entire phone book whenever you want to order a pizza, because the phone book is in alphabetical order. Defining a partition results in the underlying data being stored in files that are partitioned by a specified column (or columns). This can enhance performance immensely, especially on range queries, because the data is already separated into files based on the column value. This decreases the number of mappers and the amount of shuffling and sorting of data in the resulting MapReduce job.

Creating an External Table

image

An EXTERNAL table resides outside of Hive's control and is often used to pull in data from a different source. An external table is just like a regular Hive table, with one exception: Hive will not delete the underlying /apps/hive/warehouse/Tweets_Today folder when the table is dropped, because the data is considered outside of Hive's control.
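
A minimal sketch of an external create (the column names are hypothetical; the only difference from an ordinary create is the EXTERNAL keyword):

CREATE EXTERNAL TABLE Tweets_Today (
 Tweet_Id BIGINT
,Tweet_Text STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;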

Creating an External Table With a Specific Location

image

An EXTERNAL table resides outside of Hive's control and is often used to pull in data from a different source. The keyword LOCATION dictates where the source file (or directory) can be found. Hive table data normally resides in HDFS under special Hive subfolders of /apps/hive/warehouse, but external tables are stored in the folder specified by the LOCATION clause.

INSERT/SELECT is One Method of Loading Data

image

Use the pattern above when all columns are desired and you want to make an exact copy of the source table. Both tables must have exactly the same number of columns, in exactly the same order, with exactly the same data types.
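
A sketch of the pattern (both table names are hypothetical):

INSERT INTO TABLE Employee_Copy
SELECT * FROM Employee_Table ;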

Using Buckets for Table Joins

image

These tables join on Customer_Number, so cluster them into the same number of buckets on the join key and the join speeds up, because the matching rows land on the same mapper.

Hive tables can be organized into buckets, which determines the way the underlying files are stored. Bucketing is excellent because joins perform faster when both tables are clustered by the join columns into the same number of buckets. Sampling is also more efficient because the data is already split up into smaller pieces. The bucketing formula is consistent because Hive hashes the column value and then applies the modulo operator (by the number of buckets) to the hash value.
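
A sketch of two bucketed tables that could join on Customer_Number (the column lists and the bucket count of 16 are hypothetical; what matters is that both tables bucket on the join key with the same number of buckets):

CREATE TABLE Customer_Table (
 Customer_Number INT
,Customer_Name STRING
)
CLUSTERED BY (Customer_Number) INTO 16 BUCKETS ;

CREATE TABLE Order_Table (
 Order_Number INT
,Customer_Number INT
,Order_Total DOUBLE
)
CLUSTERED BY (Customer_Number) INTO 16 BUCKETS ;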

Defining Skewed Tables

image

By specifying the values with heavy skew, Hive will split those out into separate files and automatically tune queries so that it can skip whole files when appropriate.

In the Subscriber_Addresses table above, records with a zip of 45042 or 40210 will be stored in separate files because the assumption is that there will be a large number of customers in those two ZIP codes. This technique is used so there will not be too many like values on the same mapper, thus helping the parallel processing aspect of Hadoop.
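
A sketch of the skewed create (the non-ZIP columns are hypothetical; SKEWED BY ... ON is the standard Hive syntax and STORED AS DIRECTORIES is optional):

CREATE TABLE Subscriber_Addresses (
 Subscriber_No INT
,Street STRING
,City STRING
,State STRING
,Zip INT
)
SKEWED BY (Zip) ON (45042, 40210)
STORED AS DIRECTORIES ;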

Defining a Table Location

Hive usually stores the underlying data in /apps/hive/warehouse, but you can use the LOCATION clause in the table create statement and store it in any folder in the HDFS system.

CREATE EXTERNAL TABLE sales_table (
 Product_ID int
,Daily_Sales double
,Sale_Time Timestamp
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/teratom/sales_table/' ;

In the example above, the table data for sales_table will be whatever is in the /user/teratom/sales_table directory.

Creating a Text File Table

image

At minimum, you must include the table name, along with the column names and their data types, to create a table. The table create above has columns that are delimited by the Unicode U+0001 character. The format used to store the data is TEXTFILE.
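
A minimal sketch matching that description (the table and column names are hypothetical; '\001' is the usual way to write the U+0001 delimiter in HiveQL):

CREATE TABLE Click_Stream (
 User_No INT
,Page_Viewed STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE ;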

Distribute By for Loading Data

image

Hive uses the columns in DISTRIBUTE BY to distribute the rows among nodes by hashing the DISTRIBUTE BY column. In other words, all rows with the same customer_number will go to the same node, because a single math formula (a hash formula) is used. DISTRIBUTE BY is typically used in conjunction with an INSERT statement, or when using Hadoop streaming with custom mappers and/or reducers.
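
A sketch of the pattern (both table names are hypothetical):

INSERT OVERWRITE TABLE customer_distributed
SELECT *
FROM customer_staging
DISTRIBUTE BY customer_number ;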

Sort By on Data Loads

image

The DISTRIBUTE BY clause does not guarantee any type of sorting of the records. You can use SORT BY to cluster records with the same DISTRIBUTE BY column together. This will place like records adjacent to one another on the node they were mapped to. If you use DISTRIBUTE BY followed by a SORT BY on the same column, you can use CLUSTER BY instead and get the same result.
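
Extending the earlier hypothetical load, a sketch of adding the sort:

INSERT OVERWRITE TABLE customer_distributed
SELECT *
FROM customer_staging
DISTRIBUTE BY customer_number
SORT BY customer_number ;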

Cluster By Distributes and Sorts by the Same Key

image

CLUSTER BY will put like customer_number values on the same node and then sort them by customer_number.
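
A sketch of the equivalent CLUSTER BY form (same hypothetical tables as above):

INSERT OVERWRITE TABLE customer_distributed
SELECT *
FROM customer_staging
CLUSTER BY customer_number ;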

Hive does Not Store Data, But HDFS Does in These Formats

Table data in HDFS is stored in one of the following formats:

Text file - Comma, tab, or other delimited files.

SequenceFile - Serialized key/value pairs that can quickly be deserialized within the Hadoop system.

RCFile - A record columnar file that organizes data in files by columns (vs. the traditional row format).

ORC File - The optimized row columnar format that improves Hive's efficiency by striping data in 64 MB stripes (Hive 0.14).

Hive does not actually store data like a traditional RDBMS, but instead the data is stored in HDFS. There are many options that include a Text File, SequenceFile, RCFile or an ORC File.

Creating Tables as a Text file

image

The example above is a text file table. Text files can be comma, tab, or other delimited file types.

Hive SerDes Means Serializer/Deserializer

Records can be stored in a wide variety of formats because of Hadoop's ability to use custom Java classes, or you have the option to use one of the built-in SerDes that are listed below:

AvroSerDe – Used for reading/writing files using the Avro schema.

RegexSerDe – When a regular expression is used to deserialize data.

ColumnarSerDe – Only used on columnar storage with RCFiles.

OrcSerDe – Used for reading/writing ORC files.

The term SerDe is short for serializer/deserializer. Records read in from a table are deserialized, and whenever they are written back to HDFS they are serialized. Think of it a little bit like something that is decompressed for reading and compressed again when written to disk.

Creating a Table as a SERDE

Using SerDes requires the ROW FORMAT SERDE clause

CREATE TABLE emails (Sender string
,Sent_to string
,Message string
,Ccd string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe' ;

The example above creates a table with a SerDe row format. The term SerDe is short for serializer/deserializer: records read in from a table are deserialized, and whenever they are written back to HDFS they are serialized. Think of it a little bit like something that is decompressed for reading and compressed again when written to disk.

Creating Tables as a SERDE with Advanced Options

Using SerDes requires the ROW FORMAT SERDE clause

CREATE TABLE emails (Sender string
,Sent_to string
,Message string
,Ccd string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES
('avro.schema.url'='file:///tmp/schema.avsc')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' ;

The example above might not work unless the schema URL points to an existing Avro schema file. This is an example of a SerDe row format with SERDEPROPERTIES.

Creating Tables as an RCFile

CREATE TABLE Address_Columnar (
 Subscriber_No INT
,Street STRING
,City STRING
,State STRING
,Zip INT
,Areacode SMALLINT
,Phone INT
) STORED AS RCFile ;

RCFiles do not organize data in rows like traditional systems, but instead use a columnar format. This means that a node will still be responsible for an entire row, but each column in the row is stored in its own file. If a node was responsible for 100 rows, each containing 5 columns, the storage would be five files (one per column), with each holding 100 entries. The column entries would all perfectly align so the row can easily be re-built for a SELECT * query.

Creating Tables as ORC files

image

ORC files are created with sets of rows called stripes, and the stripe size is 64 MB by default (Hive 0.14). These large stripes allow Hadoop to bring larger amounts of data into memory, so it is more efficient when reading columns. An ORC file will always contain a footer that lists all of the stripe locations in the file, along with a count, min values, max values, and a sum for column data. An ORC file will also hold compression parameters and the size of the compressed footer in a postscript section.
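
A minimal sketch of an ORC table (the table and columns are hypothetical):

CREATE TABLE Orders_ORC (
 Order_Number INT
,Customer_Number INT
,Order_Total DOUBLE
)
STORED AS ORC ;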

Altering a Table to Add a Column

image

This statement adds a column named Mgr_Phone to the Department table. The data type must always be included.
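
A sketch of the statement described (the STRING data type chosen for Mgr_Phone is an assumption):

ALTER TABLE Department ADD COLUMNS (Mgr_Phone STRING) ;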

Renaming a Table

image

To rename a table, use the ALTER TABLE statement just like the example above.
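
A sketch of the syntax (the new table name is hypothetical):

ALTER TABLE Department RENAME TO Department_Table ;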

Dropping a Table

image

Dropping a table is easy to do and there is no "Oops" button. Be careful!
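
A sketch (the table name is hypothetical; the optional IF EXISTS avoids an error when the table is not there):

DROP TABLE IF EXISTS Department_Old ;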

Creating a Table Using a CTAS

image

You can create a table from a CTAS (Create Table AS) statement as in the above example. This not only creates a new table, but populates it with data in one step. We were able to run a query right after the create statement.
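
A sketch of the pattern (the table names and WHERE condition are hypothetical):

CREATE TABLE Florida_Customers AS
SELECT *
FROM Customer_Table
WHERE State = 'FL' ;

SELECT * FROM Florida_Customers ;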

Creating a Table Using a CTAS Join

image

You can create a table from a CTAS (Create Table AS) join statement as in the above example.
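
A sketch of a CTAS that joins two hypothetical tables:

CREATE TABLE Customer_Orders AS
SELECT c.Customer_Number
      ,c.Customer_Name
      ,o.Order_Total
FROM Customer_Table c
JOIN Order_Table o
  ON c.Customer_Number = o.Customer_Number ;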

Creating a Temporary Table Using a CTAS

image

You can create a temporary table using a CTAS (Create Table AS) statement, as in the above example.
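
A sketch (the table names and condition are hypothetical; TEMPORARY tables require Hive 0.14 or later and disappear at the end of the session):

CREATE TEMPORARY TABLE Temp_Customers AS
SELECT *
FROM Customer_Table
WHERE State = 'FL' ;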

Creating a Temporary Table Using a LIKE Command

image

You can create a temporary table from a LIKE command, but you must also populate the table afterwards.
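
A sketch of the two-step approach (the table names are hypothetical):

CREATE TEMPORARY TABLE Temp_Customers LIKE Customer_Table ;

INSERT INTO TABLE Temp_Customers
SELECT * FROM Customer_Table ;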

Collecting Statistics – Cost Based Optimization (CBO)

image

The ANALYZE TABLE command gathers statistics for a table, a partition, and the columns within the table. The statistics help the optimizer choose the best cost-based plan for the nodes to get the data. These statistics are written to the Hive metastore. Above, we have just collected statistics on the Employee_Table.

Collecting Statistics on Particular Columns of a Table

image

The ANALYZE TABLE command gathers statistics for a table, a partition, and the columns within the table. The statistics help the optimizer choose the best cost-based plan for the nodes to get the data. These statistics are written to the Hive metastore. Above, we have just collected statistics on the columns Customer_Number and Order_Time in the Order_Table.

Best Practices for Hive Cost Based Optimization

If you have a table, such as the Order_Table that joins to other
tables on Customer_Number and Order_Time you would first:

ANALYZE TABLE Order_Table COMPUTE STATISTICS  ;

You would then compute statistics at the column level for
Customer_Number and Order_Time.

ANALYZE TABLE Order_Table COMPUTE STATISTICS
FOR COLUMNS
Customer_Number, Order_Time ;

You would then set the following properties to enable CBO:

SET hive.compute.query.using.stats=true;

SET hive.cbo.enable=true;

SET hive.stats.fetch.column.stats=true;

There are two terms worth mentioning: Optiq and CBO (Cost Based Optimization). Optiq is used to reorder joins in order to choose the best join plan. Cardinality and boundary statistics are used for this form of join cost-based optimization.

Setting the Following Properties to Enable CBO

image

Now that you have set the properties to enable Cost Based Optimization (CBO), the system will utilize the stats in order to build a better plan for joins and querying a table.

Vectorization

image

Vectorization allows Hive to process a batch of up to 1,024 rows together instead of processing a single row at a time. To take advantage of vectorization, your table needs to be in the ORC format. You also need to enable vectorization, as in the example above. Vectorization is a joint effort between Hortonworks and Microsoft. This skyrockets Hive performance.
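
A sketch of the setting the text refers to (this is the standard Hive property name; ORC storage is still required):

SET hive.vectorized.execution.enabled=true;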

Use the DESCRIBE FORMATTED Function to See Statistics

image

The ANALYZE TABLE command gathers statistics for a table, a partition, and the columns within the table, and those statistics are written to the Hive metastore. The DESCRIBE FORMATTED command shown above displays the stored statistics, along with the rest of the table's metadata, so you can confirm what the optimizer will use.
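
A sketch of the command (Order_Table is the table used in the earlier statistics examples):

DESCRIBE FORMATTED Order_Table ;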

Hadoop Numeric Data Types

Tinyint – The size is 1-byte and it is signed. It has a range of -128 to +127.

Smallint – The size is 2 bytes and it is signed. It has a range of
-32768 to +32767.

Int – The size is 4 bytes and tables will error if you spell out INTEGER. It
is signed and has a range of values of -2,147,483,648 to +2,147,483,647.

Bigint – The size is 8 bytes and it is signed. The range of values is
-9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.

Float – The size is 4 bytes and it is a single precision floating point number.

Double – The size is 8 bytes and it is a double precision floating point
number.

Decimal – Introduced in Hive 0.11.0 with a precision of 38 digits.
Hive 0.13.0 introduced user-definable precision and scale.

Here are some of the data types for tables in a Hadoop system. Turn the page for more.

Hadoop Date/Time Data Types

Timestamp – Timestamps are interpreted to be without a timezone and they are stored as an offset from the UNIX epoch. They support optional nanosecond precision and use a JDBC-compliant format of "YYYY-MM-DD HH:MM:SS.fffffffff". Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. They are only available starting with Hive 0.8.0.

Date – DATE values follow the form YYYY-MM-DD. Date types do not have the time of day included. They have a range of values of 0000-01-01 to 9999-12-31. Dates were introduced in Hive 0.12.0.

Here are some of the data types for tables in a Hadoop system. Turn the page for more.

Hadoop String Data Types Continued

String – String literals can be expressed with either single quotes (') or double quotes (").

Varchar[ (n) ] – This is a variable-length character data type that only became available starting with Hive 0.12.0. Varchar types are created with a length specifier (between 1 and 65535), which is the maximum number of characters allowed. The string is silently truncated if the value being converted/assigned to a varchar value exceeds the length specifier.

Char [ (n) ] – This is a fixed-length character data type that only became available starting with Hive 0.13.0.

Here are some of the data types for tables in a Hadoop system. Turn the page for more.

Hadoop Miscellaneous Data Types Continued

Boolean – The size is 1 byte and it is a logical Boolean that provides a true/false value (true/false, t/f, yes/no, y/n or 1/0).

Binary – Only available starting with Hive 0.8.0.

Array – Only available starting with Hive 0.8.0.

Here are some of the data types for tables in a Hadoop system.
