Chapter 4. Data Definition

A NOTE FOR EARLY RELEASE READERS

This will be the 5th chapter of the final book. Please note that the Github repo will be made active later on.

If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material, please reach out to the author at [email protected].

Overview

Presto provides a set of Data Definition Language (DDL) statements for creating and manipulating database objects such as schemas, tables, columns, and views. Presto also provides a set of built in SQL Standard data types. In this chapter we will learn about both DDL and data types in Presto.

Catalogs

From the previous chapters we’ve learn how Presto access its data via connectors, which are mounted in Presto as catalogs. Catalogs contains one or more schemas which provide a collection of tables. For example, the Hive connector maps each Hive database to a schema which contain Hive tables. As another example, the PostgreSQL connector maps each PostgreSQL schema to a schema in Presto.

There is currently no DDL that allows one to create a catalog. You must configure a connector with Presto to connect to a particular data source. A configured connector is exposed as catalog within Presto.

        presto> show catalogs;
Catalog
---------
hive
postgres
kafka
system
(4 rows)

We’ll discuss connectors and configurations in greater detail in {{Chapter X}}.

Schema Definition

As we learned in Chapter 2, Getting Started with Presto, Presto contains Catalogs which contain Schema. Schemas hold tables, views, and various other objects.

Creating a Schema

Let’s look at the SQL statement to create a schema.

CREATE SCHEMA [ IF NOT EXISTS ] schema_name
[ WITH ( property_name = expression [, ...] ) ]

The WITH clause also associating properties with the schema. For example, for the Hive connector creating a schema will actually create a database in Hive. It is sometimes desirable to override the default location for the database as specified by hive.metastore.warehouse.dir.

CREATE SCHEMA hive.web
WITH (location = 's3://starburst-oreilly/web/')

Always refer to the latest Presto documentation for the list of schema properties. Or you can also query the list via SQL in Presto

presto> select * from system.metadata.schema_propertiesG

-[ RECORD 1 ]-+------------------------------
catalog_name | hive
property_name | location
default_value |
type | varchar
description | Base file system location URI

Altering a Schema

Using the ALTER SCHEMA statement, you can change the name of an existing schema.

ALTER SCHEMA name RENAME TO new_name

Dropping a Schema

Using the DROP SCHEMA statement, you can drop a schema.

DROP SCHEMA [ IF EXISTS ] schema_name

Specify the IF EXISTS piece if you do not want the statement to error if the schema does not exists. You will also not be able to drop a schema is tables exists in the schema. You need to drop the tables first. Some database systems support a CASCADE keyword that indicates the DROP statement to drop everything within the object such as a schema. However, Presto does not support this at this time.

Table Definitions

Now that we understand catalogs and schemas, let’s learn about table definitions in Presto.

Creating a Table

Let’s take a look at CREATE TABLE for creating a table in Presto.

CREATE TABLE [ IF NOT EXISTS ]
table_name (
{ column_name data_type [ COMMENT comment ] [ WITH ( property_name = expression
[, ...] ) ]
| LIKE existing_table_name [ { INCLUDING | EXCLUDING } PROPERTIES ] }
[, ...]
)
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]

This general syntax should look familiar to you if you know SQL. In Presto, the optional WITH clause has a very important use. Other systems such as Hive, have extended the SQL language so that users can specify properties that cannot be otherwise expressed in standard SQL. Following this approach with Presto would not only violate the underlying philosophy to stay as close to the SQL standard as possible, it would also become unmanageable as more connectors are added. Therefore, using table and column properties is a good way to accomplish this.

Example: Creating Tables using the Hive Connector

In {{Chapter X}} on Data Organization, we saw how youcan use the Hive connector to manage Hive-table table format for data in distributed storage systems such as HDFS, S3, Azure Storage, or Google Cloud Storage. Let’s look at a few of the table properties in the Hive connector.

Property Name Property Description
external_location This is the file system location for an external Hive table. For example, this could be a location AWS S3 or Azure Blob Storage.
format This is the file storage format for the underlying data. For example, this could be ORC, AVRO, PARQUET, etc.

This is not an inclusive list, and in {{Chapter X}} on Presto connectors, we will discuss the various properties available for each connector in more detail. Using the properties above, let’s create a table in Hive via Presto that is identical to how the table would have been created in Hive.

Let’s start with this Hive syntax1:

CREATE EXTERNAL TABLE page_views(
view_time INT,
user_id BIGINT,
page_url STRING,
view_date DATE,
country STRING)
STORED AS ORC
LOCATION 's3://starburst-oreilly/web/page_views/';

Using Presto, the table can defined as:

CREATE TABLE hive.web.page_views(
view_time timestamp,
user_id BIGINT,
page_url VARCHAR,
view_date DATE,
country VARCHAR
)
WITH (
format = 'ORC',
external_location = 's3://starburst-oreilly/web/page_views'
);

As you can see, the Hive DDL has extended SQL which may work ok for Hive. However, it would not work well with Presto and be hard to maintain because of the connector architecture approach. Therefore encapsulating these properties within a key-value list is a much cleaner approach.

Table and Column Properties

As Presto advances, new connectors and properties are added. You should refer to the online Presto documentation for the most up to date set of connectors and properties.

You can also query the system metadata. To list the available table properties, you can run the following query:

SELECT * FROM system.metadata.table_properties;

To list the available column properties, you can run the following query:

SELECT * FROM system.metadata.column_properties;

Copying an Existing Table Definition

You can create a new table by using an existing tables as a template. The LIKE clause will create a table with the same column definitions from an existing table. By default this will not copy of the table and column properties. If you wish to table and column properties, be sure to include INCLUDING PROPERTIES in the syntax. This feature is useful when performing some type of transformation of the data using Presto.

CREATE TABLE hive.web.page_view_bucketed(
comment VARCHAR,
LIKE hive.web.page_views INCLUDING PROPERTIES
)
WITH (
bucketed_by = ARRAY['user_id'],
bucket_count = 50
)

If you want to see what your newly created table definition is you can use the SHOW statement.

presto:web> show create table hive.web.page_view_bucketed;
Create Table
---------------------------------------------------
CREATE TABLE hive.web.page_view_bucketed (
comment varchar,
view_time timestamp,
user_id bigint,
page_url varchar,
view_date date,
country varchar
)
WITH (
bucket_count = 50,
bucketed_by = ARRAY['user_id'],
format = 'ORC',
partitioned_by = ARRAY['view_date','country'],
sorted_by = ARRAY[]
)
(1 row)

For reference, this was the original table we copied.

presto:web> show create table hive.web2.page_views;
Create Table
--------------------------------------------------
CREATE TABLE hive.web.page_views (
view_time timestamp,
user_id bigint,
page_url varchar,
view_date date,
country varchar
)
WITH (
format = 'ORC',
partitioned_by = ARRAY['view_date','country']
)
(1 row)

Creating a New Table Definition from Query Results

If you are comfortable with SQL, then you probably know about the CREATE TABLE AS or (CTAS) statement which can be used to create a new table that contains the results of a SELECT query. The column definitions for the table are created dynamically by looking at the result column data from the query. The statement can be used for creating temporary tables in Presto or as part of some process to create transformed tables.

CREATE TABLE [ IF NOT EXISTS ] table_name [ ( column_alias, ... ) ]
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]
AS query
[ WITH [ NO ] DATA ]

By default, the new table will be populated with the result data from the query. For example, you may use CTAS to create and load unpartitioned data in TEXTFILE format into a partitioned data in ORC format.

CREATE TABLE hive.web.page_views_orc_part 
WITH (
format = 'ORC',
partitioned_by = ARRAY['view_date','country']
)
AS
SELECT *
FROM hive.web.page_view_text

Additionally, you may use CTAS for transforming data. Below is an example of creating a table from the resulting sessionization query over the page_views table.

CREATE TABLE hive.web.user_sessions 
AS
SELECT user_id,
view_time,
sum(session_boundary)
OVER (
PARTITION BY user_id
ORDER BY view_time) AS session_id
FROM (SELECT user_id,
view_time,
CASE
WHEN to_unixtime(view_time) -
lag(To_unixtime(view_time), 1)
OVER(
PARTITION BY user_id
ORDER BY view_time) >= 30
THEN 1
ELSE 0
END AS session_boundary
FROM page_views) T
ORDER BY user_id,
session_id

Occasionally you may with to use CTAS as a way to simply copy the table definition without actually copying the data itself. You can do this by adding WITH NO DATA clause at the end of the statement.

Altering a Table

Using the ALTER TABLE statement in Presto you can perform actions such as renaming a table, adding a column to the table, dropping a column from a table, and renaming a column in a table.

ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMN column_name data_type
[ COMMENT comment ] [ WITH ( property_name = expression [, ...] ) ]
ALTER TABLE name DROP COLUMN column_name
ALTER TABLE name RENAME COLUMN column_name TO new_column_name

It’s important to note that depending on the connector and authorization model for the connector, these operations may not be allowed using the default behavior. For example, the Hive connector restricts these operations by default. For more information on authorization, refer {{Chapter X}}.

Deleting from a Table

Using the DELETE statement in Presto you can delete rows of data from a table. The statement provides and optional WHERE clause to restrict which rows are deleted. If the WHERE clause is omitted, then all the data may be deleted from the table.

DELETE FROM table_name [ WHERE condition ]

As with other DDL, various connectors may have limited or not support for the DELETE statement. For example, the DELETE statement is not supported (or makes much sense) with the Kafka connector. And the Hive connector only supports DELETE if there is a WHERE clause specifying a partition key which can be used to delete entire partitions.

DELETE FROM hive.web.page_views
WHERE view_date = DATE '2019-01-14' AND country = 'US'

Dropping a Table

Using the DROP table statement, you can drop the table.

DROP TABLE [ IF EXISTS ] table_name

Depending on the connector implementation, this may or may not drop the underlying data. You should refer to the connector documentation for further explanation.

DDL with Connectors

So far in this chapter we’ve gone over the various DDL statements Presto supports. However, it does not mean that every data source in Presto supports the DDL or provide the same semantics. This is determined by the connector implementation and the capabilities of the underlying data source.

If you try a statement or operation that is not supported by a particular connector Presto will return an error. For example, the System connector is used to query information about the Presto system. It does not support or makes much sense to allow one to create tables.

presto:web> use system.runtime;
USE
presto:runtime> create table foo(a int);
Query failed: This connector does not support creating tables

Data Types

Presto supports most of the data types described by the SQL Standard, which are also supported by many relational databases. In this section, we will discuss data types specific to Presto. However, as with the DDL not all Presto connectors support all Presto’s type. And Presto may not support all the types from the underlying data source of a connector either. It is also connector implementation dependent as to how the data types are translated to and from the underlying data source into Presto.The following tables describe the example data types in Presto and provide example data where applicable.

Boolean Data Type

Type Description Example
BOOLEAN Boolean value true or false True

Integer Data Types

Type Description Example
TINYINT 8-bit signed integer. It has a minimum value of -2^7 and a maximum value of 2^7 - 1. 42
SMALLINT 16-bit signed integer. It has a minimum value of -2^15 and a maximum value of 2^15 - 1. 42
INTEGER, INT 32-bit signed integer. It has a minimum value of -2^31 and a maximum value of 2^31 - 1. INT is an alias for INTEGER. 42
BIGINT 64-bit signed integer. It has a minimum value of -2^63 and a maximum value of 2^63 - 1. 42

Floating-Point Data Types

Type Description Example
REAL 32-bit Floating-Point. Implemented by the IEEE Standard 754 for Binary Floating-Point Arithmetic. 2.71828
DOUBLE 32-bit Floating-Point. the IEEE Standard 754 for Binary Floating-Point Arithmetic. 2.71828

Fixed-Precision Data Types

Type Description Example
DECIMAL A fixed precision decimal number. See section in Decimal for an in depth description. 1234567890

String Data Types

Type Description Example
VARCHAR
VARCHAR(n)
A variable length string of characters. There is an optional maximum length when defined as VARCHAR(n), where n is a positive integer representing the maximum number of characters. “Hello World”
CHAR
CHAR(n)
A fixed length string of characters. There is an optional length when defined as CHAR(n). Where n is a positive integer defining the length of character. CHAR is equivalent to CHAR(1). “Hello World “
VARBINARY A variable length binary byte string.

Unlike VARCHAR, CHAR will always have n characters. Here are some behavior and errors you should be aware:

  • If you are trying to cast a character string that has less characters than n, trailing spaces will be added.

  • If you are trying to cast a character string has more than n characters, it will be truncated.

  • If you attempt to insert into a table a VARCHAR or CHAR longer than as defined in the column, an error will occur.

  • If you attempt to insert into a table a CHAR that is shorter than as defined in the column, the value will be space padded to match the defined length.

  • If you attempt to insert into a table a VARCHAR that is shorter than as defined in the column, the exact length of the string is stored. Leading and trailing spaces are included when comparing CHAR values.

The below examples highlight these behaviors:

presto> select length(cast('hello world' as char(15)));
_col0
-------
15
(1 row)
presto> select cast('hello world' as char(15)) || '~';
_col0
------------------
hello world ~
(1 row)
presto> select cast('hello world' as char(5));
_col0
-------
hello
(1 row)
presto> select length(cast('hello world' as varchar(15)));
_col0
-------
11
(1 row)
presto> select cast('hello world' as varchar(15)) || '~';
_col0
--------------
hello world~
(1 row)
presto:default> select cast('hello world' as char(15)) = cast('hello world'
as char(14));
_col0
-------
false
(1 row)
presto:default> select cast('hello world'
as varchar(15)) = cast('hello world' as varchar(14));
_col0
-------
true
(1 row)


presto:default> create table varchars(col varchar(5));
CREATE TABLE

presto:default> insert into varchars values('1234');
INSERT: 1 row
presto:default> insert into varchars values('123456');
Query failed: Insert query has mismatched column types: Table: [varchar(5)],
Query: [varchar(6)]

Temporal Data Types

The following tables describe temporal data types. Temporal data types are related to time. Coarsely, these data types could be timestamps, dates, time, or intervals.

Type Description Example
DATE A Calendar date representing the year, month, and day. DATE `1983-10-19’
TIME A time of day representing hour, minute, second, and millisecond, but not including a time zone. TIME '02:56:15.123'
TIME WITH TIME ZONE A time of day representing hour, minute, second, and millisecond, including a time zone. See examples below.
TIMESTAMP A date and time without a time zone. See examples below.
TIMESTAMP WITH TIMEZONE A date and time with a time zone. See examples below.
INTERVAL YEAR TO MONTH An interval span of years and days. INTERVAL ‘1-2’ YEAR TO MONTH
See section below on Intervals.
INTERVAL DAY TO SECOND An interval span of days, hours, minutes, seconds, and milliseconds. INTERVAL ‘5’ DAY to SECOND
See section below on Intervals.

Input and Output formats and representations for Temporal Types

In Presto, TIMESTAMP is represented as a Java Instant2 type representing the amount of time before or after the Java epoch. This should be transparent to the end user as values are parsed and displayed in a different format.

For types that do not include TIMEZONE information, the values are parsed and displayed according to the Presto session time zone. For types that include the TIMEZONE information, the values are parsed and displayed using the TIMEZONE.

String literals can be parsed by Presto into a TIMESTAMP, TIMESTAMP WITH TIMEZONE, TIME, TIME WITH TIMEZONE, or DATE. The following tables describe the formats accepted by Presto for parsing. If you want to use ISO 8601, you can use the from_iso8601_timestamp or from_iso8601_date functions. See {{Chapter X}} on Functions in the Presto for more information.

TIMESTAMP TIMESTAMP WITH TIMEZONE
yyyy-M-d yyyy-M-d ZZZ
yyyy-M-d H:m yyyy-M-d H:m ZZZ
yyyy-M-d H:m:s yyyy-M-d H:m:s ZZZ
yyyy-M-d H:m:s.SSS yyyy-M-d H:m:s.SSS ZZZ
TIME TIME WITH TIMEZONE
H:m H:m ZZZ
H:m:s H:m:s ZZZ
H:m:s.SSS H:m:s.SSS ZZZ
DATE
YYYY-MM-DD

When printing the output for TIMESTAMP, TIMESTAMP WITH TIMEZONE, TIME, TIME WITH TIMEZONE, or DATE, Presto will output in the formats below. If you want to output in strict ISO 8601 format then you can use the to_iso801 function. See {{Chapter X}} on Function in the Presto for more information.

Data Type Format
TIMESTAMP yyyy-MM-dd HH:mm:ss.SSS ZZZ
TIMESTAMP WITH TIMEZONE yyyy-MM-dd HH:mm:ss.SSS ZZZ
TIME yyyy-MM-dd HH:mm:ss.SSS ZZZ
TIME WITH TIMEZONE yyyy-MM-dd HH:mm:ss.SSS ZZZ
DATE YYYY-MM-DD

Timezones in Presto

Timezone adds important additional information the type. Timezones are defined by IANA3 or via the UTC offset time. While Presto supports TIME WITH TIME ZONE, it’s often best to use this in conjunction with a DATE or use TIMESTAMP. This is because the daylight savings times should be accounted for and a DATE is needed for that.

Example Timezones

America/New_York

America/Los_Angeles

Europe/Warsaw

+08:00

−10:00

The below examples highlight some of the behaviors described above:

presto> select TIME '02:56:15 UTC';
_col0
------------------
02:56:15.000 UTC
(1 row)
presto> select TIME '02:56:15 UTC' AT TIME ZONE 'America/Los_Angeles';
_col0
----------------------------------
18:56:15.000 America/Los_Angeles
presto> select TIME '02:56:15 UTC' AT TIME ZONE '-08:00';
_col0
---------------------
18:56:15.000 -08:00
(1 row)
presto> select TIMESTAMP '1983-10-19 07:30:05.123';
_col0
-------------------------
1983-10-19 07:30:05.123
(1 row)
presto> select TIMESTAMP '1983-10-19 07:30:05.123 America/New_York' AT TIME
ZONE 'UTC';
_col0
-----------------------------
1983-10-19 11:30:05.123 UTC
(1 row)

Intervals in Presto

The data type INTERVAL are either YEAR TO MONTH or DAY TO SECOND.

YEAR TO MONTH
INTERVAL ‘<years>-<months>’ YEAR TO MONTH
INTERVAL ‘<years>’ YEAR TO MONTH
INTERVAL ‘<years>’ YEAR
INTERVAL `<months>’ MONTH
DAY TO SECOND
INTERVAL ‘<days> <time>’ DAY TO SECOND
INTERVAL ‘<days>’’ DAY TO SECOND
INTERVAL ‘<days>’ DAY
INTERVAL ‘<hours>’ HOUR
INTERVAL ‘<minutes>’ MINUTE
INTERVAL ‘<seconds>’ SECOND

The below examples highlight some of the behaviors described above:

presto> select INTERVAL '1-2' YEAR TO MONTH;
_col0
-------
1-2
(1 row)
presto> select INTERVAL '4' MONTH;
_col0
-------
0-4
(1 row)
presto> select INTERVAL '4-1' DAY TO SECOND;
Query 20181218_215339_00075_8bmfb failed: Invalid INTERVAL DAY TO SECOND
value: 4-1
presto> select INTERVAL '4' DAY TO SECOND;
_col0
----------------
4 00:00:00.000
(1 row)
presto> select INTERVAL '4 01:03:05.44' DAY TO SECOND;
_col0
----------------
4 01:03:05.440
(1 row)
presto> select INTERVAL '05.44' SECOND;
_col0
----------------
0 00:00:05.440
(1 row)

Collection Data Types

Collection Data Type Example
ARRAY ARRAY[‘apples’, ‘oranges’, ‘pears’]
MAP MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 3])
JSON {"a":1,"b":2,"c":3}
ROW ROW(1, 2, 3)

In {{Chapter X}} on Functions and Operators, we will learn more examples of how to use these data types.

Custom Data Types

Presto also support custom data types that can be provided as a plugin to Presto. Refer to {{Chapter X}} about the Presto SPI for more detail.

Type Casting

Sometimes it is necessary to explicitly cast a value or literal to a different type. For this we can use the CAST function.

cast(value AS type)

Now let’s say you need to compare a DATE to a literal.

presto:web> SELECT * 
-> FROM hive.web.page_views
-> WHERE view_date > '2019-01-01';
Query failed: line 1:42: '>' cannot be applied to date, varchar(10)

This query failed because Presto does not have a comparison operator “>” (greater than) that knows how to compare a DATE and VARCHAR. However, it has a comparison function that knows how to compare a DATE to another DATE. Therefore, we need to use the CAST function to coerce one of the types. In this example, it makes the most sense to convert the literal to a DATE.

presto:web> SELECT * 
-> FROM hive.web.page_views
-> WHERE view_date > CAST('2019-01-01' as DATE);
view_time | user_id | page_url | view_data | country
-------------------------+---------+----------+------------+---------
2019-01-26 20:40:15.477 | 2 | http:// | 2019-01-26 | US
2019-01-26 20:41:01.243 | 3 | http:// | 2019-01-26 | US
...

It is actually possible to have coerced the DATE column view_date to a VARCHAR since there also exists a comparison operator that compares two VARCHARs. But it makes the most sense in this example to compare two DATEs.

Presto also provides another conversation function try_cast that will attempt to do the type coercion, but unlike cast which will return an error if the cast fails, try_cast will return a null value. This can be useful when an error is not necessary.

try_cast(value AS type)

Let’s take for example coercing a character literal to a number type.

presto:web> SELECT cast('1' as integer);
_col0
-------
1
(1 row)
presto:web> SELECT cast('a' as integer);
Query failed: Cannot cast 'a' to INT
presto:web> SELECT try_cast('a' as integer);
_col0
-------
NULL
(1 row)

Data Types with Connectors

The data types described above are the data types supported within the Presto query engine. However, given Presto’s connector base architecture, the underlying data sources may not support the same type or the same type may be named different. For example, the MySQL connector will map the Presto REAL type to a MySQL FLOAT. Sometimes the types need to be converted themselves where some connectors will convert an unsupported type into a Presto VARCHAR or it may ignore reading the column entirely. The connector implementation will know how to map and coerce the underlying data types to and from the data source with Presto. It’s best to refer to the connector’s documentation to best understand the type mapping behavior.

1 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

2 https://docs.oracle.com/javase/9/docs/api/java/time/Instant.html

3 https://www.iana.org/time-zones

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset