This will be the 5th chapter of the final book. Please note that the Github repo will be made active later on.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material, please reach out to the author at [email protected].
Presto provides a set of Data Definition Language (DDL) statements for creating and manipulating database objects such as schemas, tables, columns, and views. Presto also provides a set of built in SQL Standard data types. In this chapter we will learn about both DDL and data types in Presto.
From the previous chapters we’ve learn how Presto access its data via connectors, which are mounted in Presto as catalogs. Catalogs contains one or more schemas which provide a collection of tables. For example, the Hive connector maps each Hive database to a schema which contain Hive tables. As another example, the PostgreSQL connector maps each PostgreSQL schema to a schema in Presto.
There is currently no DDL that allows one to create a catalog. You must configure a connector with Presto to connect to a particular data source. A configured connector is exposed as catalog within Presto.
presto> show catalogs;
Catalog
---------
hive
postgres
kafka
system
(4 rows)
We’ll discuss connectors and configurations in greater detail in {{Chapter X}}.
As we learned in Chapter 2, Getting Started with Presto, Presto contains Catalogs which contain Schema. Schemas hold tables, views, and various other objects.
Let’s look at the SQL statement to create a schema.
CREATE SCHEMA [ IF NOT EXISTS ] schema_name
[ WITH ( property_name = expression [, ...] ) ]
The WITH
clause also associating properties with the schema. For example, for the Hive connector creating a schema will actually create a database in Hive. It is sometimes desirable to override the default location for the database as specified by hive.metastore.warehouse.dir
.
CREATE SCHEMA hive.web
WITH (location = 's3://starburst-oreilly/web/')
Always refer to the latest Presto documentation for the list of schema properties. Or you can also query the list via SQL in Presto
presto> select * from system.metadata.schema_propertiesG
-[ RECORD 1 ]-+------------------------------
catalog_name | hive
property_name | location
default_value |
type | varchar
description | Base file system location URI
Using the ALTER SCHEMA
statement, you can change the name of an existing schema.
ALTER SCHEMA name RENAME TO new_name
Using the DROP SCHEMA
statement, you can drop a schema.
DROP SCHEMA [ IF EXISTS ] schema_name
Specify the IF EXISTS
piece if you do not want the statement to error if the schema does not exists. You will also not be able to drop a schema is tables exists in the schema. You need to drop the tables first. Some database systems support a CASCADE
keyword that indicates the DROP
statement to drop everything within the object such as a schema. However, Presto does not support this at this time.
Now that we understand catalogs and schemas, let’s learn about table definitions in Presto.
Let’s take a look at CREATE TABLE
for creating a table in Presto.
CREATE TABLE [ IF NOT EXISTS ]
table_name (
{ column_name data_type [ COMMENT comment ] [ WITH ( property_name = expression
[, ...] ) ]
| LIKE existing_table_name [ { INCLUDING | EXCLUDING } PROPERTIES ] }
[, ...]
)
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]
This general syntax should look familiar to you if you know SQL. In Presto, the optional WITH
clause has a very important use. Other systems such as Hive, have extended the SQL language so that users can specify properties that cannot be otherwise expressed in standard SQL. Following this approach with Presto would not only violate the underlying philosophy to stay as close to the SQL standard as possible, it would also become unmanageable as more connectors are added. Therefore, using table and column properties is a good way to accomplish this.
In {{Chapter X}} on Data Organization, we saw how youcan use the Hive connector to manage Hive-table table format for data in distributed storage systems such as HDFS, S3, Azure Storage, or Google Cloud Storage. Let’s look at a few of the table properties in the Hive connector.
Property Name | Property Description |
external_location |
This is the file system location for an external Hive table. For example, this could be a location AWS S3 or Azure Blob Storage. |
format |
This is the file storage format for the underlying data. For example, this could be ORC , AVRO , PARQUET , etc. |
This is not an inclusive list, and in {{Chapter X}} on Presto connectors, we will discuss the various properties available for each connector in more detail. Using the properties above, let’s create a table in Hive via Presto that is identical to how the table would have been created in Hive.
Let’s start with this Hive syntax1:
CREATE EXTERNAL TABLE page_views(
view_time INT,
user_id BIGINT,
page_url STRING,
view_date DATE,
country STRING)
STORED AS ORC
LOCATION 's3://starburst-oreilly/web/page_views/';
Using Presto, the table can defined as:
CREATE TABLE hive.web.page_views(
view_time timestamp,
user_id BIGINT,
page_url VARCHAR,
view_date DATE,
country VARCHAR
)
WITH (
format = 'ORC',
external_location = 's3://starburst-oreilly/web/page_views'
);
As you can see, the Hive DDL has extended SQL which may work ok for Hive. However, it would not work well with Presto and be hard to maintain because of the connector architecture approach. Therefore encapsulating these properties within a key-value list is a much cleaner approach.
As Presto advances, new connectors and properties are added. You should refer to the online Presto documentation for the most up to date set of connectors and properties.
You can also query the system metadata. To list the available table properties, you can run the following query:
SELECT * FROM system.metadata.table_properties;
To list the available column properties, you can run the following query:
SELECT * FROM system.metadata.column_properties;
You can create a new table by using an existing tables as a template. The LIKE
clause will create a table with the same column definitions from an existing table. By default this will not copy of the table and column properties. If you wish to table and column properties, be sure to include INCLUDING PROPERTIES
in the syntax. This feature is useful when performing some type of transformation of the data using Presto.
CREATE TABLE hive.web.page_view_bucketed(
comment VARCHAR,
LIKE hive.web.page_views INCLUDING PROPERTIES
)
WITH (
bucketed_by = ARRAY['user_id'],
bucket_count = 50
)
If you want to see what your newly created table definition is you can use the SHOW
statement.
presto:web> show create table hive.web.page_view_bucketed;
Create Table
---------------------------------------------------
CREATE TABLE hive.web.page_view_bucketed (
comment varchar,
view_time timestamp,
user_id bigint,
page_url varchar,
view_date date,
country varchar
)
WITH (
bucket_count = 50,
bucketed_by = ARRAY['user_id'],
format = 'ORC',
partitioned_by = ARRAY['view_date','country'],
sorted_by = ARRAY[]
)
(1 row)
For reference, this was the original table we copied.
presto:web> show create table hive.web2.page_views;
Create Table
--------------------------------------------------
CREATE TABLE hive.web.page_views (
view_time timestamp,
user_id bigint,
page_url varchar,
view_date date,
country varchar
)
WITH (
format = 'ORC',
partitioned_by = ARRAY['view_date','country']
)
(1 row)
If you are comfortable with SQL, then you probably know about the CREATE TABLE AS
or (CTAS)
statement which can be used to create a new table that contains the results of a SELECT
query. The column definitions for the table are created dynamically by looking at the result column data from the query. The statement can be used for creating temporary tables in Presto or as part of some process to create transformed tables.
CREATE TABLE [ IF NOT EXISTS ] table_name [ ( column_alias, ... ) ]
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]
AS query
[ WITH [ NO ] DATA ]
By default, the new table will be populated with the result data from the query. For example, you may use CTAS
to create and load unpartitioned data in TEXTFILE
format into a partitioned data in ORC format.
CREATE TABLE hive.web.page_views_orc_part
WITH (
format = 'ORC',
partitioned_by = ARRAY['view_date','country']
)
AS
SELECT *
FROM hive.web.page_view_text
Additionally, you may use CTAS
for transforming data. Below is an example of creating a table from the resulting sessionization query over the page_views
table.
CREATE TABLE hive.web.user_sessions
AS
SELECT user_id,
view_time,
sum(session_boundary)
OVER (
PARTITION BY user_id
ORDER BY view_time) AS session_id
FROM (SELECT user_id,
view_time,
CASE
WHEN to_unixtime(view_time) -
lag(To_unixtime(view_time), 1)
OVER(
PARTITION BY user_id
ORDER BY view_time) >= 30
THEN 1
ELSE 0
END AS session_boundary
FROM page_views) T
ORDER BY user_id,
session_id
Occasionally you may with to use CTAS
as a way to simply copy the table definition without actually copying the data itself. You can do this by adding WITH NO DATA
clause at the end of the statement.
Using the ALTER TABLE statement in Presto you can perform actions such as renaming a table, adding a column to the table, dropping a column from a table, and renaming a column in a table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMN column_name data_type
[ COMMENT comment ] [ WITH ( property_name = expression [, ...] ) ]
ALTER TABLE name DROP COLUMN column_name
ALTER TABLE name RENAME COLUMN column_name TO new_column_name
It’s important to note that depending on the connector and authorization model for the connector, these operations may not be allowed using the default behavior. For example, the Hive connector restricts these operations by default. For more information on authorization, refer {{Chapter X}}.
Using the DELETE
statement in Presto you can delete rows of data from a table. The statement provides and optional WHERE
clause to restrict which rows are deleted. If the WHERE
clause is omitted, then all the data may be deleted from the table.
DELETE FROM table_name [ WHERE condition ]
As with other DDL, various connectors may have limited or not support for the DELETE
statement. For example, the DELETE
statement is not supported (or makes much sense) with the Kafka connector. And the Hive connector only supports DELETE
if there is a WHERE
clause specifying a partition key which can be used to delete entire partitions.
DELETE FROM hive.web.page_views
WHERE view_date = DATE '2019-01-14' AND country = 'US'
Using the DROP table statement, you can drop the table.
DROP TABLE [ IF EXISTS ] table_name
Depending on the connector implementation, this may or may not drop the underlying data. You should refer to the connector documentation for further explanation.
So far in this chapter we’ve gone over the various DDL statements Presto supports. However, it does not mean that every data source in Presto supports the DDL or provide the same semantics. This is determined by the connector implementation and the capabilities of the underlying data source.
If you try a statement or operation that is not supported by a particular connector Presto will return an error. For example, the System connector is used to query information about the Presto system. It does not support or makes much sense to allow one to create tables.
presto:web> use system.runtime;
USE
presto:runtime> create table foo(a int);
Query failed: This connector does not support creating tables
Presto supports most of the data types described by the SQL Standard, which are also supported by many relational databases. In this section, we will discuss data types specific to Presto. However, as with the DDL not all Presto connectors support all Presto’s type. And Presto may not support all the types from the underlying data source of a connector either. It is also connector implementation dependent as to how the data types are translated to and from the underlying data source into Presto.The following tables describe the example data types in Presto and provide example data where applicable.
Type | Description | Example |
BOOLEAN |
Boolean value true or false |
True |
Type | Description | Example |
TINYINT |
8-bit signed integer. It has a minimum value of -2^7 and a maximum value of 2^7 - 1. | 42 |
SMALLINT |
16-bit signed integer. It has a minimum value of -2^15 and a maximum value of 2^15 - 1. | 42 |
INTEGER , INT |
32-bit signed integer. It has a minimum value of -2^31 and a maximum value of 2^31 - 1. INT is an alias for INTEGER. | 42 |
BIGINT |
64-bit signed integer. It has a minimum value of -2^63 and a maximum value of 2^63 - 1. | 42 |
Type | Description | Example |
REAL |
32-bit Floating-Point. Implemented by the IEEE Standard 754 for Binary Floating-Point Arithmetic. | 2.71828 |
DOUBLE |
32-bit Floating-Point. the IEEE Standard 754 for Binary Floating-Point Arithmetic. | 2.71828 |
Type | Description | Example |
DECIMAL | A fixed precision decimal number. See section in Decimal for an in depth description. | 1234567890 |
Type | Description | Example |
VARCHAR VARCHAR(n) |
A variable length string of characters. There is an optional maximum length when defined as VARCHAR(n), where n is a positive integer representing the maximum number of characters. | “Hello World” |
CHAR CHAR(n) |
A fixed length string of characters. There is an optional length when defined as CHAR(n) . Where n is a positive integer defining the length of character. CHAR is equivalent to CHAR(1) . |
“Hello World “ |
VARBINARY |
A variable length binary byte string. |
Unlike VARCHAR
, CHAR
will always have n
characters. Here are some behavior and errors you should be aware:
If you are trying to cast a character string that has less characters than n
, trailing spaces will be added.
If you are trying to cast a character string has more than n
characters, it will be truncated.
If you attempt to insert into a table a VARCHAR
or CHAR
longer than as defined in the column, an error will occur.
If you attempt to insert into a table a CHAR
that is shorter than as defined in the column, the value will be space padded to match the defined length.
If you attempt to insert into a table a VARCHAR
that is shorter than as defined in the column, the exact length of the string is stored. Leading and trailing spaces are included when comparing CHAR
values.
The below examples highlight these behaviors:
presto> select length(cast('hello world' as char(15)));
_col0
-------
15
(1 row)
presto> select cast('hello world' as char(15)) || '~';
_col0
------------------
hello world ~
(1 row)
presto> select cast('hello world' as char(5));
_col0
-------
hello
(1 row)
presto> select length(cast('hello world' as varchar(15)));
_col0
-------
11
(1 row)
presto> select cast('hello world' as varchar(15)) || '~';
_col0
--------------
hello world~
(1 row)
presto:default> select cast('hello world' as char(15)) = cast('hello world'
as char(14));
_col0
-------
false
(1 row)
presto:default> select cast('hello world'
as varchar(15)) = cast('hello world' as varchar(14));
_col0
-------
true
(1 row)
presto:default> create table varchars(col varchar(5));
CREATE TABLE
presto:default> insert into varchars values('1234');
INSERT: 1 row
presto:default> insert into varchars values('123456');
Query failed: Insert query has mismatched column types: Table: [varchar(5)],
Query: [varchar(6)]
The following tables describe temporal data types. Temporal data types are related to time. Coarsely, these data types could be timestamps, dates, time, or intervals.
Type | Description | Example |
DATE |
A Calendar date representing the year, month, and day. | DATE `1983-10-19’ |
TIME |
A time of day representing hour, minute, second, and millisecond, but not including a time zone. | TIME '02:56:15.123' |
TIME WITH TIME ZONE |
A time of day representing hour, minute, second, and millisecond, including a time zone. | See examples below. |
TIMESTAMP |
A date and time without a time zone. | See examples below. |
TIMESTAMP WITH TIMEZONE |
A date and time with a time zone. | See examples below. |
INTERVAL YEAR TO MONTH |
An interval span of years and days. | INTERVAL ‘1-2’ YEAR TO MONTH See section below on Intervals. |
INTERVAL DAY TO SECOND |
An interval span of days, hours, minutes, seconds, and milliseconds. | INTERVAL ‘5’ DAY to SECOND See section below on Intervals. |
In Presto, TIMESTAMP
is represented as a Java Instant2 type representing the amount of time before or after the Java epoch. This should be transparent to the end user as values are parsed and displayed in a different format.
For types that do not include TIMEZONE
information, the values are parsed and displayed according to the Presto session time zone. For types that include the TIMEZONE
information, the values are parsed and displayed using the TIMEZONE
.
String literals can be parsed by Presto into a TIMESTAMP
, TIMESTAMP WITH TIMEZONE
, TIME
, TIME WITH TIMEZONE
, or DATE
. The following tables describe the formats accepted by Presto for parsing. If you want to use ISO 8601, you can use the from_iso8601_timestamp
or from_iso8601_date functions
. See {{Chapter X}} on Functions in the Presto for more information.
TIMESTAMP
|
TIMESTAMP WITH TIMEZONE
|
yyyy-M-d |
yyyy-M-d ZZZ |
yyyy-M-d H:m |
yyyy-M-d H:m ZZZ |
yyyy-M-d H:m:s |
yyyy-M-d H:m:s ZZZ |
yyyy-M-d H:m:s.SSS |
yyyy-M-d H:m:s.SSS ZZZ |
TIME
|
TIME WITH TIMEZONE
|
H:m |
H:m ZZZ |
H:m:s |
H:m:s ZZZ |
H:m:s.SSS |
H:m:s.SSS ZZZ |
DATE
|
YYYY-MM-DD |
When printing the output for TIMESTAMP
, TIMESTAMP WITH TIMEZONE
, TIME
, TIME WITH TIMEZONE
, or DATE
, Presto will output in the formats below. If you want to output in strict ISO 8601 format then you can use the to_iso801
function. See {{Chapter X}} on Function in the Presto for more information.
Data Type | Format |
TIMESTAMP |
yyyy-MM-dd HH:mm:ss.SSS ZZZ |
TIMESTAMP WITH TIMEZONE |
yyyy-MM-dd HH:mm:ss.SSS ZZZ |
TIME |
yyyy-MM-dd HH:mm:ss.SSS ZZZ |
TIME WITH TIMEZONE |
yyyy-MM-dd HH:mm:ss.SSS ZZZ |
DATE |
YYYY-MM-DD |
Timezone adds important additional information the type. Timezones are defined by IANA3 or via the UTC offset time. While Presto supports TIME WITH TIME ZONE
, it’s often best to use this in conjunction with a DATE
or use TIMESTAMP
. This is because the daylight savings times should be accounted for and a DATE is needed for that.
Example Timezones
America/New_York
America/Los_Angeles
Europe/Warsaw
+08:00
−10:00
The below examples highlight some of the behaviors described above:
presto> select TIME '02:56:15 UTC';
_col0
------------------
02:56:15.000 UTC
(1 row)
presto> select TIME '02:56:15 UTC' AT TIME ZONE 'America/Los_Angeles';
_col0
----------------------------------
18:56:15.000 America/Los_Angeles
presto> select TIME '02:56:15 UTC' AT TIME ZONE '-08:00';
_col0
---------------------
18:56:15.000 -08:00
(1 row)
presto> select TIMESTAMP '1983-10-19 07:30:05.123';
_col0
-------------------------
1983-10-19 07:30:05.123
(1 row)
presto> select TIMESTAMP '1983-10-19 07:30:05.123 America/New_York' AT TIME
ZONE 'UTC';
_col0
-----------------------------
1983-10-19 11:30:05.123 UTC
(1 row)
The data type INTERVAL
are either YEAR TO MONTH
or DAY TO SECOND
.
YEAR TO MONTH
|
INTERVAL ‘<years>-<months>’ YEAR TO MONTH |
INTERVAL ‘<years>’ YEAR TO MONTH |
INTERVAL ‘<years>’ YEAR |
INTERVAL `<months>’ MONTH |
DAY TO SECOND
|
INTERVAL ‘<days> <time>’ DAY TO SECOND |
INTERVAL ‘<days>’’ DAY TO SECOND |
INTERVAL ‘<days>’ DAY |
INTERVAL ‘<hours>’ HOUR |
INTERVAL ‘<minutes>’ MINUTE |
INTERVAL ‘<seconds>’ SECOND |
The below examples highlight some of the behaviors described above:
presto> select INTERVAL '1-2' YEAR TO MONTH;
_col0
-------
1-2
(1 row)
presto> select INTERVAL '4' MONTH;
_col0
-------
0-4
(1 row)
presto> select INTERVAL '4-1' DAY TO SECOND;
Query 20181218_215339_00075_8bmfb failed: Invalid INTERVAL DAY TO SECOND
value: 4-1
presto> select INTERVAL '4' DAY TO SECOND;
_col0
----------------
4 00:00:00.000
(1 row)
presto> select INTERVAL '4 01:03:05.44' DAY TO SECOND;
_col0
----------------
4 01:03:05.440
(1 row)
presto> select INTERVAL '05.44' SECOND;
_col0
----------------
0 00:00:05.440
(1 row)
Collection Data Type | Example |
ARRAY |
ARRAY[‘apples’, ‘oranges’, ‘pears’] |
MAP |
MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 3]) |
JSON |
{"a":1,"b":2,"c":3} |
ROW |
ROW(1, 2, 3) |
In {{Chapter X}} on Functions and Operators, we will learn more examples of how to use these data types.
Presto also support custom data types that can be provided as a plugin to Presto. Refer to {{Chapter X}} about the Presto SPI for more detail.
Sometimes it is necessary to explicitly cast a value or literal to a different type. For this we can use the CAST function.
cast(value AS type)
Now let’s say you need to compare a DATE to a literal.
presto:web> SELECT *
-> FROM hive.web.page_views
-> WHERE view_date > '2019-01-01';
Query failed: line 1:42: '>' cannot be applied to date, varchar(10)
This query failed because Presto does not have a comparison operator “>” (greater than) that knows how to compare a DATE
and VARCHAR
. However, it has a comparison function that knows how to compare a DATE
to another DATE
. Therefore, we need to use the CAST
function to coerce one of the types. In this example, it makes the most sense to convert the literal to a DATE
.
presto:web> SELECT *
-> FROM hive.web.page_views
-> WHERE view_date > CAST('2019-01-01' as DATE);
view_time | user_id | page_url | view_data | country
-------------------------+---------+----------+------------+---------
2019-01-26 20:40:15.477 | 2 | http:// | 2019-01-26 | US
2019-01-26 20:41:01.243 | 3 | http:// | 2019-01-26 | US
...
It is actually possible to have coerced the DATE
column view_date to a VARCHAR
since there also exists a comparison operator that compares two VARCHAR
s. But it makes the most sense in this example to compare two DATE
s.
Presto also provides another conversation function try_cast
that will attempt to do the type coercion, but unlike cast
which will return an error if the cast fails, try_cast
will return a null value. This can be useful when an error is not necessary.
try_cast(value AS type)
Let’s take for example coercing a character literal to a number type.
presto:web> SELECT cast('1' as integer);
_col0
-------
1
(1 row)
presto:web> SELECT cast('a' as integer);
Query failed: Cannot cast 'a' to INT
presto:web> SELECT try_cast('a' as integer);
_col0
-------
NULL
(1 row)
The data types described above are the data types supported within the Presto query engine. However, given Presto’s connector base architecture, the underlying data sources may not support the same type or the same type may be named different. For example, the MySQL connector will map the Presto REAL
type to a MySQL FLOAT
. Sometimes the types need to be converted themselves where some connectors will convert an unsupported type into a Presto VARCHAR or it may ignore reading the column entirely. The connector implementation will know how to map and coerce the underlying data types to and from the data source with Presto. It’s best to refer to the connector’s documentation to best understand the type mapping behavior.