“Beware of the young doctor and the old barber.”
--Benjamin Franklin
Above, we are converting all of the tables in a Teradata database to Redshift table structures. We went to our Teradata system, right-clicked on the database SQL_Class, and chose "Convert Table Structures". We selected all of the tables and hit the blue arrow. We then chose to convert to Redshift. Watch in amazement what happens next!
All 20 Teradata tables have now been converted to Redshift. Just cut and paste them into your Redshift system, and you have converted the tables.
1. Choose the best sort key
2. Choose a great distribution key
3. Consider defining primary key and foreign key constraints
4. Use the smallest possible column size
5. Use date/time data types for date columns
6. Specify redundant predicates on the sort column
“I have found the best way to give advice to your children is to find out what they want and then advise them to do it.”
--Harry S. Truman
As you design your database, there are important decisions you must make that will heavily influence overall query performance. These design choices also determine how data is physically stored, which in turn affects query performance by reducing the number of I/O operations and minimizing the memory required to process certain queries. Harry S. Truman was right: if you want your Redshift system to run brilliantly, take advice from your users, then use best practices to deliver what they asked for.
When you give an Amazon Redshift table a sort key, it stores your data on disk in sorted order.
The sort order is used by the optimizer to determine optimal query plans.
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
If you do range filtering or equality filtering on one column, specify that column as the sort key.
If you frequently join a table, specify the join column as both the sort key and the distribution key.
Data sorted correctly helps eliminate unneeded blocks. This is because Redshift has metadata on each block showing column min and max values.
When you give an Amazon Redshift table a sort key, it stores your data on disk in sorted order. The sort order is used by the optimizer to determine optimal query plans. If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. If you frequently join a table, specify the join column as both the sort key and the distribution key. This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
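That advice can be sketched as DDL. Here is a minimal, hypothetical example (the Orders table and its columns are our own illustration, not from the figure) where Customer_Number is the most frequent join column, so it serves as both the distribution key and the sort key:

```sql
-- Hypothetical table: Customer_Number is both the distribution key
-- and the sort key because it is the most frequent join column.
-- This lets the optimizer choose a sort merge join and skip the sort phase.
CREATE TABLE Orders
( Order_Number    INTEGER       NOT NULL,
  Customer_Number INTEGER       NOT NULL,
  Order_Date      DATE,
  Order_Total     DECIMAL(10,2)
)
DISTKEY (Customer_Number)
SORTKEY (Customer_Number);
```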
Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max values to rapidly skip over large numbers of blocks during table scans. Where most databases use indexes to determine where data is, Redshift uses the block's metadata to determine where data is NOT!
Our query above is looking for data WHERE Order_Total < 300. The metadata shows this block might contain qualifying rows, and therefore it will be moved into memory for processing. Each slice has metadata for each of the blocks it owns.
There are three basic reasons to use the sortkey keyword when creating a table. 1) If recent data is queried most frequently, specify the timestamp or date column as the leading column for the sort key. 2) If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. 3) If you frequently join a (dimension) table, specify the join column as the sort key. Above, you can see we have made our sortkey the Order_Date column. Look how the data is sorted!
When data is sorted on a strategic column, it improves aggregations (GROUP BY and ORDER BY operations), window functions (PARTITION BY and ORDER BY operations), and even serves as a means of optimizing compression. But as new rows are incrementally loaded, they reside temporarily in a separate, unsorted region on disk. In order to maintain a fully sorted table, you need to run the VACUUM command at regular intervals. You will also need to run ANALYZE to keep the table's statistics current.
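The maintenance steps just described look like this in practice (Order_Table is our hypothetical table from this section's sort key example):

```sql
-- Re-sort the table and reclaim space left by deleted rows,
-- then refresh the statistics the query planner relies on.
VACUUM Order_Table;
ANALYZE Order_Table;
```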
Good data distribution has two goals:
1. To distribute data evenly among the nodes and slices in a cluster.
2. To collocate data for joins and aggregations.
Uneven distribution, or data skew, forces some nodes to do more work than others, which slows down the entire process. With parallel processing, a query is only as fast as the slowest node. Even distribution is a key concept because each node processes the information it owns simultaneously with its node peers.
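One way to check whether your tables suffer from skew is the SVV_TABLE_INFO system view. A sketch (the exact columns you care about may vary by use case):

```sql
-- skew_rows is the ratio of rows on the slice with the most rows
-- to rows on the slice with the fewest; values near 1.0 mean even
-- distribution, and large values mean skew.
SELECT "table", diststyle, skew_rows
FROM   svv_table_info
ORDER  BY skew_rows DESC;
```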
When rows that participate in joins or aggregations are located on different nodes, more data has to be moved among nodes. This is because Amazon Redshift must ensure that two rows being joined are on the same node in the same memory. If this is not the case, then Redshift will either copy the smaller table to all nodes temporarily or redistribute one or both tables.
The entire row of a table is on a slice, but each column in the row is in a separate container (block). A Unique Distribution Key spreads the rows of a table evenly across the slices. A good Distribution Key is the key to good distribution!
Notice that both tables are distributed on Dept_No. When these two tables are joined WHERE Dept_No = Dept_No, the rows with matching department numbers are on the same Slice. This is called Co-Location. This makes joins efficient and fast.
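A DDL sketch of the co-location scenario above. The column lists are our own hypothetical simplification; the point is that both tables name Dept_No as the distribution key:

```sql
-- Both tables distribute on Dept_No, so rows with matching
-- department numbers land on the same slice (co-location).
CREATE TABLE Employee_Table
( Employee_No INTEGER NOT NULL,
  Dept_No     INTEGER,
  Last_Name   VARCHAR(20)
)
DISTKEY (Dept_No);

CREATE TABLE Department_Table
( Dept_No         INTEGER NOT NULL,
  Department_Name VARCHAR(30)
)
DISTKEY (Dept_No);
```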
Notice that the Department_Table has only four rows. Those four rows are copied to every slice. This is DISTSTYLE ALL distribution. Now, the Department_Table can be joined to the Employee_Table with a guarantee that matching rows are co-located. They are co-located because the smaller table has copied ALL of its rows to each slice. When two joining tables consist of one large table (fact table) and one small table (dimension table), use the ALL keyword to distribute the smaller table.
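In DDL, distributing the small dimension table to every node looks like this (column list is our hypothetical sketch):

```sql
-- DISTSTYLE ALL copies the full table to every node,
-- guaranteeing co-location for joins against any fact table.
CREATE TABLE Department_Table
( Dept_No         INTEGER NOT NULL,
  Department_Name VARCHAR(30)
)
DISTSTYLE ALL;
```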
1. Define primary key and foreign key constraints between tables wherever appropriate.
2. Primary key and foreign key constraints are informational only.
3. Amazon Redshift does not enforce unique, primary key, and foreign key constraints.
4. The query planner uses these keys in certain statistical computations, to infer uniqueness and referential relationships that affect subquery decorrelation techniques, to order large numbers of joins, and to eliminate redundant joins.
5. Amazon Redshift does enforce NOT NULL column constraints.
Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints. Your application is responsible for ensuring uniqueness and managing the DML operations. The query planner will use primary and foreign keys in certain statistical computations, to infer uniqueness and referential relationships that affect subquery decorrelation techniques, to order large numbers of joins, and to eliminate redundant joins. The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand, you should always declare primary and foreign keys and uniqueness constraints when you know that they are valid.
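Declaring the constraints is plain DDL. A hypothetical sketch (remember, Redshift records these declarations for the planner but will not reject duplicate or orphaned rows):

```sql
-- Informational only: Redshift will not enforce these,
-- but the query planner uses them to optimize joins.
CREATE TABLE Customer_Table
( Customer_Number INTEGER NOT NULL,
  Customer_Name   VARCHAR(30),
  PRIMARY KEY (Customer_Number)
);

CREATE TABLE Order_Table
( Order_Number    INTEGER NOT NULL,
  Customer_Number INTEGER NOT NULL,
  PRIMARY KEY (Order_Number),
  FOREIGN KEY (Customer_Number) REFERENCES Customer_Table (Customer_Number)
);
```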
The query planner uses referential integrity in certain
situations, to infer uniqueness and for referential
relationships that affect subquery techniques, to order large
numbers of joins, and to eliminate redundant joins.
Amazon Redshift does not enforce primary key and foreign key constraints. The only reason to apply them is so the query optimizer can generate a better query plan.
You will improve query performance by
reducing columns to the minimum possible size.
Table size is not impacted, but query processing
will be if the column is being processed in a
temporary table to gather intermediate results.
Amazon Redshift compresses column data very effectively, so creating columns much larger than necessary has minimal impact on the size of data tables. It is in the processing of queries that the size can hurt you. This is because during processing for complex queries, intermediate query results might need to be stored in temporary tables. Because temporary tables are not compressed, unnecessarily large columns consume excessive memory and temporary disk space, which can affect query performance. Don't go overboard here! Don't get columns so small that they can't contain your largest values!
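A quick sketch of right-sizing in practice (a hypothetical Address_Table of our own invention):

```sql
-- Columns sized to their real contents, not "just in case":
CREATE TABLE Address_Table
( Customer_Number INTEGER,      -- INTEGER, not BIGINT, if values fit
  State_Code      CHAR(2),      -- a fixed 2-character code, not VARCHAR(100)
  City_Name       VARCHAR(30)   -- sized to the longest real value
);
```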
Use the DATE or TIMESTAMP data type rather than a character type when storing date/time information. Amazon Redshift stores DATE and TIMESTAMP data more efficiently than CHAR or VARCHAR, which results in better query performance. Let Amazon Redshift handle the DATE or TIMESTAMP conversions internally instead of trying to do so in your applications. Most of the time, users resort to CHAR or VARCHAR during the ETL process of moving data. There is no need to do that, because Redshift handles any conversions necessary.
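For example (a hypothetical Order_Table sketch), declare the column as DATE and let Redshift convert the incoming string literal:

```sql
-- DATE, not VARCHAR(10), for date values:
CREATE TABLE Order_Table
( Order_Number INTEGER,
  Order_Date   DATE
);

-- Redshift converts the string literal to DATE on insert:
INSERT INTO Order_Table VALUES (1, '2014-01-01');
```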
We want to join Table1 with Table2. Both have a distribution key on
Customer_Number and both have a sort key on Order_Date.
SELECT T1.*, T2.Order_Number
FROM Table1 as T1
INNER JOIN
Table2 as T2
ON T1.Customer_Number = T2.Customer_Number
WHERE T1.Order_Date > '1/1/2014';
You should consider using a predicate on the leading sort column of the fact table, or the largest table, in a join. You can also add predicates to filter other tables that participate in the join, even when those predicates are redundant. These predicates refer to WHERE or AND clauses. Because Redshift keeps the max and min value for each column per block, you can get better performance when you choose a good sortkey: Redshift always checks the min and max values to see if a block should even be read, and skips blocks that cannot qualify. Adding a redundant AND clause on the second table's sort column means the entire second table may not have to be read.
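The redundant-predicate version of the query above might look like this (same hypothetical Table1 and Table2, both sorted on Order_Date):

```sql
SELECT T1.*, T2.Order_Number
FROM   Table1 AS T1
INNER JOIN
       Table2 AS T2
   ON  T1.Customer_Number = T2.Customer_Number
WHERE  T1.Order_Date > '1/1/2014'
  AND  T2.Order_Date > '1/1/2014';  -- redundant, but lets Redshift use
                                    -- block min/max to skip Table2 blocks too
```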
The above query aborts because it ran longer than the 10 milliseconds specified. The statement_timeout setting aborts any statement that runs longer than the number of milliseconds specified. If the system setting WLM timeout (max_execution_time) is also specified as part of a WLM configuration, the lower of statement_timeout and max_execution_time is used.
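The setting is applied per session. For example, a more realistic limit of 10 seconds (10,000 milliseconds; the value here is our own illustration):

```sql
-- Abort any statement in this session that runs longer than 10 seconds.
-- A value of 0 turns the limit off.
SET statement_timeout TO 10000;
```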