Chapter 3 – Hashing and Data Distribution

“Speak in a moment of anger and you’ll deliver the greatest speech you’ll ever regret.”

- Anonymous

Distribution Keys Hashed on Unique Values Spread Evenly

image

When you have a Distribution Key by Hash and the values in that column are unique, the data will spread perfectly evenly in your Greenplum system. This is because the data values in the Hash Key use a hashing formula. The formula is designed to be consistent and spread the data evenly.

Distribution Keys with Non-Unique Values Spread Unevenly

image

When you have a Distribution Key by Hash and the values in that column are non-unique, the data will not spread evenly in your Greenplum system. This is because the data values in the Hash Key use a hashing formula. The formula is designed to be consistent, so all like values go to the same segment. Above, Dept_No is the Distribution Key.

Best Practices for Choosing a Distribution Key

The Distribution Key has three purposes:

1.Distributes the data among the segments

2.Fastest way to retrieve the data

3.Extremely important for joins

A Distribution Key is hashed and the data is distributed horizontally across the segments. All like values go to the same segment and cause skew so choose:

A column with high cardinality that does not have "skewed" data. The worst possible column would be "Gender" where there are only two values (F,M). These will be distributed to only two segments. The column should spread the data evenly or reasonably evenly.

A column that does not allow NULL, or where NULL values are as rare. Null values will hash to the same distribution and potentially cause skew.

An Integer column when possible. This is a small data type and usually has many different values, thus providing good distribution.

A column that is the join key with another table. Two rows being joined must be on the same segment, so two tables joining that have the same Join Key and Distribution Key will always be segment Local, thus providing quick joins.

Avoid using datetime columns. The only exception is when the time is accurate to at least the minute. Otherwise, all rows for a certain date are on one segment.

Above, are the best practices for choosing a good distribution key.

The Hash Map Determines which Segment owns the Row

image

When you have a Distribution Key by Hash the values of the Distribution Key are run through a Hash Formula. Then, a map is used to distribute the row to the correct segment. The formula is designed to be consistent so that all like values go to the same segment. Above, we used the Emp_No column as the Distribution Key. Notice that the value for this row's Emp_No is 1001. The hash formula has been run against 1001 and it produced an answer of 13. If the formula was run again on 1001, it would always produce a 13 answer. After going to bucket 13 in the map, we can derive that this row will go to segment 1.

The Hash Map Determines which Node will own the Row

image

When you have a Distribution Key by Hash, every row is run through the hash formula and then sent to the proper segment. Emp_no 1001 hashes to a value of 13 and then the 13th bucket of the Segment map says to go to segment 1.

The Hash Map Determines which Node will own the Row

image

When you have a Distribution Key by Hash, every row is run through the hash formula and then sent to the proper segment. Emp_no 1002 hashes to a value of 5 and then the 5th bucket of the Segment map says to go to segment 2.

The Hash Map Determines which Node will own the Row

image

When you have a Distribution Key by Hash, every row is run through the hash formula and then sent to the proper segment. Emp_no 1003 hashes to a value of 9 and then the 9th bucket of the Segment map says to go to segment 3.

Hash Map Determines which Node will own the Row

image

When you have a Distribution Key by Hash, every row is run through the hash formula and then sent to the proper segment. Emp_no 1009 hashes to a value of 18 and then the 18th bucket of the hash map says to go to segment 3. Every row above was run through the hash formula and then placed on the segment that the Segment map designated. This is how Hash Distributed tables are distributed among the segments. Above, all the rows in the table has been distributed evenly. This is because the values were unique. This will always result in even distribution.

A Review of the Hashing Process

image

Hash the Distribution Key value for every row.

image

The output of the Hash Formula is a row hash.

image

Take the Row Hash and find its corresponding bucket in the
Segment Map.

image

Send the row to the segment listed in the Segment Map Bucket.

image

Each row is hashed with the hash formula and the answer is called a row hash value. That value looks at the bucket value in the Segment map and in that bucket resides a segment number. The row is then sent to the corresponding segment.

Non-Unique Distribution Keys have Skewed Data

Imagine if we made Last_Name the Distribution Key for a table. Here is an example of how it would distribute. Notice all duplicates have the same Row Hash.

image

The hash formula is consistent so like values get the same row hash and therefore go to the same segment. Choosing the right Distribution Key is of the utmost importance. Choosing the wrong Distribution Key will result in skewed data.

Non-Unique Distribution Keys have Skewed Data

Imagine if we made Last_Name the Distribution Key for a table. Here is an example of how it would distribute. Notice all duplicates have the same Row Hash.

image

The hash formula is consistent so like values get the same row hash and therefore go to the same segment. Choosing the right Distribution Key is of the utmost importance. Choosing the wrong Distribution Key will result in skewed data.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset