Chapter 5 – Physical Database Design

“If the facts don’t fit the theory, change the facts.”

- Albert Einstein

The Four Stages of Modeling for Greenplum

1. Requirement Analysis involves figuring out what information the users need and their processing requirements. The output of this is the Business Information Model (BIM), which shows major entities and their relationships. This is sometimes referred to as the Business Model or the Business Impact Model.

2. Logical Modeling involves defining the tables and their relationships, and this should be independent of a particular database or indexing strategy. Greenplum works well with a Third Normal Form (3NF) model, and the output defines all tables with their columns, the Primary Key/Foreign Key relationships between tables, plus all constraints.

3. Activity Modeling involves working with the users to determine the volume, usage, frequency, and integrity of the data. This process also consists of placing any constraints on domains and entities. The output is called the Extended Logical Data Model (ELDM).

4. Physical Modeling involves using the Activity Model to transform the logical model into a physical model definition that will accelerate queries with the Distribution Keys and indexes that best suit a specific hardware and software configuration.

This module will concentrate a great deal on Physical Modeling.

The Logical Model

This simple logical model shows the Primary Key and Foreign Key relationships between tables.

image

There are five tables above that have a relationship. That means they can be joined together with SQL to produce one report.

The Logical Model can be loaded inside Nexus

The Nexus Query Chameleon takes the ERwin logical model and loads it inside Nexus.
Now, users can see table relationships, click on what they want, and Nexus writes the SQL.

image

We understand that data in a data warehouse has tables, rows, and columns. In the old days these were called files, records, and fields. In logical modeling terms they are called relations, tuples, and attributes. The Nexus uses the logical model to allow users to see tables and all other tables in the relationship and then builds the SQL with each click of the user’s mouse.

First, Second and Third Normal Form

A Primary Key is usually the first column in the table and it is always unique and not null. The other columns in the table relate to the Primary Key.

A Foreign Key is a normal column in the table that is the connector to another table’s Primary Key, and thus they have a join relation. Think of the Employee_Table having a Primary Key of Employee_No and a Foreign key of Dept_No, and this table has a relation to the Department_Table that has a Primary Key of Dept_No.
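The Employee_Table and Department_Table relationship above can be sketched in DDL. This is a minimal sketch; the columns beyond Employee_No and Dept_No are assumptions for illustration, and note that Greenplum accepts foreign key syntax but does not enforce referential integrity.

```sql
-- Parent table: Dept_No is the Primary Key
CREATE TABLE department_table
( dept_no         INTEGER NOT NULL PRIMARY KEY
, department_name VARCHAR(30)
);

-- Child table: Dept_No is a Foreign Key pointing at Department_Table
CREATE TABLE employee_table
( employee_no INTEGER NOT NULL PRIMARY KEY
, dept_no     INTEGER REFERENCES department_table (dept_no)
, last_name   VARCHAR(30)
, salary      DECIMAL(10,2)
);
```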

First Normal Form (1NF) means that columns (attributes) must not repeat within a table. No repeating groups such as Monday_Sales, Tuesday_Sales, etc.

Second Normal Form (2NF) means that all columns relate to the entire Primary Key and not just a portion of the Primary Key. Columns can also still relate to each other, such as Dept_No and Department_Name both being in the table. Tables with a single-column Primary Key are always in Second Normal Form.

Third Normal Form (3NF) means all columns must relate to the Primary Key and not to each other. This is often termed that all columns must relate to the key, the whole key, and nothing but the key so help you Codd.

The process of placing columns in the correct tables is called normalization. There are three major forms of normalization, and they are first, second, and third normal form.

Quiz – Choose that Normalization Technique

Below is a table with rows and columns.
What form of normalization is this example?

image

What form of normalization is the table above?

Answer to Quiz – Choose that Normalization Technique

Below is a table with rows and columns.
What form of normalization is this example?

image

This is second normal form (2NF).

Although all columns relate to the Employee_No, the Department_Name is also a direct relation of the Dept_No. That makes this second normal form (2NF).
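To move this example from 2NF to 3NF, the Department_Name column is removed from the employee table and placed in a department table keyed by Dept_No. A sketch, assuming the table names used elsewhere in this chapter:

```sql
-- Department_Name relates to Dept_No, not Employee_No,
-- so in 3NF it moves to its own table
CREATE TABLE department_table
( dept_no         INTEGER NOT NULL PRIMARY KEY
, department_name VARCHAR(30)
);

-- The employee table keeps Dept_No as a Foreign Key only
ALTER TABLE employee_table DROP COLUMN department_name;
```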

Quiz – What Normalization is it now?

image

Above are tables with rows and columns. What form of normalization is this example?

Answer to Quiz – What Normalization is it now?

image

Both tables are in third normal form (3NF). All columns relate to the Primary Key, the whole key, and nothing but the key. The tables are joined via the Dept_No column that resides in both tables.

The Employee_Table and Department_Table can be joined

image

The Nexus knows the join relationships between tables. Above you can see that the user sees both tables and they have clicked on the columns they want on the report. Turn the page and see the SQL that Nexus has built for them. Tables with a Primary Key Foreign Key relationship can be joined together with SQL.

The Employee_Table and Department_Table Join SQL

image

Tables with a Primary Key Foreign Key relationship can be joined together with the SQL you see above in the SQL tab of Nexus. Tools like Nexus can help users significantly. Having Nexus navigate the table relationships allows users to visually see the tables and their relationships. This allows Nexus to build the SQL.

The Extended Logical Model Template

image

You communicate with the end user community to see how they will access the data and you run queries on the data to understand the data demographics. The Logical Data Model will be the input to the Extended Logical Data Model. The Extended Logical Data Model will be input to the Physical Data Model.

User Access is of Great Importance

The first and most important part of a Greenplum Physical Model is choosing the distribution key of each table and any secondary indexes. The distribution key is the most important decision. If users are not querying a column, there is no reason to make it a distribution key or secondary index. Choosing a great distribution key makes access using that column a single-segment operation, and when you join two tables using their distribution keys, no data movement is involved.
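The distribution key is declared with the DISTRIBUTED BY clause at table creation. A minimal sketch (column list assumed for illustration):

```sql
CREATE TABLE employee_table
( employee_no INTEGER NOT NULL
, dept_no     INTEGER
, last_name   VARCHAR(30)
, salary      DECIMAL(10,2)
) DISTRIBUTED BY (employee_no);

-- When no column makes a good key, round-robin distribution is the fallback:
-- CREATE TABLE ... DISTRIBUTED RANDOMLY;
```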

image

Value Access Frequency - How often will this column be accessed in the SQL WHERE clause?

Join Access Frequency - How often will this column be used as the join column to another table?

Value Access Rows - The number of rows that will be accessed multiplied by the frequency at which the column will be used.

The user access part of the template is important to understand how users will access data.

User Access in Layman’s Terms

image

Getting as many Single Segment Retrieves by using the Distribution Key in the WHERE clause is the essence of design in a Greenplum system. Only one segment is used and only one page is transferred into the segment's memory.
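Using the distribution key with an equality predicate lets Greenplum direct dispatch the query to the single segment that owns the row. A sketch, assuming an Employee_Table distributed by Employee_No:

```sql
-- Equality on the distribution key: only one segment does the work
SELECT *
FROM   employee_table
WHERE  employee_no = 2000;
```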

User Access for Joins in Layman’s Terms

We use templates to pick the best distribution key for a table. Joins are important in this decision.

image

Joins are the most taxing process in Greenplum because all matching rows must be joined on the same segment. Most of the time, Greenplum will redistribute one or both of the tables, or duplicate the smaller table temporarily in spool, to enable the join to happen. A brilliant technique is to find out the join criteria and make those columns the Distribution Key. Then no data has to move because the matching rows are already on the same segment.
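The technique above can be sketched like this: distribute both tables on the join column, and the join becomes a co-located (local) join with no redistribution. Column lists are assumed for illustration.

```sql
-- Both tables distributed on the join column (Dept_No)
CREATE TABLE employee_table
( employee_no INTEGER
, dept_no     INTEGER
, last_name   VARCHAR(30)
) DISTRIBUTED BY (dept_no);

CREATE TABLE department_table
( dept_no         INTEGER
, department_name VARCHAR(30)
) DISTRIBUTED BY (dept_no);

-- Matching rows already live on the same segment, so no data moves
SELECT e.last_name, d.department_name
FROM   employee_table  e
JOIN   department_table d
  ON   e.dept_no = d.dept_no;
```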

The Nexus Shows Users the Table’s Distribution Key

image

Accessing tables and views using the distribution key is vital to a successful Greenplum implementation. The Nexus WHERE tab (in purple) shows users the Distribution Key and Secondary Indexes for all tables. Nexus also shows the underlying indexes when users are accessing views. Users can point and click on the index and Nexus builds the SQL.

Data Demographics Tell Us if the Column is Worthy

image

Distinct Rows – How many distinct values are in this column? Will it spread the data evenly?

Max Rows Null – How many rows in this column are Null?

Max Rows Per Value – What value is the most popular and how many rows does it have?

Typical Rows Per Value – What’s the typical amount of duplicate values for the column?
Looking at the Max Rows Per Value Vs. the Typical Rows Per Value shows us the skew factor.

Change Rating – How often will the column be updated and the value change in this column?
If the column changes values all the time it must be eliminated as a distribution key candidate.

There are distribution keys and secondary indexes. Data demographics give us a picture of the data so we can tell if the column is worthy of a distribution key or a secondary index.

Data Demographics – Distinct Rows

image

Distinct Rows – How many distinct values are in this column? Will it spread the data evenly?

If the values are not unique, the distinct count still gives us an idea whether the column will spread the data evenly if it is chosen as a distribution key. We need to make sure there are more distinct values than segments if the column is to be considered a distribution key candidate.

Above we can see that this table has 20,000 rows and that Emp_No has 20,000 distinct values, so we know it is a unique column. Dept_No has only 5,000 distinct values, so it is still a candidate.

The key to Distinct Rows is to first see if the column is unique. If it is unique we can then consider it for a distribution key or a Unique Secondary Index.
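To compare the distinct value count against the number of segments, you can count the primary segments in the system catalog. A sketch using the gp_segment_configuration catalog table:

```sql
-- Count the primary segments (content >= 0 excludes the master)
SELECT COUNT(*) AS segment_count
FROM   gp_segment_configuration
WHERE  role = 'p'
AND    content >= 0;
```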

Data Demographics – Distinct Rows Query

SELECT COUNT(DISTINCT employee_no) AS Emp_Count
,      COUNT(DISTINCT dept_no)     AS Dept_Count
,      COUNT(DISTINCT last_name)   AS Last_Count
,      COUNT(DISTINCT first_name)  AS First_Count
,      COUNT(DISTINCT salary)      AS Sal_Count
FROM   employee_table ;

image

Above is a query you can run to get the distinct values for each column.

Data Demographics – Max Rows Null

image

The key to Max Rows Null is to get an idea whether the column contains Null values. If there are no Null values, we should specify NOT NULL in the table CREATE statement. If there are Null values, they will all hash to a single segment if the column is chosen as the distribution key, potentially causing skew. If there is more than one Null value, the column cannot be given a UNIQUE constraint or a UNIQUE index.

A few Nulls are not a big problem, but a lot of Null values are a problem as they could cause skew if the column is chosen as the distribution key.
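You can see skew directly by counting rows per segment with Greenplum's hidden gp_segment_id column. A sketch; if one segment's count towers over the rest, the distribution key (or its Null values) is causing a spike.

```sql
-- Rows per segment: an even spread means good distribution
SELECT   gp_segment_id
,        COUNT(*) AS row_count
FROM     employee_table
GROUP BY gp_segment_id
ORDER BY row_count DESC;
```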

Data Demographics – Max Rows Null Query

SELECT COUNT(*) AS Max_Rows_Null
FROM   employee_table
WHERE  dept_no IS NULL ;

 Max_Rows_Null
 1

Above is a query you can run to get the Max Rows Null count for a column; change the column in the WHERE clause to check each column.

Data Demographics – Max Rows Per Value

image

Max Rows Per Value tells us many things. First of all it lets us know if the column is a candidate for a distribution key or a Secondary Index. Emp_No has only one Max Rows Per Value so it is unique. The next thing the Max Rows Per Value tells us is if there will be a spike in our distribution if this column is a distribution key. For example, Last_Name has a Max Rows Per Value of 1000. This means that a Last_Name like ‘Smith’ has 1000 entries. This is a huge spike because if Last_Name is the distribution key then one segment will hold all 1000 Smiths.

The Max Rows per Value should also be compared to the Typical Rows Per Value to see just how skewed the data will be if the column is chosen as the distribution key. Comparing these two values will let you know if there will be a spike on one segment or if the data will generally be distributed fairly evenly.
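Both numbers can be pulled in one pass by aggregating the per-value counts. A sketch against the Last_Name column (substitute any column you are evaluating):

```sql
SELECT MAX(cnt)                  AS max_rows_per_value
,      CAST(AVG(cnt) AS INTEGER) AS avg_rows_per_value
FROM  (SELECT   last_name
       ,        COUNT(*) AS cnt
       FROM     employee_table
       GROUP BY last_name) AS per_value;
```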

Data Demographics – Max Rows Per Value Query

SELECT   dept_no
,        COUNT(*)
FROM     employee_table
GROUP BY dept_no
ORDER BY 2 DESC ;

 Dept_No   Count
 400       3
 200       2
 100       1
 10        1
 300       1
 ?         1

Above is a query you can run to get the Max Rows per Value for the Dept_No column; change the column name to check any other column.

Data Demographics – Typical Rows Per Value

image

Typical Rows Per Value tells us many things. First of all it lets us know if the column is a candidate for a Unique distribution key or a Unique Secondary Index. It can also be compared to the Max Rows Per Value to show how evenly or how skewed the data will be if this column is chosen as the distribution key. It is also a guideline for a Secondary Index. It is important that a Non-Unique Secondary Index is strongly selective. In other words if there are typically a lot of duplicates then it is weakly selective. It is like asking everyone in the room if they breathe.

Notice Last_Name has a Typical Rows per Value of three and a Max Rows per Value of 1000. This means it might not be a good candidate as a distribution key because of the spike of 1000, but it has a low Typical Rows per Value which means it is strongly selective as a Non-Unique Secondary Index (NUSI). Only three rows per Last_Name are useful for a NUSI search.

Typical Rows Per Value Query For Greenplum Systems

WITH Derived_Table (Prod_ID, Counter) AS
(
  SELECT Col1, Counter
  FROM  (SELECT   product_id
         ,        COUNT(*)
         FROM     sales_table
         GROUP BY 1) AS TeraTom (Col1, Counter)
)
SELECT CAST(SUM(Counter) / COUNT(Prod_ID) AS INTEGER)
       AS "Product_ID Typical Rows per Value"
FROM   Derived_Table ;

 Product_ID Typical Rows per Value
 7

Above is a query you can run to get the Typical Rows per Value for a column. The column we chose was the Product_ID column in the Sales_Table. All you need to do is change the table name and the column name and you are all set. This figure is sometimes loosely called the Median Value, although the query actually computes an average over the distinct values.
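If you want the exact median of the rows-per-value counts, newer Greenplum releases (those based on PostgreSQL 9.4 or later) support the PERCENTILE_CONT ordered-set aggregate. A sketch:

```sql
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cnt)
       AS median_rows_per_value
FROM  (SELECT   product_id
       ,        COUNT(*) AS cnt
       FROM     sales_table
       GROUP BY product_id) AS per_value;
```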

SQL to Get the Average Rows Per Value for a Column (Mean)

SELECT 'Product_ID' AS "Column Name"
,      COUNT(*) / COUNT(DISTINCT product_id)
       AS "Avg Rows Per Value"
FROM   sales_table ;

 Column Name   Avg Rows Per Value
 Product_ID    7

Above is a query you can run to get the Average Rows per Value for a column. The column we chose this time was the Product_ID column in the Sales_Table. Just replace the table name and the column name and you can get the Mean value for any column.

Data Demographics – Change Rating

image

Change Rating is a rule of thumb metric that you have to guesstimate sometimes. How often will a column be updated and change its value? If a column changes values a lot, then it will be too disruptive to be a distribution key. A distribution key shouldn’t change very often. A Secondary Index can change a little more, but even that should have a low change rating. Each time a distribution key column changes, the row needs to move to a new segment, and each secondary index subtable needs to be updated. That can be too much overhead.

Notice that each change rating is pretty low so each of these could be a distribution key or a secondary index candidate.

Factors When Choosing Greenplum Indexes

How will Greenplum use the index?

How much space will the index require?

What type of table is it (Append-Only, Heap, Clustered index, Partitioned, Columnar)?

How many rows are in the table?

How many columns are in the table?

What are the table protection types?

Does the table have referential integrity?

What column(s) are most frequently used in the WHERE clause to access rows in the table?

How many distinct values are in each column?

What are the maximum rows per value Vs. the typical rows per value?

Is the table accessed via a full table scan, through column values, or by a join?

How many tables will join to this table on average?

Is the table used for DSS, reports, batch, OLTP, analytics, Ad Hoc queries or a combination?

How many INSERTs will there be and when will they occur?

How many DELETEs will there be and when will they occur?

How many UPDATEs will there be and when will they occur?

How often will an UPDATE cause a column to change values?

How are transactions written (single statement, multi-statement)?

What level and type of locking will a query or transaction require?

How long will a transaction hold locks?

How normalized is the data model?

Above are the factors to consider when choosing indexes in Greenplum.

Distribution Key Data Demographics Candidate Guidelines

image

The distribution key candidate guidelines should consider the Primary Key of the table and any unique columns. A candidate should be a single column with a high number of distinct values and a low maximum of Null values, and it should provide good distribution, meaning its Typical Rows per Value is close to its Max Rows per Value so there are no spikes on the segments.
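Several of the guidelines above can be checked with a single pass over the table. A sketch for the Dept_No column; COUNT(column) skips Nulls, so the difference from COUNT(*) is the Null count.

```sql
SELECT COUNT(*)                  AS total_rows
,      COUNT(DISTINCT dept_no)   AS distinct_values
,      COUNT(*) - COUNT(dept_no) AS null_rows
FROM   employee_table;
```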

Distribution Key Access Considerations

There are three important functions of the distribution key. They are:

1.Data Distribution

2.Fast Data Access

3.Joins

What should be your priority as a Greenplum expert? All three are important, but how should you approach your decision? The next page will reveal the answer.

Of the three important functions listed above which one do you think is the most important consideration? List from first to last which priority should be considered first, second and third.

1)_____________

2)_____________

3)_____________

Check your answers on the next page.

Answer – Three Important Distribution Key Considerations

Joins – The joins of tables are the most taxing work on a Greenplum system, so understanding how to enhance joins can do more for your system than anything else. This is your biggest opportunity! How does Greenplum join five tables together? Greenplum always joins tables two at a time, and the matching rows must be on the same segment. If the rows are not on the same segment, Greenplum must redistribute or duplicate rows to get the matching rows together. The only thing that determines which segment a row physically resides on is the distribution key. A brilliant move is to consider tables that are joined together often and to make the join columns (PK/FK) the distribution key on both tables. This will enhance the joining process to perfection.

Data Access – Using the distribution key in your SQL WHERE clause is lightning fast for data retrieval, but you also have the option of quick access with secondary indexes (especially a unique secondary index). That is why joins should be your first consideration. However, if you have hundreds of users who need split-second response times (called tactical queries) on certain tables, then Data Access is your top opportunity to make that happen.

Data Distribution – Most newbies to Greenplum focus on data distribution as their most important consideration because it is the first and most fundamental thing they learn about the distribution key. If the column you are considering will cause bad data distribution, then you have to take it off the table. Don't start with distribution and let it bias you.

Tera-Tom's priorities are in the order above, but if distribution is poor then start again.

Step 1 is to Pick All Potential Distribution Key Columns

image

Answer to Step 1 – Picking Potential Distribution Key Columns

image

Step 2 is to Pick All Potential Secondary Indexes

image

Place an SI for any column that you consider a possible Secondary Index candidate. SI stands for Secondary Index.

Answer to Step 2 – Picking Potential Secondary Indexes

image

Now you are ready to make some decisions.

Choose the Distribution Key and Secondary Indexes

1) Will this table do a lot of joins, or will users need to access just this table fast? Which will happen more and be the bigger priority?

2) Is the table big or small? If it is small, joins won't be a problem. If it is big, they could be. How often will users need to access just this table in sub-second time?

3) What does the data distribution look like? Are there a lot of Nulls? Will the data skew badly?

4) How often will the data change for this column? Eliminate it as a distribution key candidate if it will change even occasionally, and eliminate it as a Secondary Index candidate if it will change often.

5) A Secondary Index is usually a potential choice for access, but it must be strongly selective. That means there must be many different distinct values.

6) Now it is time to decide on your Distribution Key and Secondary Indexes.

It is time to make a first run at choosing the Distribution Key and any Secondary Indexes.

Step 3 is to Pick Your Indexes

image

It is now time to pick your distribution key and any secondary indexes. Good luck! We will show you our choices on the following page.

Our Index Picks

image

This was a tough call. Because of the large number of joins and the reasonable distribution, we chose Dept_No as our distribution key and Emp_No as a secondary index. Switching these could also have worked, but now our joins will be fast, and all access using the Emp_No column will be able to use a highly selective secondary index.
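Our picks translate into DDL like the sketch below (column list assumed for illustration). Note that in Greenplum a UNIQUE index must include the distribution key, so with Dept_No as the key, the index on Employee_No is created as a non-unique index.

```sql
CREATE TABLE employee_table
( employee_no INTEGER NOT NULL
, dept_no     INTEGER
, last_name   VARCHAR(30)
, salary      DECIMAL(10,2)
) DISTRIBUTED BY (dept_no);        -- fast co-located joins on Dept_No

-- Secondary index for fast single-row access by Employee_No
CREATE INDEX emp_no_idx ON employee_table (employee_no);
```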
