Chapter 6 – Denormalization

“You know you're old if they have discontinued your blood type.”

- Phyllis Diller

Denormalization

A Greenplum data warehouse is designed around either a Star Schema or a Third Normal Form model, but there can
potentially be thousands of users and often thousands of databases and tables. For performance reasons, after the model is complete and the access patterns and tendencies are understood, denormalizing can be a benefit.

Here are the fundamental denormalization techniques:

Derived Data

Repeating Groups

Pre-Joins

Summary and/or Temporary Tables

Partitioning (Horizontal or Vertical)

Make these choices AFTER completing the Logical Model.

Keep the Logical Model pure and untainted.

Keep the documentation of the physical model current.

I was discussing denormalization with the mastermind of the largest data warehouse in the world. He mentioned that they denormalize, and I asked him why. He said, “We have users doing the same thing thousands of times a day, so it saves us enormous time and money.” That is the real reason you would denormalize.

Derived Data

Response Time Requirements – Derived data can take a long time to calculate while a query is running. If users need speed and their requests are taking too long, then you might consider denormalizing to speed up the request. If there is no need for speed, then be formal and stay with normal.

Access frequency of the request – If one user needs the data occasionally, then calculate on demand, but if there are many users requesting the information daily, then consider denormalizing so many users can be satisfied.

Volatility of the column – If the data changes often, then there is no reason to store the data in another table or temporary table. If the data never changes, and you can run the query one time and store the answer for a long period of time, then you may want to consider denormalizing.

The key factors for deciding whether to calculate or store stand-alone derived data are listed above.
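When the volatility and access-frequency tests point toward storing the derived data, one common approach is to calculate it once and keep the answer in its own table. The sketch below is illustrative only; the table and column names are hypothetical, not from the book.

```sql
-- Hypothetical sketch: store a pre-calculated total once instead of
-- re-computing it in every user query. Names are illustrative.
CREATE TABLE daily_sales_totals AS
SELECT product_id,
       sale_date,
       SUM(daily_sales) AS total_sales
FROM   sales_table
GROUP  BY product_id, sale_date
DISTRIBUTED BY (product_id);
```

Users then read the small pre-computed table, and the expensive aggregation runs once instead of thousands of times a day.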

Repeating Groups

Sales_Table at one of the largest data warehouses in the world showing Daily_Sales.

image

The above table actually saves space because, if the table did not use repeating groups, there would be seven times as many rows and 14 extra bytes of row overhead per row. Instead of 150,000 rows, this table would have over 1 million rows.

The above table is not really about saving space, though. The customer needs to compare how Monday did versus Tuesday, and so on. This table allows users to get the information about a product during a specific week by reading only one row instead of seven. Although this violates first normal form, it can be a great technique for known queries.
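A repeating-group design like the one described above might be sketched as follows; the column names and data types are illustrative assumptions, not the actual Sales_Table definition.

```sql
-- Illustrative repeating-group design: one row per product per week,
-- with one column per day instead of seven daily rows.
CREATE TABLE sales_table (
    product_id  INTEGER,
    week_date   DATE,            -- e.g., the Sunday starting the week
    sun_sales   DECIMAL(10,2),
    mon_sales   DECIMAL(10,2),
    tue_sales   DECIMAL(10,2),
    wed_sales   DECIMAL(10,2),
    thu_sales   DECIMAL(10,2),
    fri_sales   DECIMAL(10,2),
    sat_sales   DECIMAL(10,2)
) DISTRIBUTED BY (product_id);

-- Comparing Monday to Tuesday now reads a single row:
SELECT product_id, mon_sales - tue_sales AS mon_vs_tue
FROM   sales_table
WHERE  week_date = DATE '2024-01-07';
```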

Pre-Joining Tables

image

Pre-joining tables makes sense if users are joining them thousands of times a day.
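A pre-join simply materializes the join result once so repeated queries read a single table. A minimal sketch, with hypothetical table and column names:

```sql
-- Hypothetical pre-join: run the expensive join once and store the
-- result; thousands of daily queries then read one table with no join.
CREATE TABLE customer_orders AS
SELECT c.customer_id,
       c.customer_name,
       o.order_id,
       o.order_total
FROM   customer_table c
JOIN   order_table   o ON o.customer_id = c.customer_id
DISTRIBUTED BY (customer_id);
```

The trade-off is the usual one for denormalization: the pre-joined table must be refreshed when the base tables change.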

Storing Summary Data with a Trigger

image

To keep the Order_Total current in the Order_Table, a TRIGGER adds to or subtracts from the Order_Total whenever a line item is added to or removed from the Order_Line_Item_Table.
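The technique can be sketched in generic PostgreSQL-style SQL as below. This is an illustration of the idea only — trigger support varies by warehouse platform and version, and many sites maintain such totals during the batch load instead. All names are hypothetical.

```sql
-- Generic SQL sketch of a summary-maintaining trigger; names and the
-- line_amount column are illustrative assumptions.
CREATE FUNCTION maintain_order_total() RETURNS trigger AS $$
BEGIN
    IF (TG_OP = 'INSERT') THEN
        UPDATE order_table
        SET    order_total = order_total + NEW.line_amount
        WHERE  order_id = NEW.order_id;
    ELSIF (TG_OP = 'DELETE') THEN
        UPDATE order_table
        SET    order_total = order_total - OLD.line_amount
        WHERE  order_id = OLD.order_id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_total_trg
AFTER INSERT OR DELETE ON order_line_item_table
FOR EACH ROW EXECUTE PROCEDURE maintain_order_total();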

Summary Tables or Data Marts the Old Way

image

Each night a batch job builds the Call_Billing_Monthly_Table, the data mart holding the summary data. Now if a customer calls, the call center employee can quickly discuss the bill.
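A nightly build of that kind of summary table might look like the following sketch; the source table and column names are hypothetical.

```sql
-- Hypothetical nightly rebuild of the summary data mart. The mart is
-- at most one day behind the call detail, which is fine for billing.
TRUNCATE TABLE call_billing_monthly_table;

INSERT INTO call_billing_monthly_table
SELECT customer_id,
       DATE_TRUNC('month', call_date) AS bill_month,
       COUNT(*)                       AS total_calls,
       SUM(call_charge)               AS total_charges
FROM   call_detail_table
GROUP  BY customer_id, DATE_TRUNC('month', call_date);
```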

Horizontal Partitioning the Old Way

image

We have taken the Order_Table with trillions of rows and broken it into 12 separate monthly Order tables to speed up queries on different months. That is the old way.
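The old way can be sketched as twelve physical tables plus a view that glues them back together for queries that span months. Table names are illustrative, and only two of the twelve tables are shown.

```sql
-- The old way, sketched: one physical table per month...
CREATE TABLE order_table_jan (order_id BIGINT, order_date DATE, order_total DECIMAL(12,2));
CREATE TABLE order_table_feb (order_id BIGINT, order_date DATE, order_total DECIMAL(12,2));
-- ... and so on through order_table_dec

-- ...plus a view to reunite them for cross-month queries.
CREATE VIEW order_table_all AS
SELECT * FROM order_table_jan
UNION ALL
SELECT * FROM order_table_feb;
-- ... UNION ALL the remaining months
```

Every month-end the DBA must create a new table and rebuild the view, which is exactly the maintenance burden partitioned tables remove.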

Horizontal Partitioning the New Way

image

Greenplum has partitioned tables that allow you to partition the rows of a single table. Queries that filter on the partitioning column read only the partitions they need, while the table is still loaded, queried, and maintained as one object.
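A sketch of the new way using Greenplum range partitioning by month; the column names and date range are illustrative.

```sql
-- One logical table; Greenplum creates a child partition per month.
CREATE TABLE order_table (
    order_id    BIGINT,
    customer_id INTEGER,
    order_date  DATE,
    order_total DECIMAL(12,2)
)
DISTRIBUTED BY (order_id)
PARTITION BY RANGE (order_date)
(
    START (DATE '2024-01-01') INCLUSIVE
    END   (DATE '2025-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);
```

A query with a WHERE clause on order_date scans only the matching monthly partitions instead of the whole table.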

Vertical Partitioning the Old Way

image

We have taken a table with 300 columns and split it into two tables. The first table contains the eight columns users query the most. The second table has the same Col1 (Primary Key) and the remaining 292 columns that aren’t queried that often. We can still join the two tables if we need more than just the first eight columns. That is the old way of doing it. Turn the page and see the new way.
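The old way of vertical partitioning can be sketched as two tables sharing a key; the names, column counts shown, and data types are illustrative.

```sql
-- The old way, sketched: hot columns in one table, cold in another,
-- joined on the shared key (Col1) only when a query needs both.
CREATE TABLE employee_hot (
    col1  INTEGER,        -- the shared primary key
    col2  VARCHAR(30),
    col3  VARCHAR(30)
    -- ... the eight most-queried columns
);

CREATE TABLE employee_cold (
    col1  INTEGER,        -- same key as employee_hot
    col9  VARCHAR(30)
    -- ... the remaining 292 rarely queried columns
);

-- Join only when a query needs more than the hot columns:
SELECT h.col2, c.col9
FROM   employee_hot h
JOIN   employee_cold c ON c.col1 = h.col1;
```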

Columnar Tables Are the New Vertical Partitioning

image

Greenplum now has Columnar Tables that partition the columns vertically into their own separate containers. This is designed to move fewer blocks into memory when users query only a few columns. Each column has its own separate block (container). Each segment still holds all columns (the entire row), but each column is stored in its own blocks, so when users query the first eight columns only a little of the data moves.
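The new way is a single storage option on the table definition rather than a second table. A sketch with illustrative column names:

```sql
-- A Greenplum column-oriented table: append-optimized storage with
-- orientation=column puts each column in its own blocks on disk.
CREATE TABLE employee_columnar (
    col1 INTEGER,
    col2 VARCHAR(30)
    -- ... the remaining columns of the 300-column table
)
WITH (appendonly = true, orientation = column)
DISTRIBUTED BY (col1);
```

A query selecting only a few columns reads only those columns' blocks, which is the same benefit vertical partitioning delivered, without splitting the table or rewriting queries.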
