Chapter 10 - Compression

“If you aren't compressing your Teradata tables, you are simply leaving money on the table.”

- Tera-Tom Coffing

Important Information about Compression

The COMPRESS clause works in two different ways:

  • When issued by itself (without a value or values), COMPRESS causes all NULL values for that column to be compressed to zero space. The Presence Bit for Nulls is used so no cost.
  • When issued with value (e.g., COMPRESS ‘Ohio’, ‘Iowa’, ‘Georgia’), the COMPRESS clause will compress every occurrence of Ohio, Iowa, and Georgia in that column to zero space as well as cause every NULL value to be compressed. You can compress up to 255 values per column, and the cost is between 1 and 8 bits per row of additional cost.

You CANNOT compress the following:

• Components of the primary index

• Identity columns

• Volatile tables

• Derived tables

• Columns defined with a UDT, Period, Geospatial, BLOB, or CLOB data type.

• Both the referencing and referenced columns with Standard RI (Referential Integrity) cannot be compressed, but the Batch and Soft RI columns can be compressed.

You can compress a value as large as 255 characters, but in Teradata 13.10, that increases to 510 characters. Before Teradata 13.10, only fixed width columns could be compressed.

Presence Bytes are also used for Compression

Presence Bytes are also used in Multi-Value Compression. Here are the basics of how this compression works. Teradata can place the top 255 most frequent values (per column) in the Table Header, and then up to 8 presence bits can be placed in front of each row of the table (per column) to determine the value inside the row. But that value no longer exists in the row.

images

The above pictorial shows (logically) a normal table and then that same table compressed.

Why One Byte (8 bits) can represent up to 255 Values

Below, is one byte (8 bits). The first bit represents the number 1. The second bit represents the number 2. The third bit represents a 4. These bits continue to double in value. If a bit is turned on with a 1, then it is counted. But if it is a zero, it is not counted. This is the binary system of zeros and ones. One Byte (8 bits) can be used to represent any number from 0 to 255.

images

Answers to One Byte (8 bits) can represent up to 255 Values

Below is one byte (8 bits). The first bit represents the number 1. The second bit represents the number 2. The third bit represents a 4. These bits continue to double in value. If a bit is turned on with a 1, then it is counted. But if it is a zero, it is not counted. This is the binary system of zeros and ones. These 8 bits can be used to represent any number from 0 to 255.

images

Now that you Understand that 8 Bits can Represent 0 – 255

Notice in our compressed example, (on the right) Teradata places one byte (8 bits) in front of the row. Then, Teradata places the State_Cd “California” inside the Table Header. Then, each time the actual row has a State_Cd equal to “California” they set the bits = 1. This is because “California” is the first value for the column compressed in the Table Header. If the presence bits are set to 0, then Teradata knows the value was not compressed and still resides in the row.

images

A Compression Example that Compresses Two Values

Notice in our compressed example (on the right), Teradata places one byte (8 bits) in front of the row. Then, Teradata places the State_Cd of “California” and “Oregon” inside the Table Header. Then, each time the actual row has a State_Cd equal to “California” they set the bits = 1. This is because “California” is the first value for the column compressed in the Table Header. If the presence bits = 2, they know it is Oregon. If equal to zero, they know the value resides in the row.

images

A Compression Example that Compresses Three Values

Notice in our compressed example (on the right), Teradata places one byte (8 bits) in front of the row. Then, Teradata places the State_Cd of “California” and “Oregon” inside the Table Header. Then, each time the actual row has a State_Cd equal to “California” they set the bits = 1. This is because “California” is the first value for the column compressed in the Table Header. If the presence bits = 2, they know it is Oregon. If equal to zero, they know the value resides in the row.

images

Quiz – Name that Compression Value

In the slide below on the right, you get the opportunity to name the value inside the row. Use your binary skills, the Table Header, and the Presence bits to put in the correct value! Good luck.

images

The Next Important Concept in Compression

The next two concepts are vital to getting a full understanding of compression. I will take these concepts in simple steps. The first is easy, but the second is amazing. Here is the first. Teradata can compress many columns in a table! Check out the Table Header and the column values!

images

Quiz – Can you Fill in the Compression Values?

Use your binary skills, the Table Header, and both the Provider bits and the State_Cd bits to show your intelligence. If you fill this out correctly, you only have one more concept to understand!

images

Answer – Can you Fill in the Compression Values?

Use your binary skills, the Table Header, and both the Provider bits and the State_Cd bits to show your intelligence. If you fill this out correctly, you only have one more concept to understand!

images

The Last Major Concept in Compression

You have made it to the last and most amazing concept of Teradata Multi-Value compression. If you have understood the exercises so far, you are fully ready to understand this concept. I will hammer this one home with additional exercises because you are almost at genius level.

Teradata knows from the Table Header that State_Cd has only three compressed values and the same goes for Provider. A number ranging from 0-3 in binary can be represented with 2 bits.

With one Presence Byte (8 presence bits), the State_Cd uses the first 2 bits, the Provider uses the next 2 bits, and we still have four bits left over that are not even used in a single presence byte!

images

Quiz – Using One Presence Byte for Multiple Columns

Can you fill in the State_Cd and the Provider using just one Presence Byte, the Table Header, and your ever- growing binary skills? I know you can.

images

Answer – Using One Presence Byte for Multiple Columns

If you were able to complete this assignment, welcome to genius level! What took you so long?

images

Quiz – How Many Presence Bytes are Needed?

Your mission is to look at the Table Header and list how many presence bytes will be needed.

images

Answer – How Many Presence Bytes are Needed?

Your mission is to look at the Table Header and list how many presence bytes will be needed.

images

There will be two Presence Bytes needed.         00000000 00000000

Why 2 Presence Bytes? We need to cover 15 state values, so that takes 4 bits. We need to cover 15 providers, so that takes another 4 bits, and we need to handle all 255 services and that takes 8 bits. So, in total, we need 2 Presence Bytes (16 Presence Bits).

Advanced Quiz – Fill in the Presence Bits?

images

The time has come to put the Nullable columns and the Compression together because that is how it works. For every Nullable column, a Presence Bit is defined and set to a 1 if that row has a Null value in that column for that particular row. A Presence Bit(s) are also defined for any column with Null Values. Your job is to set the right bits for both Nulls and Compression.

Answer to Advanced Quiz – Fill in the Presence Bits?

images

The time has come to put the Nullable columns and the Compression together because that is how it works. For every Nullable column, a Presence Bit is defined and set to a 1 if that row has a Null value in that column for that particular row. A Presence Bit(s) are also defined for any column with Null Values. Your job is to set the right bits for both Nulls and Compression.

The Cost Vs. the Savings

OK! There are still a few things to learn, but you are on the Tera-Tom team because you proved yourself. I am proud you have arrived. Now comes another very important concept. This is your first introduction to Cost Vs. Savings. This concept will also be hammered home with advanced exercises later, but for now just understand exactly how we saved 312 Bytes!

images

The Cost List of Compression

Below is the cost list for compression. At a maximum, you can compress up to 255 values per column. Each value is placed in the table header, and then a maximum of up to 8 bits can be placed in front of each row of the table (per column). If you have two columns being compressed, then up to 16 bits can be placed in front of each row, etc.

     1 Value       1 bit
     3 Values       2 bits
     7 Values       3 bits
  15 Values       4 bits
  31 Values       5 bits
  63 Values       6 bits
127 Values       7 bits
255 Values       8 bits

images

The above pictorial uses only three bits to represent up to seven values.

A Deeper Dive Into NULL Values

images

You know that a Presence Bit is reserved for every column that potentially could be Null. Did you know that, although Teradata sets the Presence Bit to a 1 for a Null, it still space for that data type? When you compress a column with Multi-Value compression, you can compress up to 255 values plus Nulls. When you compress a column and a Null Value is present, the Null presence bit is set to 1 and Teradata removes the additional space in the column. If you don't compress the column, the Null values still take up space. In our example above, the presence bits are set to 1 for a row's column with Nulls, but the space is still reserved.

Quiz - How Much Space Did We Just Save?

images

How much space did we just save inside the row for the four rows in this table, and what was the cost? ___________________    ____________________

Did we save 20 bytes on each Null for Last_Name? (Y/N) _____

What color is the presence bit for the compressed Nulls in Last_Name? __________

Answer - How Much Space Did We Just Save?

images

How much space did we just save inside the row for the four rows in this table, and what was the cost? 80 Bytes Savings and No Cost (The first presence byte is always there and it had room).

Did we save 20 bytes on each Null for Last_Name? (Y/N) Y

What color is the presence bit for the compressed Nulls in Last_Name? Blue

Advanced Quiz - How Much Space Did We Just Save?

images

What was the Savings? ______________

What was the Cost? ______________

How much did you Save (Savings – Cost)? _____________

How much would you have saved if you only compressed 31 values and it compressed 500 rows? _____________

Advanced Quiz - How Much Space Did We Just Save?

images

What was the Savings? 40,000 Bytes

What was the Cost? 30,000 Bytes (an extra presence Byte was needed)

How much did you Save (Savings – Cost)? 10,000 Bytes (40,000 savings – 30,000 cost) = 10,000

How much would you have saved if you only compressed 31 values and it compressed 500 rows? 30,000 Bytes (No additional presence byte would have been needed so no cost) 1500*20=30000

Using the DBC Tables in a Compression Experiment

We would like to test the Teradata Multi-Value Compression feature and use the DBC tables to compare the sizes of a table before and after Compression takes place.

Compression takes the top 255 values (plus Nulls) and removes those values from the actual row and instead places those values once in the table header (on each AMP).

Each time a table is built, a Table Header is built on every AMP. Then, when rows are loaded those rows are loaded into data blocks. When an AMP reads or writes data to a table, it places the Table Header and the blocks inside FSG cache for processing.

The gain for compression is that the column no longer stores the top 255 values and therefore the table is smaller, faster to process, and there is no overhead in decompressing the table because the blocks and Table Header reside in FSG Cache.

The cost is adding the column's top 255 values to the Table Header (minor cost), but the real cost is that each row adds up to 8 bits to the front of the row in order to identify the value associated with the row based on that value in the header.

If only a small percentage of the columns are part of the 255 values, the cost is high. This test will allow you to understand compression and the cost vs. the gain.

A Compression Test

I decided to run a series of tests on our SmartCompress feature in our Nexus Query Chameleon product. I highly suggest you use a proven tool like SmartCompress for your Teradata compression needs. Here is how the first couple of experiments worked.

Eight tables were created in order to test the compression of a CHAR (20) column. All eight tables looked exactly the same. They each had a Primary Index Column which was an integer and the Primary Index. Then, each had a single CHAR (20) column in which we would load data and then compress. Here is the DDL of the table.

CREATE Multiset Table NonCompressed.Char20_001val_1K_rows

     (  AChar               INTEGER

         ,Last_Name       Char(20)

      )primary Index  (Achar);

All eight tables were exactly the same except the name. Our concept is to load 1,000 rows into each of the eight tables. We would load 1 value into table 1, 3 values into the next table, 7 values into the next, 15, then 31, then 63, 127, and finally 255 values in the last table.

Once the tables are loaded via a stored procedure, we then could copy those eight tables into another database called Compressed. We would then compress the tables in the Compressed database and compare the savings with their counterparts in the Noncompressed database.

A Compression Test

CREATE PROCEDURE NonCompressed.Insert_Char_Tables_3Values ()

BEGIN

--Here is an example of the loop for the 3 Values inserted into our 1,000 row table

Declare Counter3 INTEGER default 0;
Declare ValCounter3 INTEGER default 0;
Declare PICounter3 INTEGER default 0;

Val3:LOOP
  Set Counter3     = Counter3 + 1;
  Set PICounter3  = PICounter3 + 1;
  Set ValCounter3 = ValCounter3 + 1;
   IF ValCounter3 = 4 Then Set ValCounter3 = 1;
     END IF;
    IF Counter3 = 1001 THEN LEAVE Val3;
     END IF;
         INSERT INTO NonCompressed.Char20_003val_1K_rows VALUES (PICounter3, ValCounter3);
   END LOOP Val3;
END;

This type of logic was used to load all eight tables with 1,000 rows so we could have eight tables with each table being able to compress either 1, 3, 7, 15, 31, 63, 127, or 255 values.

We then moved all Eight Tables to another Database

CREATE Multiset Table Compressed.Char20_003val_1K_rows
AS NonCompressed.Char20_003val_1k_rows
WITH DATA;

images

Once we had our eight tables created in the NonCompressed database, we moved them all with the copy statement above to the Compressed Database. Now, all eight tables in the Compressed database were an exact copy if the NonCompressed Database. Then, we compressed the eight tables in our Compressed database with the Nexus’ SmartCompress tool.

Compression Reports with Nexus and SmartCompress

images

Above, shows the Nexus and SmartCompress reports. You can compress Teradata tables and then view the reports to examine the savings. This compress was very successful and a good choice.

We Then Created Two Global Temporary Tables

CREATE Global Temporary TABLE Compressed.PermSpaceCompare, No Log

( TableName       Varchar(32)

,Currentperm      Decimal(18,2)

,PeakPerm          Decimal(18,2)

)

ONCOMMIT Delete ROWS;

 

 

 

CREATE Global Temporary TABLE NonCompressed.PermSpaceCompare, No Log

(TableName       Varchar(32)

,Currentperm      Decimal(18,2)

,PeakPerm          Decimal(18,2)

)

ONCOMMIT Delete ROWS;

We created two Global Temporary Tables so we could run next run a macro that would populate both with space information, and then we could compare and contrast the sizes.

We Then Created and Executed our Macro

REPLACE Macro Compressed.SpaceCompare AS (

INSERT INTO NonCompressed.PermSpaceCompare

SELECT TableName ,Sum(CurrentPerm) as CurrPERM ,Sum(PeakPerm)  as PeakPERM

FROM DBC.Tablesize

WHERE DatabaseName = ‘NonCompressed’ And TableName like ‘%val%’ GROUP BY 1 ;

INSERT INTO Compressed.PermSpaceCompare

SELECT TableName ,Sum(CurrentPerm) as CurrPERM ,Sum(PeakPerm)  as PeakPERMFROM

DBC.Tablesize WHERE DatabaseName = ‘Compressed’

And TableName like ‘%val%’ GROUP BY 1 ;

With TeraTom(TableN, “% Saved”) as

(SELECT CCC.TableName ,ABS(((CCC.CurrentPerm-NNN.CurrentPerm)/nullifzero (NNN.CurrentPerm)) * 100)

FROM               Compressed.PermSpaceCompare as CCC

INNER JOIN    NonCompressed.PermSpaceCompare as NNN

ON                    CCC.TableName = NNN.TableName)

SELECT C.TableName AS “Table//Name”,N.CurrentPerm AS “Non-Compressed// Perm”

          ,C.CurrentPerm AS “Compressed// Perm”,“% Saved” AS “Percentage// Saved”

FROM Compressed.PermSpaceCompare as C

INNER JOIN

             NonCompressed.PermSpaceCompare as N

ON       C.TableName = N.TableName

INNER JOIN

             TeraTom

ON C.TableName = TableN

ORDER BY 1;);

Report Comparing Compressed and NonCompressed Tables

images

The test just proved that compressing less values reduces the cost and can save more space.

We were now able to compare our Compressed Vs. our NonCompressed tables. Remember that all eight tables had 1,000 rows and all 1,000 rows were compressed in each. This experiment shows that decisions need to be made about whether or not to compress all 255 values plus null or whether you should compress less and save more.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset