Chapter 4 - Compression

“Speak in a moment of anger and you’ll deliver the greatest speech you’ll ever regret.”

- Anonymous

Compression Types

[Table: supported compression encodings and the data types that support each encoding]

The table above identifies the supported compression encodings and the data types each encoding supports. Compression is a column-level operation that reduces the size of data when it is stored. It conserves storage space and reduces the amount of data read from storage, which lowers disk I/O and improves query performance. By default, Amazon Redshift stores data in its raw, uncompressed format, but you can apply a compression type, or encoding, to the columns of a table manually when the table is created, or you can use the COPY command to analyze the data and apply compression automatically. Either way, it is important to compress your data.
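
For example, here is a minimal sketch of specifying encodings manually with the ENCODE keyword at table creation; the table and column names are hypothetical:

CREATE TABLE event_log
(
    event_id    INTEGER       ENCODE raw,       -- stored uncompressed (the default)
    event_type  CHAR(16)      ENCODE bytedict,  -- small set of repeating values
    event_note  VARCHAR(256)  ENCODE lzo        -- free-form text
);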

Byte Dictionary Compression


Byte dictionary encoding uses a separate dictionary of unique values for each block of column values on disk. Remember, each Amazon Redshift disk block occupies 1 MB. The dictionary contains up to 256 one-byte values that are stored as indexes to the original data values. If more than 256 unique values are stored in a single block, the extra values are written to the block in raw, uncompressed form. The process repeats for each disk block. This encoding is very effective when a column contains a limited number of unique values, and it works best when a column holds fewer than 256 unique values.
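
As an illustrative sketch (the table and columns are hypothetical), a country-name column with roughly 200 distinct values fits comfortably within the 256-entry dictionary:

CREATE TABLE customers
(
    customer_id  INTEGER,
    country      CHAR(30) ENCODE bytedict  -- few unique values, so nearly every value is stored as a 1-byte index
);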

Delta Encoding


Delta encoding is very useful for date and time columns. It compresses data by recording the difference between values that follow each other in the column. These differences are recorded in a separate dictionary for each block of column values on disk. For example, if a column contains 10 integers in sequence from 1 to 10, the first is stored as a 4-byte integer (plus a 1-byte flag), and the next nine are each stored as a single byte with the value 1, indicating that each is one greater than the previous value. Delta encoding comes in two variations: DELTA records the differences as 1-byte values (8-bit integers), and DELTA32K records the differences as 2-byte values (16-bit integers).
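
Here is a sketch of the two variations on a hypothetical table, assuming ticket numbers that increase in small steps and sale dates that advance by at most a few weeks between adjacent rows:

CREATE TABLE ticket_sales
(
    ticket_no  INTEGER ENCODE delta,    -- differences between adjacent values fit in 1 byte
    sale_date  DATE    ENCODE delta32k  -- differences between adjacent values fit in 2 bytes
);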

LZO Encoding


Designed to work best with CHAR and VARCHAR data that store long character strings

Is a portable lossless data compression library written in ANSI C

Offers fast compression and extremely fast decompression

Includes slower compression levels that achieve a quite competitive compression ratio while still decompressing at this very high speed

Often implemented with a tool called LZOP

Lempel–Ziv–Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed. LZO encoding provides a high compression ratio with good performance and is designed to work well with character data. It is especially effective for CHAR and VARCHAR columns that store very long character strings, particularly free-form text such as product descriptions, user comments, or JSON strings.
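
For instance, a long free-form text column on a hypothetical table is a typical candidate:

CREATE TABLE product_reviews
(
    product_id   INTEGER,
    review_text  VARCHAR(65535) ENCODE lzo  -- long, variable free-form text
);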

Mostly Encoding


Mostly encodings are useful when the data type for a column is larger than the majority of the stored values require. By specifying a mostly encoding for this type of column, you can compress the majority of the values in the column to a smaller standard storage size. The remaining values that cannot be compressed are stored in their raw form.
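
Amazon Redshift provides MOSTLY8, MOSTLY16, and MOSTLY32 variations. As a hedged sketch with hypothetical columns whose declared types are wider than most of their values require:

CREATE TABLE page_hits
(
    hit_id   BIGINT  ENCODE mostly32,  -- most values fit in 4 bytes; larger outliers stay raw
    site_id  INTEGER ENCODE mostly8    -- most values fit in 1 byte
);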

Runlength Encoding


Runlength encoding replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences (the length of the run), which is where the name comes from. A separate dictionary of unique values is created for each block of column values on disk. This encoding is best suited to a table in which data values are often repeated consecutively, for example, when the table is sorted by those values.
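
As an illustration (hypothetical table), a column with long runs of repeated values, and which is not the sort key, is a good fit:

CREATE TABLE weather_readings
(
    reading_time   TIMESTAMP,
    sky_condition  CHAR(10) ENCODE runlength  -- values such as 'CLOUDY' repeat over long runs
);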

Text255 and Text32k Encodings


Text255 and Text32k encodings are useful for compressing VARCHAR columns only. Both compression techniques work best when the same words recur often. A separate dictionary of unique words is created for each block of column values on disk. The Text255 dictionary contains the first 245 unique words in the column. Those words are replaced on disk by a one-byte index value representing one of the 245 values, and any words that are not represented in the dictionary are stored uncompressed. This process is repeated for each block.

For the Text32k encoding, the principle is the same, but the dictionary for each block does not capture a specific number of words. Instead, the dictionary indexes each unique word it finds until the combined entries reach a length of 32K, minus some overhead. The index values are stored in two bytes.
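
A brief sketch on a hypothetical table; both encodings apply only to VARCHAR columns:

CREATE TABLE job_postings
(
    job_title    VARCHAR(100)  ENCODE text255,  -- short titles built from frequently recurring words
    job_details  VARCHAR(5000) ENCODE text32k   -- longer text; the per-block dictionary grows to about 32K
);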

ANALYZE COMPRESSION

ANALYZE COMPRESSION

[ [ table_name ]

[ ( column_name [, ...] ) ] ]

[COMPROWS numrows]

table_name – You can optionally specify a table_name to analyze a single table. If you do not specify a table_name, all of the tables in the currently connected database are analyzed. You cannot specify more than one table_name with a single ANALYZE COMPRESSION statement. You can also analyze compression for temporary tables.

column_name – If you specify a table_name, you can also specify one or more columns in the table (as a comma-separated list within parentheses).

COMPROWS – The number of rows to be used as the sample size for compression analysis. The analysis is run on rows from each data slice. For example, if you specify COMPROWS 2000000 (2,000,000) and the system contains 4 slices in total, no more than 500,000 rows per slice are read and analyzed. If COMPROWS is not specified, the sample size defaults to 100,000 rows per slice.

numrows – The number of rows to be used as the compression sample size. The accepted range for numrows is 1000 (1,000) to 1000000000 (1,000,000,000).

The ANALYZE COMPRESSION command performs compression analysis and produces a report with the suggested column encoding schemes for the tables analyzed. ANALYZE COMPRESSION does not modify the column encodings of the table; it merely makes suggestions. To implement the suggestions, you must recreate the table, or create a new table with the same schema. ANALYZE COMPRESSION does not consider runlength encoding on any column that is designated as a SORTKEY, because range-restricted scans might perform poorly when SORTKEY columns are compressed much more highly than other columns. ANALYZE COMPRESSION acquires an exclusive table lock, which prevents concurrent reads and writes against the table, so only run the ANALYZE COMPRESSION command when the table is idle.
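
For example, using a hypothetical lineitem table, you can analyze every column of the table, or only selected columns with an explicit sample size:

ANALYZE COMPRESSION lineitem;

ANALYZE COMPRESSION lineitem (l_shipdate, l_comment) COMPROWS 1000000;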

COPY

[Image: COPY command syntax, in two parts]

The above example (two parts) gives the syntax for COPY from Amazon S3, COPY from Amazon EMR, and COPY from a remote host (COPY from SSH).
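
As a brief sketch of the COPY from Amazon S3 form (the table, bucket, and IAM role below are placeholders), COPY can also perform the automatic compression analysis mentioned earlier when COMPUPDATE is ON:

COPY sales
FROM 's3://mybucket/data/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
COMPUPDATE ON;  -- sample the incoming data and apply column encodings automatically on an initial load into an empty table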
