Chapter 2 – Greenplum Table Structures

“Let me once again explain the rules. The Greenplum Data Warehouse Rules!”

- Tera-Tom Coffing

The Concepts of Greenplum Tables

1. Tables are either distributed by hash or distributed randomly.

2. Tables are either stored in a heap or are append-only tables.

3. The rows of a table are unsorted in a heap by default, or they can be physically sorted with a clustered index.

4. Tables are stored physically on disk in either a row or columnar format.

5. Tables can be partitioned.

6. Tables are either permanent, temporary or external tables.

7. Tables can have Primary and Foreign Key constraints, although Foreign Key constraints are not enforced.

8. Tables can have Unique constraints and other Boolean (check) constraints.

9. Compression techniques are supported at the table or column level.

Above are the basic concepts of Greenplum tables. The next several pages cover each point one at a time, so you can see exactly what is going on.

Tables are Either Distributed by Hash or Random

image

The Greenplum database gives you two choices for table distribution: distributed by hash or distributed randomly. Large fact tables are usually hashed, and smaller tables are often random. When a table is hashed, one or more columns are chosen as the distribution key. In our example above, the Employee_Table (top) is hashed by Employee_No. The random table (bottom) has only six rows, and they are evenly distributed.

A Hash Distributed Table has A Distribution Key

CREATE TABLE Emp_Intl
(Employee_No INTEGER
,Dept_No     SMALLINT
,First_Name  VARCHAR(12)
,Last_Name   CHAR(20)
,Salary      DECIMAL(8,2)
) DISTRIBUTED BY (Employee_No);

image

Above is a basic CREATE TABLE statement for a table with a Distribution Key. You can use one or more columns as the Distribution Key in Greenplum. The values in the key column(s) are run through a hashing formula, and the result is used to distribute the rows of the table across the segments. Picking a good key is essential: an excellent Distribution Key allows for even distribution among the many segments.

Picking A Distribution Key That Is Not Very Unique

CREATE TABLE Emp_Intl
(Employee_No INTEGER
,Dept_No     SMALLINT
,First_Name  VARCHAR(12)
,Last_Name   CHAR(20)
,Salary      DECIMAL(8,2)
) DISTRIBUTED BY (Last_Name);

image

The hash formula distributes like values to the same segment, which can result in skewed data. Pick a good distribution key or you could end up with uneven data. Notice that like values went to the same segment and the data is unevenly spread.
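To see whether a distribution key spreads the rows evenly, you can count the rows per segment using the system column gp_segment_id, which is available on every Greenplum table. A minimal sketch against the Emp_Intl table above:

```sql
-- Count the rows each segment holds.
-- Large differences between segments indicate skew.
SELECT gp_segment_id,
       COUNT(*) AS row_count
FROM   Emp_Intl
GROUP  BY gp_segment_id
ORDER  BY gp_segment_id;
```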

Random Distribution Uses a Round Robin Technique

CREATE TABLE Emp_Intl
(Employee_No INTEGER
,Dept_No     SMALLINT
,First_Name  VARCHAR(12)
,Last_Name   CHAR(20)
,Salary      DECIMAL(8,2)
) DISTRIBUTED RANDOMLY;

image

Above is a basic CREATE TABLE statement for a table that is distributed randomly across all segments. That means the rows are distributed in a round-robin fashion to ensure even distribution. This should be done for relatively small tables, or for tables that don't have a column or column combination that will provide reasonably even distribution.

Tables Will Be Distributed Among All Segments

image

Above we see 12 segments and five tables. Each table is spread across all 12 segments. All five tables above are row-based tables; some are hash distributed and some are randomly distributed. Study the data and understand that tables are spread across all segments in order to take full advantage of parallel processing. Greenplum was born to be parallel. Also understand that all segments have the exact same table structures, but each segment is responsible for different rows.

The Default For Distribution Chooses the First Column

image

When no distribution is defined and the table is created, the default is a hash distribution policy. Greenplum will use either the PRIMARY KEY (if the table has one) or the first column of the table as the distribution key.
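As a sketch of that default behavior, the following CREATE TABLE (an illustrative example, not from the book) has no DISTRIBUTED clause and no PRIMARY KEY, so Greenplum falls back to hashing on the first column:

```sql
-- No DISTRIBUTED clause and no PRIMARY KEY, so the first column
-- (Employee_No) becomes the distribution key by default.
CREATE TABLE Emp_Default
(Employee_No INTEGER
,Dept_No     SMALLINT
,Last_Name   CHAR(20));
```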

Tables are Either a Heap or Append-Only

image

By default, Greenplum Database stores a table in an unsorted heap. Heap tables allow data to be deleted or updated after it is initially loaded. Append-only table storage works best with denormalized fact tables in a data warehouse environment, where the data is not updated after it is loaded. Append-only tables save about 20 bytes per row because they do not carry the per-row update visibility information. Append-only tables do not allow UPDATE and DELETE operations, and single-row INSERT statements are not recommended because they are slow.
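Append-only storage is chosen at table-creation time with the WITH clause. A minimal sketch, with illustrative table and column names:

```sql
-- appendonly=true stores the table in append-only format,
-- avoiding the roughly 20 bytes of per-row visibility overhead.
CREATE TABLE Sales_Fact_AO
(Sale_Id   INTEGER
,Sale_Date DATE
,Amount    DECIMAL(10,2))
WITH (appendonly=true)
DISTRIBUTED BY (Sale_Id);
```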

Tables are Stored in Either Row or Columnar Format

image

A table is stored in either a row format or a columnar format. Traditionally, most systems have stored the rows of a table in a row format (row store). When a query runs on the table, the entire block of rows must be moved from disk into memory, where it is processed. This works well when all (or most) columns are needed to satisfy the query. Modern system designs now often include a column format (column store). This works extremely well for queries that don't need all (or most) columns, such as analytics and aggregations. Only the columns needed are then transferred from disk into memory. Greenplum gives you a choice of row, column, or both.

Creating a Column Oriented Table

image

Column-oriented tables must use append-only storage.

If rows are frequently inserted into the table, a row-oriented table is better optimized for write operations.

A column-oriented table stores each column in different blocks. A segment still gets the entire row, but only needs to move into memory the column(s) required to satisfy the current query. This can save a lot of time and data movement for queries that do not need all of the columns to satisfy the answer set. It works quite well for aggregating data, ordered analytics, etc. The only major issue is that column-oriented tables must be append-only; however, there are substantial savings with column-oriented tables because the compression rates are so much better than for row-oriented tables.
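A column-oriented table is created by combining appendonly=true with orientation=column in the WITH clause. A minimal sketch, with illustrative names:

```sql
-- orientation=column stores each column in its own blocks.
-- Column orientation requires appendonly=true.
CREATE TABLE Sales_Fact_Col
(Sale_Id   INTEGER
,Sale_Date DATE
,Amount    DECIMAL(10,2))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (Sale_Id);
```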

Comparing Normal Table vs. Columnar Tables

image

Above is a picture of the same table stored as a row-based (top) and column-based design. Notice that either way the node gets the entire row, but Greenplum has the option of storing it in either a row-based or column-based design. The row-based data is stored in one giant block so whenever the table is queried the entire block must move from disk into memory. The column-based design allows individual columns to move from disk to memory.

Columnar can move just One Column Block Into Memory

image

Columnar is brilliant when a query only needs a small portion of the columns from a table to satisfy the query. This is also considered vertical partitioning. Why eat the whole cake when you can take just a piece?

Segments on Distributions are aligned to Rebuild a Row

image

Instead of moving an entire block containing all columns and throwing out the ones you don't need, a columnar design retrieves only the columns needed to satisfy the query. Because an entire row stays on the same segment, the segment can align its column blocks to rebuild any row when necessary.

Columnar Tables Store Each Column in Separate Blocks

image

This is the same data you saw on the previous page! The difference is that the above is a columnar design. I have color coded this for you. There are eight rows in the table and five columns. Notice that the entire row stays on the same segment, but each column is in a separate block. This is a brilliant design for ad hoc queries and analytics because when only a few columns are needed, columnar moves just the blocks it needs. Columnar can't be beat for these queries because the blocks are so much smaller, and what isn't needed isn't moved.

Visualize the Data – Rows vs. Columns

image

Both examples above have the same data and the same amount of data. If your applications tend to analyze the majority of columns or read the entire table, then a row-based system (top example) is the better fit, because entire rows move into memory together. Columnar tables are advantageous when only a few columns need to be read. This is just one of the reasons that analytics goes with columnar like bread goes with butter. A row-based system must move the entire block into memory even if it only needs to read one row or even a single column. If a user above needed to analyze only the Salary column, the columnar system would move 80% less block mass.

Table Rows are Either Sorted or Unsorted

image

The rows of a table are either sorted or unsorted. If the table has a clustered index it is sorted, but if it does not have a clustered index then it is unsorted, which is referred to as a heap. You can only have one clustered index per table because you can only sort a table one way. Sorting has nothing to do with a distribution key or a Random table, but once the rows are placed on a segment they are either sorted (clustered index) or unsorted (heap).

Creating a Clustered Index in Order to Physically Sort Rows

image

Above, we have sorted the Order_Cluster table on each segment by Order_Date. You can have one clustered index on a table because you can only sort the rows one specific way. Having a Clustered Index on a Date column will help with range queries, because the data on each segment is sorted by date. CLUSTER is not supported with append-only tables or column-oriented tables.
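One way to arrive at the sorted table described above is to create an index on Order_Date and then run CLUSTER against it. A sketch under the assumption of an index named Order_Date_Idx (the exact CLUSTER syntax can vary by Greenplum/PostgreSQL version):

```sql
-- Build an index on the sort column, then physically reorder
-- the heap table on each segment to match the index order.
CREATE INDEX Order_Date_Idx ON Order_Cluster (Order_Date);

CLUSTER Order_Date_Idx ON Order_Cluster;
```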

Physically Ordered Tables Are Faster on Certain Queries

image

SELECT * FROM Order_Cluster
WHERE Order_Date BETWEEN '2015-05-01' AND '2015-05-31';

Range queries on date columns can benefit greatly from a clustered index. The above table is physically sorted on each segment by the column Order_Date. The query above won't have to do a Full Table Scan (FTS); instead it reads only the rows that fall within the sorted range.

Another Way to Create a Clustered Table

image

This is a different way to create a table that is sorted: the table is built already ordered, rather than being sorted afterward with CLUSTER. As before, you can have only one sort order per table, and sorting on a date column will help with range queries. Warning: you cannot drop a table that has dependencies such as views.
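One common pattern for building a pre-sorted table is CREATE TABLE AS with an ORDER BY. A sketch under the assumption that the unsorted source table is named Order_Table (both names here are illustrative):

```sql
-- Build a new copy of the data, sorted by Order_Date as it is written,
-- while keeping the hash distribution on Order_Number.
CREATE TABLE Order_Cluster2 AS
SELECT *
FROM   Order_Table
ORDER  BY Order_Date
DISTRIBUTED BY (Order_Number);
```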

Creating a B-Tree Index and then Running Analyze

CREATE TABLE Emp_2000
(Employee_No INTEGER
,Dept_No     INTEGER
,Last_Name   VARCHAR(1000)
) DISTRIBUTED BY (Dept_No);

CREATE INDEX Emp_Idx ON Emp_2000 (Employee_No);

ANALYZE Emp_2000;

EXPLAIN
Select *  FROM Emp_2000
WHERE Employee_No = 1000020;

Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..200.32 rows=1 width=64)
  ->  Index Scan using emp_idx on emp_2000 (cost=0.00..200.32 rows=1 width=64)
        Index Cond: (employee_no = 1000020)

Greenplum provides the index methods B-tree, bitmap, and GiST. The default is a B-tree. Above, we have created a table and loaded it with over 72,000 rows. We then created a B-tree index (non-unique). We ran statistics on the table using the Analyze command. We then typed the keyword EXPLAIN in front of our query to see what type of scan would take place. An Index Scan was utilized. We now know that the index on Employee_No is being used by the system.

Creating a Bitmap Index

CREATE TABLE Emp_75000
(Employee_No INTEGER
,Dept_No     INTEGER
,Last_Name   VARCHAR(1000)
) DISTRIBUTED BY (Employee_No);

CREATE INDEX Dept_bmp_Idx ON Emp_75000 USING bitmap (Dept_No);

ANALYZE Emp_75000;

EXPLAIN
Select *  FROM Emp_75000
WHERE Dept_No = 1000021;

Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..201.40 rows=1 width=64)
  ->  Index Scan using dept_bmp_idx on emp_75000 (cost=0.00..201.40 rows=1 width=64)
        Index Cond: (dept_no = 1000021)

Greenplum provides the index methods B-tree, bitmap, and GiST. The default is a B-tree. Above, we have created a table and loaded it with over 75,000 rows. We then created a Bitmap Index. We ran statistics on the table using the Analyze command. We then typed the keyword EXPLAIN in front of our query to see what type of scan would take place. A Bitmap Index Scan was utilized. We now know that the index on Dept_No is being used by the system.

Why Create a Bitmap Index?

Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE clause on large data warehouse tables that have few UPDATE and DELETE modifications.

Each bit in the bitmap corresponds to a possible tuple ID. If the bit is set, the row with the corresponding tuple ID contains the key value. A mapping function converts the bit position to a tuple ID.

Bitmaps reduce normal index size because they are compressed for storage.

Before compression, the size of a bitmap index is roughly the number of rows in the table times the number of distinct values in the indexed column (one bit per row for each distinct value); compression then shrinks it dramatically.

image

Greenplum provides the index methods B-tree, bitmap, and GiST; the default is a B-tree. Bitmap indexes are best used when a query ANDs together conditions on multiple columns that each have a bitmap index.
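As a sketch of that multi-condition pattern, assume a large Claims table with two low-cardinality columns (the table, column, and index names here are hypothetical):

```sql
-- Two low-cardinality columns, each with its own bitmap index.
CREATE INDEX State_Bmp_Idx ON Claims USING bitmap (State);
CREATE INDEX Type_Bmp_Idx  ON Claims USING bitmap (Claim_Type);

-- The two bitmaps can be ANDed together to find qualifying
-- rows before touching the table blocks themselves.
SELECT *
FROM   Claims
WHERE  State = 'OH'
AND    Claim_Type = 'AUTO';
```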

Tables Can Be Partitioned

Greenplum Database supports both range and list partitioning.

Range partitioning is based on a numerical range, such as a date range or price range.

List partitioning is based on a list of values, such as regions, state codes, or products.

Greenplum supports multi-level partitioning so a combination of both types is allowed.

Table partitioning logically divides large tables, such as Fact tables, into smaller, more manageable tables. Partitioned tables improve query performance through partition elimination: instead of performing a Full Table Scan (FTS), only the data partitions needed to satisfy the query are scanned. Partitioning does not change the physical distribution of table data across segments; it changes the way each segment organizes its rows. A partitioned table can also help with data maintenance tasks, such as rolling old data out of the data warehouse or loading new data into the data warehouse.

A Table Partitioned By Range (Per Month)

CREATE TABLE Ord_Tbl_Part
(Order_Number    integer
,Customer_Number integer
,Order_Date      date
,Order_Total     decimal(10,2))
DISTRIBUTED BY (Order_Number)
PARTITION BY RANGE (Order_Date)
(START (date '2015-01-01') INCLUSIVE
 END   (date '2015-12-31') EXCLUSIVE
 EVERY (INTERVAL '1 month'));

image

Above is the CREATE statement for the Ord_Tbl_Part table. This table is distributed by hash on the column Order_Number, and that is how the rows are placed on the proper segments, but it is partitioned by Order_Date. This partitions the data on each segment by month. This physical partitioning allows for faster loads and faster maintenance (inserts, updates, deletes). This is the design you want when users are performing range queries on dates.

A Visual of a Partitioned Table by Range (Month)

image

SELECT * FROM Ord_Tbl_Part
WHERE Order_Date BETWEEN '2015-05-01' AND '2015-05-31';

Each segment holds rows that were hash distributed by Order_Number, but once the rows for the table arrive on their respective segments they are sorted by month of Order_Date. Each month is stored in separate blocks. The above range query will not do a Full Table Scan (FTS). Each segment merely needs to read their May blocks.

Tables Can Be Partitioned by Day

CREATE TABLE Ord_Tbl_Day
(Order_Number    integer
,Customer_Number integer
,Order_Date      date
,Order_Total     decimal(10,2))
DISTRIBUTED BY (Order_Number)
PARTITION BY RANGE (Order_Date)
(START (date '2015-01-01') INCLUSIVE
 END   (date '2015-12-31') EXCLUSIVE
 EVERY (INTERVAL '1 day'));

image

This partitions the data on each segment by day. It is my opinion that it is better to partition by day than by month.

Visualize a Partitioned Table by Day

image

There are 365 Partitions per year (366 during Leap Year)

This partitions the data on each segment by day. It is my opinion that it is better to partition by day than by month.

Creating a Partitioned Table Using a List

CREATE TABLE Employee_List
(Employee_No integer,
 Dept_No     smallint,
 Last_name   char(20),
 First_name  varchar(12),
 Salary      decimal(8,2))
DISTRIBUTED BY (Employee_No)
PARTITION BY LIST (Dept_No)
(PARTITION Dept100 VALUES (100),
 PARTITION Dept200 VALUES (200),
 PARTITION Dept300 VALUES (300),
 PARTITION Dept400 VALUES (400),
 PARTITION Dept500 VALUES (500),
 DEFAULT PARTITION Others);

This creates a table partitioned by a list of Dept_No values, with a default partition to catch any other value.
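A query that filters on the list column lets the optimizer scan only the matching partition. For example, against the Employee_List table above:

```sql
-- Only the Dept200 partition needs to be scanned;
-- the other partitions are eliminated.
SELECT *
FROM   Employee_List
WHERE  Dept_No = 200;
```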

Creating a Multi-Level Partitioned Table

CREATE TABLE Sales_Intl
(Product_id  int
,Sale_Date   date
,Daily_Sales decimal(8,2)
,Region      text)
DISTRIBUTED BY (Product_id)
PARTITION BY RANGE (Sale_Date)
SUBPARTITION BY LIST (Region)
SUBPARTITION TEMPLATE
(SUBPARTITION usa VALUES ('usa'),
 SUBPARTITION asia VALUES ('asia'),
 SUBPARTITION europe VALUES ('europe'),
 DEFAULT SUBPARTITION other_regions)
(START (date '2015-01-01') INCLUSIVE
 END   (date '2017-01-01') EXCLUSIVE
 EVERY (INTERVAL '1 month'));

The primary partition level of this table is a range on the Sale_Date column: it is partitioned by month and holds two years of data. Within each month, the table is subpartitioned by a list on the Region column. This is an example of multi-level partitioning.

Changing a Table to a Partitioned Table

CREATE TABLE sales2 (LIKE sales_table)
PARTITION BY RANGE (sale_date)
(START (date '2015-01-01') INCLUSIVE
 END   (date '2016-01-01') EXCLUSIVE
 EVERY (INTERVAL '1 month'));

INSERT INTO sales2 SELECT * FROM sales_table;

DROP TABLE sales_table;

ALTER TABLE sales2 RENAME TO sales_table;

GRANT ALL PRIVILEGES ON sales_table TO admin;
GRANT SELECT ON sales_table TO guest;

Partitioned tables are created with the CREATE TABLE statement. A normal table cannot be altered directly into a partitioned table. Instead, you must first create an empty partitioned table, then load the data from the original table into the new table, next drop the original table, and then rename the partitioned table with the original table’s name. Finally, you must also re-grant any table permissions. Above is an example of every step needed.

Not Null Constraints

CREATE TABLE Employee_table
(Employee_No integer NOT NULL,
 Dept_No     smallint,
 Last_name   char(20),
 First_name  varchar(12),
 Salary      decimal(8,2))
DISTRIBUTED BY (Employee_No);

A NOT NULL constraint will not allow a NULL value in that column. A not-null constraint is always written as a column constraint, as in the example above.

Unique Constraints

CREATE TABLE Employee_table
(Employee_No integer NOT NULL UNIQUE,
 Dept_No     smallint,
 Last_name   char(20),
 First_name  varchar(12),
 Salary      decimal(8,2))
DISTRIBUTED BY (Employee_No);

A Unique constraint will ensure that the data contained in a column is unique within the entire table. Any duplicate values will be rejected. Remember, the table must be hash-distributed (not DISTRIBUTED RANDOMLY), and the constraint columns must be the same as (or a superset of) the table’s distribution key columns, just like the example above.

Unique Constraints That Fail

image

ERROR [42000] [Nexus Chameleon][ODBC Greenplum Wire Protocol driver][Greenplum]
ERROR: UNIQUE constraint and DISTRIBUTED BY definitions incompatible
(Hint: When there is both a UNIQUE constraint and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be equal to or a left-subset of the UNIQUE columns; File analyze.c; Line 3415; Routine transformDistributedBy;)
CREATE TABLE Command Failed.

We have a UNIQUE constraint on the column Employee_No and we have a Distribution Key on Last_Name. This is not allowed with a UNIQUE constraint.
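A sketch of the incompatible definition that produces the error above (the table name here is illustrative):

```sql
-- Fails: the distribution key (Last_Name) is not part of
-- the UNIQUE column set (Employee_No).
CREATE TABLE Emp_Unique_Fail
(Employee_No INTEGER NOT NULL UNIQUE
,Last_Name   CHAR(20))
DISTRIBUTED BY (Last_Name);
```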

Primary Key Constraints

CREATE TABLE Employee_table
(Employee_No integer PRIMARY KEY,
 Dept_No     smallint,
 Last_name   char(20),
 First_name  varchar(12),
 Salary      decimal(8,2))
DISTRIBUTED BY (Employee_No);

Foreign keys are not supported. You
can declare them, but referential
integrity is never enforced.

When you create a primary key constraint you are really creating a combination of a UNIQUE constraint and a NOT NULL constraint simultaneously. The table must be hash-distributed (not DISTRIBUTED RANDOMLY), and the primary key columns must be the same as (or a superset of) the table’s distribution key columns. If a table has a primary key, this column (or group of columns) is chosen as the distribution key for the table by default.

A Primary Key Automatically Creates a Unique Index

image

ERROR [23000] [Nexus Chameleon][ODBC Greenplum Wire Protocol driver][Greenplum]ERROR: duplicate key violates unique constraint "emp_tbl_pk_pkey" INSERT Command Failed.

A PRIMARY KEY CONSTRAINT automatically and implicitly creates a UNIQUE INDEX. That UNIQUE INDEX must include the distribution key column(s) (plus any partitioning keys). The PRIMARY KEY UNIQUE constraint is enforced across the entire table, including all table partitions. You cannot have duplicate values in a Primary Key column.
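A sketch of two inserts against the Employee_table defined on the previous page; the second violates the primary key and is rejected:

```sql
-- First insert succeeds.
INSERT INTO Employee_table
VALUES (1, 100, 'Jones', 'Sam', 50000.00);

-- Second insert fails with "duplicate key violates unique
-- constraint" because Employee_No 1 already exists.
INSERT INTO Employee_table
VALUES (1, 200, 'Smith', 'Ann', 60000.00);
```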

Check Constraints

CREATE TABLE Employee_table
(Employee_No integer,
 Dept_No     smallint CHECK (Dept_No < 500),
 Last_name   char(20),
 First_name  varchar(12),
 Salary      decimal(8,2))
DISTRIBUTED BY (Employee_No);


A Check constraint will allow you to specify that the value falls into a certain range. The column must satisfy a Boolean (truth-value) expression.

Creating an Automatic Number Called a Sequence

CREATE SEQUENCE Emp_Number START 1000000;

After you create a sequence, you use the nextval function.

INSERT INTO Employee_Tbl
VALUES (nextval('Emp_Number'), 400, 'Mike', 'Daly', 75000.50);

INSERT INTO Employee_Tbl
VALUES (nextval('Emp_Number'), 300, 'Anita', 'Foxx', 85000.25);

SELECT *
FROM Employee_Tbl ;

image

The CREATE SEQUENCE command creates and initializes a sequence with the given sequence name. The sequence name must be distinct from the name of any other sequence, table, index, or view in the same schema. This is a way to get sequential numbers for a table.

Multiple INSERT example using a Sequence

CREATE TABLE Orders_Day_Partitioned
(Order_Number    integer
,Customer_Number integer
,Order_Date      date
,Order_Total     decimal(10,2))
DISTRIBUTED BY (Order_Number)
PARTITION BY RANGE (Order_Date)
(START (date '2015-01-01') INCLUSIVE
 END   (date '2015-12-31') EXCLUSIVE
 EVERY (INTERVAL '1 day'));

image

We have created a table partitioned by day. We also created a sequence. We then inserted four rows in one statement.
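The statements behind that description might look like the sketch below; the sequence name Order_Seq and the data values are assumptions:

```sql
CREATE SEQUENCE Order_Seq START 5000;

-- One INSERT statement placing four rows, each drawing
-- its Order_Number from the sequence.
INSERT INTO Orders_Day_Partitioned VALUES
 (nextval('Order_Seq'), 101, '2015-03-01', 250.00),
 (nextval('Order_Seq'), 102, '2015-03-02',  99.95),
 (nextval('Order_Seq'), 103, '2015-03-02', 135.50),
 (nextval('Order_Seq'), 104, '2015-03-03',  48.25);
```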
