Chapter 19 – Kognitio Explain

“Write a wise saying and your name will live forever.”

- Anonymous

How to See an EXPLAIN Plan

image

Diagnostic on kognitioaws for 7.02.01-rel120813:

1 We apply 1 shared lock on table EMPLOYEE_TABLE(1086).

2 We select rows from the randomly distributed table EMPLOYEE_TABLE(1086).
From these rows, a result set will be generated containing 5 columns.
The results will be prepared to be fetched by the interpreter.
Approximately 9 rows will be in the result set with an estimated cost of
0.039.

3 We fetch rows and send them to the user.

-- Estimated Total Cost : 0.047

To get an EXPLAIN you can either type the word EXPLAIN in front of any Kognitio SQL or click on the magnifying glass of Nexus.

Seeing an EXPLAIN Plan with Nexus

image

To get an EXPLAIN you can either type the word EXPLAIN in front of any Kognitio SQL or click on the magnifying glass of Nexus. You can also press the function key F6.

The Eight Rules to Reading an EXPLAIN Plan

Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..2.09 rows=9 width=43)
-> Seq Scan on emp_table (cost=0.00..2.09 rows=5 width=43)

1.Always read an EXPLAIN plan from the bottom to the top.

2.A query plan is a tree of nodes.

3.Each node feeds rows into the node directly above it.

4.Each node represents a single operation.

5.Operations include table scans, joins, sorts, aggregations, etc.

6.The bottom nodes of a plan are usually table scan operations.

7.The three type of table scans will be sequential, index or bitmap index scans.

8.If the query requires joins, aggregations, sorts, etc. on the rows,
there will be additional nodes above the scan nodes to perform these functions.

To get an EXPLAIN you can either type the word EXPLAIN in front of any Kognitio SQL or click on the magnifying glass of Nexus.

Interpreting Keywords in an EXPLAIN Plan

The three main keywords you need to understand are cost, rows and width.

Great Advice: When comparing plans look at the cost at the top! This contains the estimated total execution cost for the plan.

Cost – Greenplum stores blocks is something called a page. Cost
measures in units of disk page retrievals. The number 1.0 equals one
sequential disk page read. The first cost number represents the start-
up cost of getting just the first row. The second cost number
represents the total cost of cost of getting all rows. The total cost is
an assumption that all rows will be retrieved, which may or may not
be true.

Rows - Greenplum uses the rows keyword to display the total
number of rows output for an individual plan node. This number is
most often a little less than the number of rows processed or scanned
by the plan node because it is the estimated count of the WHERE
clause condition. The general idea though is that the estimate for
the topmost node equals the approximate number of rows that the
query actually returns.

Width - The total bytes of all the rows that this plan node outputs.

Above are the three main keywords in an EXPLAIN plan.

Interpreting an EXPLAIN Plan

image

Understanding the EXPLAIN is key. You will soon be able to compare different techniques to see which is the best path.

A Single Segment Retrieve – The Fastest Query

image

The fastest query is when you utilize the Distribution Key in the WHERE clause with an equality statement. It is a single segment retrieve. Kognitio knows which segment the rows was distributed to so it knows which segment to find it on.

EXPLAIN With an ORDER BY Statement

SELECT * FROM Dept_Table

ORDER BY Department_Name  ;

image

Above are the metrics retrieved from this EXPLAIN plan.

EXPLAIN ANALYZE

EXPLAIN ANALYZE

SELECT * FROM Department_Table

ORDER BY Department_Name  ;

Gather Motion 2:1 (slice1; segments: 2) (cost=2.11..2.12 rows=5 width=46)

Merge Key: department_name

Rows out: 5 rows at destination with 1.655 ms to first row, 1.658 ms to end,
start offset by 6.930 ms.

->  Sort  (cost=2.11..2.12 rows=3 width=46)

Sort Key: department_name

Rows out: Avg 2.5 rows x 2 workers. Max 3 rows (seg0) with 0.401 ms to first row,
0.402 ms to end, start offset by 8.077 ms.

Executor memory: 58K bytes avg, 58K bytes max (seg0).

Work_mem used: 58K bytes avg, 58K bytes max (seg0). Workfile: (0 spilling, 0 reused)

->  Seq Scan on department_table (cost=0.00..2.05 rows=3 width=46)

Rows out: Avg 2.5 rows x 2 workers. Max 3 rows (seg0) with 0.122 ms to first row,
0.124 ms to end, start offset by 8.323 ms.

Slice statistics:

 (slice0)    Executor memory: 163K bytes.

 (slice1)    Executor memory: 195K bytes avg x 2 workers, 195K bytes max (seg0).

Work_mem: 58K bytes max.

Statement statistics:

Memory used: 128000K bytes

Total runtime: 8.811 ms

EXPLAIN ANALYZE runs the ANALYZE statement in addition to displaying its plan so it brings back a lot more information. This can often be useful if further analysis is needed on a query.

EXPLAIN With a Range Query on a Table Partitioned By Day

CREATE TABLE Orders_Day_Partitioned (

Order_Number integer

,Customer_Number integer

,Order_Date date

,Order_Total decimal(10,2))

DISTRIBUTED BY (Order_Number)

PARTITION BY RANGE (Order_Date)

( START(date '2015-01-01') INCLUSIVE

END (date '2015-12-31') EXCLUSIVE

EVERY (INTERVAL '1 Day'));

SELECT * FROM Orders_Day_Partitioned
WHERE Order_Date BETWEEN '2015-01-01' AND '2015-01-02' ;

Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..19.85 rows=12 width=25)

   -> Append (cost=0.00..19.85 rows=6 width=25)

        ->   Seq Scan on orders_day_partitioned_1_prt_1 orders_day_partitioned (cost=0.00..9.92 rows=3 width=25)

               Filter: order_date >= '2015-01-01'::date AND order_date <= '2015-01-02'::date

        ->   Seq Scan on orders_day_partitioned_1_prt_2 orders_day_partitioned (cost=0.00..9.92 rows=3 width=25)

               Filter: order_date >= '2015-01-01'::date AND order_date <= '2015-01-02'::date

This EXPLAIN shows that all segments are being used (2), but only 2 partitions are being scanned. There are 365 partitions in the table so only a small portion of the table is being read.

EXPLAIN That Uses a B-Tree Index Scan

Create Table Emp_2000

(Employee_No INTEGER

,Dept_No          INTEGER

,Last_Name      VARCHAR(1000)

)Distributed BY (Dept_No) ;

CREATE INDEX Emp_Idx on Emp_2000 (Employee_No) ;

Analyze Emp_2000 ;

EXPLAIN

Select *  FROM Emp_2000

WHERE Employee_No = 1000020;

Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..200.32 rows=1 width=64)

  ->  Index Scan using emp_idx on emp_2000 (cost=0.00..200.32 rows=1 width=64)

        Index Cond: employee_no = 1000020

Kognitio provides the index methods B-tree, bitmap and GiST. The default is a B-tree. Above, we have created a table and loaded it with over 72,000 rows. We then created a B-tree index (non-unique). We ran statistics on the table using the Analyze command. We then typed the keyword EXPLAIN in front of our query to see what type of scan would take place. An Index Scan was utilized. We now know that the index on Employee_No is being used by the system.

EXPLAIN That Uses a Bitmap Scan

Create Table Emp_75000

(Employee_No INTEGER

,Dept_No          INTEGER

,Last_Name      VARCHAR(1000)

)Distributed BY (Employee_No);

CREATE INDEX Dept_bmp_Idx on Emp_75000 USING bitmap (Dept_No) ;

Analyze Emp_75000 ;

EXPLAIN

Select *  FROM Emp_75000

WHERE Dept_No= 1000021;

Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..201.40 rows=1 width=64)

 -> Index Scan using dept_bmp_idx on emp_75000 (cost=0.00..201.40 rows=1 width=64)

        Index Cond: dept_no = 1000021

Kognitio provides the index methods B-tree, bitmap and GiST. The default is a B-tree. Above, we have created a table and loaded it with over 75,000 rows. We then created a Bitmap Index. We ran statistics on the table using the Analyze command. We then typed the keyword EXPLAIN in front of our query to see what type of scan would take place. A Bitmap Index Scan was utilized. We now know that the index on Dept_No is being used by the system.

EXPLAIN With a Simple Subquery

image

Gather Motion 2:1  (slice2; segments: 2)  (cost=2.21..4.34 rows=6 width=28)

-> Hash EXISTS Join (cost=2.21..4.34 rows=3 width=28)

     Hash Cond: customer_table.customer_number = order_table.customer_number

  ->  Seq Scan on customer_table (cost=0.00..2.05 rows=3 width=28)

  ->  Hash (cost=2.15..2.15 rows=3 width=4)

  -> Redistribute Motion 2:2 (slice1; segments: 2) (cost=0.00..2.15 rows=3 width=4)

        Hash Key: order_table.customer_number

  ->  Seq Scan on order_table (cost=0.00..2.05 rows=3 width=4)

This query will perform an EXIST compare by comparing customers who have
placed orders. If a customer_number from the Customer_Table EXISTS in the
Order_Table the Customer_Table row is returned. To make the compare the system
will Redistribute and Move (Motion) the Order_Table rows to the segment where the
Customer_Table rows reside naturally. This is done be rehashing the Order_Table
by customer_number.
This is all done because the compare key is on
customer_number and the Distribution Key of the Customer_Table is
customer_number, but the Distribution Key of the Order_Table is order_number.

This EXPLAIN does an EXIST query, but only after redistributing (rehashing) the Order_Table by Customer_Number. It then sends the newly hashed rows to the segments where the current Customer_Table rows reside for an EXIST comparison.

EXPLAIN With a Columnar Query

image

Aggregate  (cost=2.17..2.18 rows=1 width=32)

-> Gather Motion 2:1  (slice1; segments: 2) (cost=2.11..2.15 rows=1 width=32)

 -> Aggregate (cost=2.11..2.12 rows=1 width=32)

 ->  Append-only Columnar Scan on emp_tbl_columnar (cost=0.00..2.09 rows=5 width=9)

This EXPLAIN shows that a Columnar Scan is done, which means only the columns needed to satisfy the query (Salary) will be placed into memory. All segments are used, but not all columns from the table. The EXPLAIN also shows that this is an aggregate query. The Gather Motion indicates that data is going to move internally across segments for the final average.

EXPLAIN With a Clustered Index

image

SELECT Claim_Id, Claim_Date, Claim_Amt, Claim_Service

FROM    Claims2

WHERE Claim_Date >= '2003-03-01'

AND       Claim_Date <= '2003-03-04'

Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..2.63 rows=10 width=19)

->  Seq Scan on claims2 (cost=0.00..2.63 rows=5 width=19)

Filter: claim_date >= '2003-03-01'::date AND claim_date <= '2003-03-04'::date

This EXPLAIN shows that the table expects 10 rows and performs a filter on the date range.

The Most Important Concept for Joins is the Distribution Key

Here are the table structures created to show the Join EXPLAIN statement

image

CREATE TABLE  Emp_Table_DeptKey  (

Employee_No             INTEGER

,Dept_No                     SMALLINT

,Last_Name                 CHAR(20)

,First_Name                 VARCHAR(12)

,Salary                          DECIMAL(8,2)

) DISTRIBUTED BY (Dept_No) ;

We have created three tables to show the EXPLAIN information. Notice the Distribution key for each. The will be key when joining these tables together because matching rows must be on the same segment. If not, then Kognitio moves them for the life of the join. This is why you will see keywords like Motion, Broadcast and Hash Distributed.

EXPLAIN With Join that has to Move Data

SELECT E.*,  Department_Name

FROM     Emp_Table as E

INNER JOIN

Dept_Table as D

ON          E.Dept_No = D.Dept_No  ;

image

Information you need to know before seeing the EXPLAIN plan:

The tables are joined together by Dept_No = Dept_No

The Emp_Table has 9 rows, but one Dept_No is Null

The Distribution Key of the Emp_Table is Employee_No

The Distribution Key of the Dept_Table is Dept_No

Check out the important information above.

EXPLAIN With Join that has to Move Data

EXPLAIN SELECT E.*, Department_Name

FROM     Emp_Table as E

INNER JOIN

Dept_Table as D

ON            E.Dept_No = D.Dept_No;

image

What few people in the world understand about joins is that two rows being joined need to be on the same segment. When the Distribution Keys do not match the join keys, then Kognitio must move data to make the join happen.

Changing the Join Query Changes the EXPLAIN Plan

image

What few people in the world understand about joins is that two rows being joined need to be on the same segment. When the Distribution Keys do not match the join keys, then Kognitio must move data to make the join happen.

Analyzing the Tables Structures For a 3-Table Join

image

What few people in the world understand about joins is that two rows being joined need to be on the same segment. When the Distribution Keys do not match the join keys, then Kognitio must move data to make the join happen. Let's take a look at the explain plan to see its explanation.

An EXPLAIN For a 3-Table Join

image

Gather Motion 2:1 (slice2; segments: 2) (cost=4.62..7.20 rows=14 width=50)

->  Hash Join (cost=4.62..7.20 rows=7 width=50)

      Hash Cond: sc.course_id = c.course_id

  ->  Hash Join (cost=2.23..4.58 rows=8 width=29)

         Hash Cond: sc.student_id = s.student_id

     ->  Seq Scan on student_course_table sc (cost=0.00..2.14 rows=7 width=6)

      ->  Hash (cost=2.10..2.10 rows=5 width=31)

        ->   Seq Scan on student_table s (cost=0.00..2.10 rows=5 width=31)

->  Hash (cost=2.24..2.24 rows=6 width=25)

 ->  Broadcast Motion 2:2 (slice1; segments: 2) (cost=0.00..2.24 rows=6 width=25)

 ->  Seq Scan on course_table c (cost=0.00..2.06 rows=3 width=25)

What few people in the world understand about joins is that two rows being joined need to be on the same segment. Because the Student_Course_Table and the Student_Table are joined on Student_ID, and both have a Distribution Key of Student_ID, so the matching rows naturally reside on the same segment. These tables are joined first. After they produce an intermediate answer set, the course_table is then broadcast to both segments for the final join.

Explain of a Derived Table vs. a Correlated Subquery

Both queries will return all columns from the Employee_Table if the
employee makes a salary > Avg(Salary) within their own department.

image

Both queries return the exact same answer set

image

The three rows in the answer set are employees making a greater salary than the average salary within their dept_no. We were able to do this through a correlated subquery and a derived table. Now, we can compare the EXPLAIN plans.

Explain of the Correlated Subquery

Correlated Subquery

SELECT *

FROM Employee_Table as E

WHERE Salary >

  (SELECT AVG(Salary)

  FROM Employee_Table as EE

  WHERE E.Dept_No = EE.Dept_No) ;

Gather Motion 2:1 (slice3; segments: 2) (cost=2.56..4.92 rows=4 width=43)

-> Hash Join (cost=2.56..4.92 rows=2 width=43)

Hash Cond: e.dept_no = "Expr_SUBQUERY".csq_c0

Join Filter: e.salary > "Expr_SUBQUERY".csq_c1

-> Redistribute Motion 2:2 (slice1; segments: 2) (cost=0.00..2.27 rows=5 width=43)

Hash Key: e.dept_no

->  Seq Scan on employee_table e (cost=0.00..2.09 rows=5 width=43)

->  Hash  (cost=2.48..2.48 rows=3 width=34)

->  HashAggregate (cost=2.35..2.42 rows=3 width=34)

 Group By: ee.dept_no

-> Redistribute Motion 2:2 (slice2; segments: 2) (cost=2.13..2.25 rows=3 width=34)

 Hash Key: ee.dept_no

->  HashAggregate (cost=2.13..2.13 rows=3 width=34)

  Group By: ee.dept_no

->  Seq Scan on employee_table ee (cost=0.00..2.09 rows=5 width=11)

The next page shows the EXPLAIN plan of the Derived table. Both plans are close to the same.

Explain of the Derived Table

SELECT E.*

FROM Employee_Table as E

INNER JOIN

(SELECT Dept_No , AVG(Salary) as AVGSAL

FROM Employee_Table  GROUP BY Dept_No) AS TeraTom

ON E.Dept_No = TeraTom.Dept_No

AND Salary > AVGSAL

Gather Motion 2:1 (slice3; segments: 2) (cost=2.56..4.92 rows=4 width=43)

->  Hash Join (cost=2.56..4.92 rows=2 width=43)

 Hash Cond: e.dept_no = teratom.dept_no

 Join Filter: e.salary > teratom.avgsal

->   Redistribute Motion 2:2 (slice1; segments: 2) (cost=0.00..2.27 rows=5 width=43)

 Hash Key: e.dept_no

->   Seq Scan on employee_table e (cost=0.00..2.09 rows=5 width=43)

->  Hash (cost=2.48..2.48 rows=3 width=34)

->  HashAggregate (cost=2.35..2.42 rows=3 width=34)

 Group By: employee_table.dept_no

->  Redistribute Motion 2:2 (slice2; segments: 2) (cost=2.13..2.25 rows=3 width=34)

 Hash Key: employee_table.dept_no

->  HashAggregate (cost=2.13..2.13 rows=3 width=34)

 Group By: employee_table.dept_no

->  Seq Scan on employee_table (cost=0.00..2.09 rows=5 width=11)

The previous page showed the EXPLAIN plan of the Correlated Subquery. Both plans are close to the same.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset