“Write a wise saying and your name will live forever.”
- Anonymous
Diagnostic on kognitioaws for 7.02.01-rel120813:
1 We apply 1 shared lock on table EMPLOYEE_TABLE(1086).
2 We select rows from the randomly distributed table EMPLOYEE_TABLE(1086).
From these rows, a result set will be generated containing 5 columns.
The results will be prepared to be fetched by the interpreter.
Approximately 9 rows will be in the result set with an estimated cost of
0.039.
3 We fetch rows and send them to the user.
-- Estimated Total Cost : 0.047
To get an EXPLAIN you can either type the word EXPLAIN in front of any Kognitio SQL or click on the magnifying glass of Nexus.
To get an EXPLAIN you can either type the word EXPLAIN in front of any Kognitio SQL or click on the magnifying glass of Nexus. You can also press the function key F6.
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..2.09 rows=9 width=43)
-> Seq Scan on emp_table (cost=0.00..2.09 rows=5 width=43)
1.Always read an EXPLAIN plan from the bottom to the top.
2.A query plan is a tree of nodes.
3.Each node feeds rows into the node directly above it.
4.Each node represents a single operation.
5.Operations include table scans, joins, sorts, aggregations, etc.
6.The bottom nodes of a plan are usually table scan operations.
7.The three type of table scans will be sequential, index or bitmap index scans.
8.If the query requires joins, aggregations, sorts, etc. on the rows,
there will be additional nodes above the scan nodes to perform these functions.
To get an EXPLAIN you can either type the word EXPLAIN in front of any Kognitio SQL or click on the magnifying glass of Nexus.
The three main keywords you need to understand are cost, rows and width.
Great Advice: When comparing plans look at the cost at the top! This contains the estimated total execution cost for the plan.
Cost – Greenplum stores blocks is something called a page. Cost
measures in units of disk page retrievals. The number 1.0 equals one
sequential disk page read. The first cost number represents the start-
up cost of getting just the first row. The second cost number
represents the total cost of cost of getting all rows. The total cost is
an assumption that all rows will be retrieved, which may or may not
be true.
Rows - Greenplum uses the rows keyword to display the total
number of rows output for an individual plan node. This number is
most often a little less than the number of rows processed or scanned
by the plan node because it is the estimated count of the WHERE
clause condition. The general idea though is that the estimate for
the topmost node equals the approximate number of rows that the
query actually returns.
Width - The total bytes of all the rows that this plan node outputs.
Above are the three main keywords in an EXPLAIN plan.
Understanding the EXPLAIN is key. You will soon be able to compare different techniques to see which is the best path.
The fastest query is when you utilize the Distribution Key in the WHERE clause with an equality statement. It is a single segment retrieve. Kognitio knows which segment the rows was distributed to so it knows which segment to find it on.
SELECT * FROM Dept_Table
ORDER BY Department_Name ;
Above are the metrics retrieved from this EXPLAIN plan.
EXPLAIN ANALYZE
SELECT * FROM Department_Table
ORDER BY Department_Name ;
Gather Motion 2:1 (slice1; segments: 2) (cost=2.11..2.12 rows=5 width=46)
Merge Key: department_name
Rows out: 5 rows at destination with 1.655 ms to first row, 1.658 ms to end,
start offset by 6.930 ms.
-> Sort (cost=2.11..2.12 rows=3 width=46)
Sort Key: department_name
Rows out: Avg 2.5 rows x 2 workers. Max 3 rows (seg0) with 0.401 ms to first row,
0.402 ms to end, start offset by 8.077 ms.
Executor memory: 58K bytes avg, 58K bytes max (seg0).
Work_mem used: 58K bytes avg, 58K bytes max (seg0). Workfile: (0 spilling, 0 reused)
-> Seq Scan on department_table (cost=0.00..2.05 rows=3 width=46)
Rows out: Avg 2.5 rows x 2 workers. Max 3 rows (seg0) with 0.122 ms to first row,
0.124 ms to end, start offset by 8.323 ms.
Slice statistics:
(slice0) Executor memory: 163K bytes.
(slice1) Executor memory: 195K bytes avg x 2 workers, 195K bytes max (seg0).
Work_mem: 58K bytes max.
Statement statistics:
Memory used: 128000K bytes
Total runtime: 8.811 ms
EXPLAIN ANALYZE runs the ANALYZE statement in addition to displaying its plan so it brings back a lot more information. This can often be useful if further analysis is needed on a query.
CREATE TABLE Orders_Day_Partitioned (
Order_Number integer
,Customer_Number integer
,Order_Date date
,Order_Total decimal(10,2))
DISTRIBUTED BY (Order_Number)
PARTITION BY RANGE (Order_Date)
( START(date '2015-01-01') INCLUSIVE
END (date '2015-12-31') EXCLUSIVE
EVERY (INTERVAL '1 Day'));
SELECT * FROM Orders_Day_Partitioned
WHERE Order_Date BETWEEN '2015-01-01' AND '2015-01-02' ;
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..19.85 rows=12 width=25)
-> Append (cost=0.00..19.85 rows=6 width=25)
-> Seq Scan on orders_day_partitioned_1_prt_1 orders_day_partitioned (cost=0.00..9.92 rows=3 width=25)
Filter: order_date >= '2015-01-01'::date AND order_date <= '2015-01-02'::date
-> Seq Scan on orders_day_partitioned_1_prt_2 orders_day_partitioned (cost=0.00..9.92 rows=3 width=25)
Filter: order_date >= '2015-01-01'::date AND order_date <= '2015-01-02'::date
This EXPLAIN shows that all segments are being used (2), but only 2 partitions are being scanned. There are 365 partitions in the table so only a small portion of the table is being read.
Create Table Emp_2000
(Employee_No INTEGER
,Dept_No INTEGER
,Last_Name VARCHAR(1000)
)Distributed BY (Dept_No) ;
CREATE INDEX Emp_Idx on Emp_2000 (Employee_No) ;
Analyze Emp_2000 ;
EXPLAIN
Select * FROM Emp_2000
WHERE Employee_No = 1000020;
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..200.32 rows=1 width=64)
-> Index Scan using emp_idx on emp_2000 (cost=0.00..200.32 rows=1 width=64)
Index Cond: employee_no = 1000020
Kognitio provides the index methods B-tree, bitmap and GiST. The default is a B-tree. Above, we have created a table and loaded it with over 72,000 rows. We then created a B-tree index (non-unique). We ran statistics on the table using the Analyze command. We then typed the keyword EXPLAIN in front of our query to see what type of scan would take place. An Index Scan was utilized. We now know that the index on Employee_No is being used by the system.
Create Table Emp_75000
(Employee_No INTEGER
,Dept_No INTEGER
,Last_Name VARCHAR(1000)
)Distributed BY (Employee_No);
CREATE INDEX Dept_bmp_Idx on Emp_75000 USING bitmap (Dept_No) ;
Analyze Emp_75000 ;
EXPLAIN
Select * FROM Emp_75000
WHERE Dept_No= 1000021;
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..201.40 rows=1 width=64)
-> Index Scan using dept_bmp_idx on emp_75000 (cost=0.00..201.40 rows=1 width=64)
Index Cond: dept_no = 1000021
Kognitio provides the index methods B-tree, bitmap and GiST. The default is a B-tree. Above, we have created a table and loaded it with over 75,000 rows. We then created a Bitmap Index. We ran statistics on the table using the Analyze command. We then typed the keyword EXPLAIN in front of our query to see what type of scan would take place. A Bitmap Index Scan was utilized. We now know that the index on Dept_No is being used by the system.
Gather Motion 2:1 (slice2; segments: 2) (cost=2.21..4.34 rows=6 width=28)
-> Hash EXISTS Join (cost=2.21..4.34 rows=3 width=28)
Hash Cond: customer_table.customer_number = order_table.customer_number
-> Seq Scan on customer_table (cost=0.00..2.05 rows=3 width=28)
-> Hash (cost=2.15..2.15 rows=3 width=4)
-> Redistribute Motion 2:2 (slice1; segments: 2) (cost=0.00..2.15 rows=3 width=4)
Hash Key: order_table.customer_number
-> Seq Scan on order_table (cost=0.00..2.05 rows=3 width=4)
This query will perform an EXIST compare by comparing customers who have
placed orders. If a customer_number from the Customer_Table EXISTS in the
Order_Table the Customer_Table row is returned. To make the compare the system
will Redistribute and Move (Motion) the Order_Table rows to the segment where the
Customer_Table rows reside naturally. This is done be rehashing the Order_Table
by customer_number. This is all done because the compare key is on
customer_number and the Distribution Key of the Customer_Table is
customer_number, but the Distribution Key of the Order_Table is order_number.
This EXPLAIN does an EXIST query, but only after redistributing (rehashing) the Order_Table by Customer_Number. It then sends the newly hashed rows to the segments where the current Customer_Table rows reside for an EXIST comparison.
Aggregate (cost=2.17..2.18 rows=1 width=32)
-> Gather Motion 2:1 (slice1; segments: 2) (cost=2.11..2.15 rows=1 width=32)
-> Aggregate (cost=2.11..2.12 rows=1 width=32)
-> Append-only Columnar Scan on emp_tbl_columnar (cost=0.00..2.09 rows=5 width=9)
This EXPLAIN shows that a Columnar Scan is done, which means only the columns needed to satisfy the query (Salary) will be placed into memory. All segments are used, but not all columns from the table. The EXPLAIN also shows that this is an aggregate query. The Gather Motion indicates that data is going to move internally across segments for the final average.
SELECT Claim_Id, Claim_Date, Claim_Amt, Claim_Service
FROM Claims2
WHERE Claim_Date >= '2003-03-01'
AND Claim_Date <= '2003-03-04'
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..2.63 rows=10 width=19)
-> Seq Scan on claims2 (cost=0.00..2.63 rows=5 width=19)
Filter: claim_date >= '2003-03-01'::date AND claim_date <= '2003-03-04'::date
This EXPLAIN shows that the table expects 10 rows and performs a filter on the date range.
Here are the table structures created to show the Join EXPLAIN statement
CREATE TABLE Emp_Table_DeptKey (
Employee_No INTEGER
,Dept_No SMALLINT
,Last_Name CHAR(20)
,First_Name VARCHAR(12)
,Salary DECIMAL(8,2)
) DISTRIBUTED BY (Dept_No) ;
We have created three tables to show the EXPLAIN information. Notice the Distribution key for each. The will be key when joining these tables together because matching rows must be on the same segment. If not, then Kognitio moves them for the life of the join. This is why you will see keywords like Motion, Broadcast and Hash Distributed.
SELECT E.*, Department_Name
FROM Emp_Table as E
INNER JOIN
Dept_Table as D
ON E.Dept_No = D.Dept_No ;
Information you need to know before seeing the EXPLAIN plan:
•The tables are joined together by Dept_No = Dept_No
•The Emp_Table has 9 rows, but one Dept_No is Null
•The Distribution Key of the Emp_Table is Employee_No
•The Distribution Key of the Dept_Table is Dept_No
Check out the important information above.
EXPLAIN SELECT E.*, Department_Name
FROM Emp_Table as E
INNER JOIN
Dept_Table as D
ON E.Dept_No = D.Dept_No;
What few people in the world understand about joins is that two rows being joined need to be on the same segment. When the Distribution Keys do not match the join keys, then Kognitio must move data to make the join happen.
What few people in the world understand about joins is that two rows being joined need to be on the same segment. When the Distribution Keys do not match the join keys, then Kognitio must move data to make the join happen.
What few people in the world understand about joins is that two rows being joined need to be on the same segment. When the Distribution Keys do not match the join keys, then Kognitio must move data to make the join happen. Let's take a look at the explain plan to see its explanation.
Gather Motion 2:1 (slice2; segments: 2) (cost=4.62..7.20 rows=14 width=50)
-> Hash Join (cost=4.62..7.20 rows=7 width=50)
Hash Cond: sc.course_id = c.course_id
-> Hash Join (cost=2.23..4.58 rows=8 width=29)
Hash Cond: sc.student_id = s.student_id
-> Seq Scan on student_course_table sc (cost=0.00..2.14 rows=7 width=6)
-> Hash (cost=2.10..2.10 rows=5 width=31)
-> Seq Scan on student_table s (cost=0.00..2.10 rows=5 width=31)
-> Hash (cost=2.24..2.24 rows=6 width=25)
-> Broadcast Motion 2:2 (slice1; segments: 2) (cost=0.00..2.24 rows=6 width=25)
-> Seq Scan on course_table c (cost=0.00..2.06 rows=3 width=25)
What few people in the world understand about joins is that two rows being joined need to be on the same segment. Because the Student_Course_Table and the Student_Table are joined on Student_ID, and both have a Distribution Key of Student_ID, so the matching rows naturally reside on the same segment. These tables are joined first. After they produce an intermediate answer set, the course_table is then broadcast to both segments for the final join.
Both queries will return all columns from the Employee_Table if the
employee makes a salary > Avg(Salary) within their own department.
Both queries return the exact same answer set
The three rows in the answer set are employees making a greater salary than the average salary within their dept_no. We were able to do this through a correlated subquery and a derived table. Now, we can compare the EXPLAIN plans.
Correlated Subquery
SELECT *
FROM Employee_Table as E
WHERE Salary >
(SELECT AVG(Salary)
FROM Employee_Table as EE
WHERE E.Dept_No = EE.Dept_No) ;
Gather Motion 2:1 (slice3; segments: 2) (cost=2.56..4.92 rows=4 width=43)
-> Hash Join (cost=2.56..4.92 rows=2 width=43)
Hash Cond: e.dept_no = "Expr_SUBQUERY".csq_c0
Join Filter: e.salary > "Expr_SUBQUERY".csq_c1
-> Redistribute Motion 2:2 (slice1; segments: 2) (cost=0.00..2.27 rows=5 width=43)
Hash Key: e.dept_no
-> Seq Scan on employee_table e (cost=0.00..2.09 rows=5 width=43)
-> Hash (cost=2.48..2.48 rows=3 width=34)
-> HashAggregate (cost=2.35..2.42 rows=3 width=34)
Group By: ee.dept_no
-> Redistribute Motion 2:2 (slice2; segments: 2) (cost=2.13..2.25 rows=3 width=34)
Hash Key: ee.dept_no
-> HashAggregate (cost=2.13..2.13 rows=3 width=34)
Group By: ee.dept_no
-> Seq Scan on employee_table ee (cost=0.00..2.09 rows=5 width=11)
The next page shows the EXPLAIN plan of the Derived table. Both plans are close to the same.
SELECT E.*
FROM Employee_Table as E
INNER JOIN
(SELECT Dept_No , AVG(Salary) as AVGSAL
FROM Employee_Table GROUP BY Dept_No) AS TeraTom
ON E.Dept_No = TeraTom.Dept_No
AND Salary > AVGSAL
Gather Motion 2:1 (slice3; segments: 2) (cost=2.56..4.92 rows=4 width=43)
-> Hash Join (cost=2.56..4.92 rows=2 width=43)
Hash Cond: e.dept_no = teratom.dept_no
Join Filter: e.salary > teratom.avgsal
-> Redistribute Motion 2:2 (slice1; segments: 2) (cost=0.00..2.27 rows=5 width=43)
Hash Key: e.dept_no
-> Seq Scan on employee_table e (cost=0.00..2.09 rows=5 width=43)
-> Hash (cost=2.48..2.48 rows=3 width=34)
-> HashAggregate (cost=2.35..2.42 rows=3 width=34)
Group By: employee_table.dept_no
-> Redistribute Motion 2:2 (slice2; segments: 2) (cost=2.13..2.25 rows=3 width=34)
Hash Key: employee_table.dept_no
-> HashAggregate (cost=2.13..2.13 rows=3 width=34)
Group By: employee_table.dept_no
-> Seq Scan on employee_table (cost=0.00..2.09 rows=5 width=11)
The previous page showed the EXPLAIN plan of the Correlated Subquery. Both plans are close to the same.