Chapter 6 – Join Functions

“When spider webs unite they can tie up a lion.”

- African Proverb

Hadoop Joins

1.Each Table in Hadoop could hold a portion of a table.

2.For two rows to be Joined together, Hadoop insists that both rows are physically on the same reducer.

3.Hadoop will either Redistribute one or both of the tables or Duplicate the smaller table across all reducers to ensure matching rows are on the same reducer. This is done only for the life of the Join.

image

The three options are a Shuffle, Map or Sort Merge-Bucket Join

Two joining rows have to be in the same memory of a single reducer. The three options are a Shuffle join, Map join or a Sort Merge-Bucket Join.

The Shuffle Join (1 of 2)

image

Above four customers are in the Customer_Table. I have color coded them for you to show how joins work. There are 16 orders in the Order_Total that have been placed by the four customers. The red customer has placed four orders, the pink customer has placed two orders, the green customer has placed five orders and the blue customer has placed five orders. To join the Customer_Table to the Order_Table, Hadoop will have to provide a shuffle join. Watch what happens on the next page.

The Shuffle Join (2 of 2)

image

Hadoop will shuffle the matching rows to the same node's memory in order to join the data. Hadoop uses a math formula (called a Hash Formula) to do this. They just run the Hash Formula on the Customer_Number of each order. This is the most expensive technique for getting the matching rows in the same node's memory.

Shuffle

image

SELECT

 C.Customer_Number, C.Phone_Number ,C.Customer_Name

,O.Customer_Number, O.Order_Number, O.Order_Date, Order_Total

FROM Customer_Table as C

INNER JOIN

Order_Table as O

ON C.Customer_Number = O.Customer_Number ;

Hadoop can shuffle (temporarily) by re-hashing Customer_Number from the Order_Table and Customer_Table. Now, all joining rows will be on the same node's memory. That is one way to get matching rows together.

Map Strategy

image

Hadoop loves when a table is small enough to fit completely in memory. Turn the page and be prepared to be amazed!

Duplication of the Smaller Table across All-AMPs

image

Hadoop took the Department_Table and gathered up all 4-rows (temporarily) and in memory Duplicated the entire 4-row Table across all nodes. Now the joins can happen! This is the fastest way to get rows together. If one table is much bigger than the other, Hadoop will duplicate the smaller table on all nodes, just for the life of the query. This is a map join.

Using Buckets for Table Joins

image

These tables join on Customer_Number, so cluster them into
the same amount of buckets on the join key and speed
happens, because the matching rows are on the same node
.

Hive tables can be organized into buckets, which will determine the way the underlying files are stored. Bucketing is excellent because joins perform faster when the cluster by uses the join columns and the same amount of buckets. Sampling is also more efficient because the data is already split up into smaller pieces. The hive formula is consistent because they hash the column value and then uses the modulo operator on the hash value.

Sort-Merge Bucket Join Needs No Shuffling or Duplication

image

Hadoop always moves data from disk into memory to process a query, but there will be no shuffling because the two tables already have the matching rows on the same node. This was done by using the bucket method from the previous page.

A Two-Table Join Using Traditional Join Syntax

image

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION is which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship, so this join will happen on matching Customer_Number columns.

A Two-Table Join Using ANSI Syntax

image

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION is which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship, so this join will happen on matching Customer_Number columns. ANSI joins use the keyword INNER JOIN with an ON Clause.

Traditional Join Using a Table Alias

image

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means from which column from each table establishes the relationship. Customer_Number is the joining column above.

ANSI Join Using a Table Alias

image

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship.

ANSI Join Using a Table Alias with Keyword AS

image

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship.

ANSI Join Using the Keyword JOIN Instead of INNER JOIN

image

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship.

You Can Fully Qualify All Columns for Clarity

image

Whenever a column is in both tables, you must fully qualify it when doing a join. You don't have to fully qualify tables that are only in one of the tables because the system knows which table that particular column is in. You can choose to fully qualify every column if you like. This is a good practice for anyone else looking at your SQL because it is more apparent which columns belong to which tables.

A Two-Table Join in Action

image

The only rows that return on an INNER JOIN come back because both tables have a matching Customer_Number in both tables. Notice that we had to fully qualify the column Customer_Number because it is in both tables.

Quiz – Can You Finish the Join Syntax?

image

Finish this join by placing the missing SQL in the proper place!

Answer to Quiz – Can You Finish the Join Syntax?

image

This query is ready to run.

Another Way to Write a Join

image

This query is ready to run.

A Cartesian Product Join

image

This query is usually a mistake. There are five customers who have placed six orders and that is why 30 rows came back. A Cartesian Product join will join every row from both tables together so 30 rows returned (6 * 5 =30).

Quiz – Can You Find the Error?

image

This query has an error! Can you find it?

Answer to Quiz – Can You Find the Error?

image

If a column in the SELECT list is in both tables, you must fully qualify it.

Super Quiz – Can You Find the Difficult Error?

image

This query has an error! Can you find it?

Answer to Super Quiz – Can You Find the Difficult Error?

image

If a column in the SELECT list is in both tables, you must fully qualify it. Yet, once you alias a table you cannot use the full tablename again. The system thinks there are three tables now (E, D and Employee_Table) and it errors.

Quiz – Which rows from both tables Won’t Return?

image

An Inner Join returns matching rows. There are three rows in the tables above that will not return on an INNER JOIN. We can only get them to return in a join by using an Outer Join. You will understand soon!

Answer to Quiz – Which rows from both tables Won’t Return?

image

The bottom line is that the three rows excluded did not have a matching Dept_No.

LEFT OUTER JOIN

image

This is a LEFT OUTER JOIN. That means that all rows from the LEFT Table will appear in the report regardless if it finds a match on the right table.

LEFT OUTER JOIN Results

image

A LEFT Outer Join Returns all rows from the LEFT Table including all Matches. If a LEFT row can’t find a match, a NULL is placed on right columns not found!

A LEFT SEMI JOIN Replaces a Subquery

image

Have you ever seen a LEFT SEMI JOIN? This can be done to replace a subquery.

What is a LEFT SEMI JOIN?

image

The common key among both tables is Customer_Number. The LEFT SEMI JOIN in essence accomplishes what a subquery does, but in a join fashion. Hive does not support IN, EXISTS or subqueries in the WHERE clause. You need to write subqueries as a join. A LEFT SEMI JOIN is efficient because no data from the RIGHT table is returned in the answer set.

RIGHT OUTER JOIN

image

This is a RIGHT OUTER JOIN. That means that all rows from the RIGHT Table will appear in the report regardless if it finds a match with the LEFT Table.

RIGHT OUTER JOIN Example and Results

image

All rows from the Right Table were returned with matches, but since Dept_No 500 didn’t have a match, the system put a NULL Value for Left Column values.

FULL OUTER JOIN

image

The is a FULL OUTER JOIN. This means that all rows from both the RIGHT and LEFT Table will appear in the report regardless if it finds a match.

FULL OUTER JOIN Results

image

The FULL Outer Join Returns all rows from both Tables. NULLs show the flaws! Both tables are considered outer tables.

Which Tables are the Left and Which are the Right?

image

Can you list which tables above are left tables and which tables are right tables?

Answer - Which Tables are the Left and Which are the Right?

image

The first table is always the left table and the rest are right tables. The results from the first two tables being joined becomes the left table.

INNER JOIN with an Additional WHERE Clause

image

The additional WHERE is performed first in order to eliminate unwanted data, so the join is less intensive than joining everything first and then eliminating rows that don't qualify.

INNER JOIN with an Additional AND Clause

image

The additional AND is performed first in order to eliminate unwanted data, so the join is less intensive than joining rows first and then eliminating rows that don't qualify. An additional AND or WHERE clause is equivalent in inner joins.

OUTER JOIN with Additional WHERE Clause

image

The additional WHERE is performed last on Outer Joins. All rows will be joined first and then the additional WHERE clause filters after the join takes place.

OUTER JOIN with Additional AND Clause

image

The additional AND is performed in conjunction with the ON statement on Outer Joins. All rows will be evaluated with the ON clause and the AND combined.

OUTER JOIN with Additional AND Clause Results

image

The additional AND is performed in conjunction with the ON statement on Outer Joins. This can surprise you. Only Mandee is in Dept_No 100, so she showed up like expected, but an outer join returns non-matches also. Ouch!!!

Quiz – Why is this Considered an INNER JOIN?

image

This is considered an INNER JOIN because we are doing a LEFT OUTER JOIN on the Employee_Table and then filtering with the WHERE for a column in the right table!

Evaluation Order for Outer Queries

SELECT Cou.*, STU1.*

FROM    COURSE_TABLE Cou

LEFT OUTER JOIN

STUDENT_COURSE_TABLE STU

ON Cou.Course_Id = STU.Course_Id

LEFT OUTER JOIN STUDENT_TABLESTU1

ON STU.Student_Id = STU1.Student_Id;

image

When you perform an inner join, Hadoop considers this to be both commutative and associative. That means that two tables being inner joined will easily come up with the intended answer. This allows the optimizer to select the best join order between tables. This is because the end result will be the same. Outer Joins are different. They will follow the above three rules for evaluation order by the optimizer.

Cartesian Product Join

image

Cartesian Product Joins are usually a mistake, but sometimes it is done to show all possible combinations for stress testing a system for performance tests.

The CROSS JOIN with a WHERE Clause

image

This query becomes a Product Join because a Cross Join is an ANSI Product Join. It will compare every row from the Customer_Table to Order_Number 123456 in the Order_Table. Check out the Answer Set on the next page.

The CROSS JOIN with a WHERE Clause Answer Set

image

This Cross Join produces information that quite often isn’t worth anything, but sometimes it is used for comparisons. For example, you might want to compare the flight costs at your local airport with flight costs from every other airline.

The CROSS JOIN with an ON Clause

image

This query becomes a Product Join because a Cross Join is an ANSI Product Join. It will compare every row from the Customer_Table to Order_Number 123456 in the Order_Table. Check out the Answer Set on the next page.

The CROSS JOIN with an ON Clause Answer Set

image

This Cross Join produces information that quite often isn’t worth anything, but sometimes it is used for comparisons. For example, you might want to compare the flight costs at your local airport with flight costs from every other airline.

The Self Join

image

A Self Join gives itself two different aliases, which is then seen as two different tables.

How would you Join these two tables?

image

How would you join these two tables together? You can’t do it. There is no matching column with like data. There is no Primary Key/Foreign Key relationship between these two tables. That is why you are about to be introduced to a bridge table. It is formally called an Associative table or a Lookup table.

An Associative Table is a Bridge that Joins Two Tables

image

The Associative Table is a bridge between the Course_Table and Student_Table.

Quiz – Can you Write the 3-Table Join?

image

SELECT ALL Columns from the Course_Table and Student_Table and Join them.

Answer – Can you Write the 3-Table Join?

image

Select S.*, C.*

From Student_Table S

 INNER JOIN

  Student_Course_Table SC

ON S.Student_ID = SC.Student_ID

 INNER JOIN

  Course_Table C

ON C.Course_ID = SC.Course_ID;

The above queries show both traditional and ANSI form for this three table join.

Quiz – Can you Place the ON Clauses at the End?

image

Please re-write the above query and place both ON Clauses at the end.

Answer – Can you Place the ON Clauses at the End?

image

This is tricky. The only way it works is to place the ON clauses backwards. The ON Clause represents the last INNER JOIN and then moves backwards. Then you add the AND clause. Wow!

The 5-Table Join – Logical Insurance Model

image

Above is the logical model for the insurance tables showing the Primary Key and Foreign Key relationships (PK/FK).

Quiz - Write a Five Table Join Using ANSI Syntax

image

Your mission is to write a five table join selecting all columns using ANSI syntax.

Answer - Write a Five Table Join Using ANSI Syntax

SELECT

cla1.*, sub1.*, add1.* ,pro1.*, ser1.*

FROMCLAIMS cla1

INNER JOIN

SUBSCRIBERS  sub1

ONcla1.Subscriber_No = sub1.Subscriber_No

ANDcla1.Member_No = sub1.Member_No

INNER JOIN

ADDRESSES  add1

ONsub1.Subscriber_No = add1.Subscriber_No

INNER JOIN

PROVIDERS  pro1

ONcla1.Provider_No = pro1.Provider_Code

INNER JOIN

SERVICES  ser1

ONcla1.Claim_Service = ser1.Service_Code ;

Above is the example writing this five table join using ANSI syntax.

Quiz –Re-Write this putting the ON clauses at the END

SELECT

cla1.*, sub1.*, add1.* ,pro1.*, ser1.*

FROMCLAIMS cla1

INNER JOIN

SUBSCRIBERS  sub1

ONcla1.Subscriber_No = sub1.Subscriber_No

ANDcla1.Member_No = sub1.Member_No

INNER JOIN

ADDRESSES  add1

ONsub1.Subscriber_No = add1.Subscriber_No

INNER JOIN

PROVIDERS  pro1

ONcla1.Provider_No = pro1.Provider_Code

INNER JOIN

SERVICES  ser1

ONcla1.Claim_Service = ser1.Service_Code ;

Above is the example writing this five table join using Non-ANSI syntax.

Answer –Re-Write this putting the ON clauses at the END

SELECT

cla1.*, sub1.*, add1.* ,pro1.*, ser1.*

FROMPROVIDERS pro1

INNER JOIN SERVICES ser1

INNER JOIN ADDRESSES add1

INNER JOIN SUBSCRIBERS sub1

INNER JOIN CLAIMS cla1

ON        cla1.Subscriber_No = sub1.Subscriber_No

AND     cla1.Member_No = sub1.Member_No

AND     sub1.Subscriber_No = add1.Subscriber_No

AND     cla1.Claim_Service = ser1.Service_Code

AND     cla1.Provider_No = pro1.Provider_Code

Above is the example writing this five table join using ANSI syntax with the ON clauses at the end. Additionally, we had to move the tables around to make this happen. Notice that the first ON clause represents the last two tables being joined (Claims and Subscribers), and then the next AND clause works backwards to the next table (Addresses). The key is to remember that the first ON/AND clauses represent the last tables mentioned in the INNER JOIN statement.

The Nexus Query Chameleon Writes the SQL for Users.

image

Let Nexus show users the table relationships and then let Nexus build the SQL. Just load the ERwin logical model inside Nexus and then all users can point and click.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset