Chapter 6 – Join Functions

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 6 – Join Functions

“When spider webs unite they can tie up a lion.”

- African Proverb

Hadoop Joins

1.Each Table in Hadoop could hold a portion of a table.

2.For two rows to be Joined together, Hadoop insists that both rows are physically on the same reducer.

3.Hadoop will either Redistribute one or both of the tables or Duplicate the smaller table across all reducers to ensure matching rows are on the same reducer. This is done only for the life of the Join.

The three options are a Shuffle, Map or Sort Merge-Bucket Join

Two joining rows have to be in the same memory of a single reducer. The three options are a Shuffle join, Map join or a Sort Merge-Bucket Join.

The Shuffle Join (1 of 2)

Above four customers are in the Customer_Table. I have color coded them for you to show how joins work. There are 16 orders in the Order_Total that have been placed by the four customers. The red customer has placed four orders, the pink customer has placed two orders, the green customer has placed five orders and the blue customer has placed five orders. To join the Customer_Table to the Order_Table, Hadoop will have to provide a shuffle join. Watch what happens on the next page.

The Shuffle Join (2 of 2)

Hadoop will shuffle the matching rows to the same node's memory in order to join the data. Hadoop uses a math formula (called a Hash Formula) to do this. They just run the Hash Formula on the Customer_Number of each order. This is the most expensive technique for getting the matching rows in the same node's memory.

Shuffle

SELECT

C.Customer_Number, C.Phone_Number ,C.Customer_Name

,O.Customer_Number, O.Order_Number, O.Order_Date, Order_Total

FROM Customer_Table as C

INNER JOIN

Order_Table as O

ON C.Customer_Number = O.Customer_Number ;

Hadoop can shuffle (temporarily) by re-hashing Customer_Number from the Order_Table and Customer_Table. Now, all joining rows will be on the same node's memory. That is one way to get matching rows together.

Map Strategy

Hadoop loves when a table is small enough to fit completely in memory. Turn the page and be prepared to be amazed!

Duplication of the Smaller Table across All-AMPs

Hadoop took the Department_Table and gathered up all 4-rows (temporarily) and in memory Duplicated the entire 4-row Table across all nodes. Now the joins can happen! This is the fastest way to get rows together. If one table is much bigger than the other, Hadoop will duplicate the smaller table on all nodes, just for the life of the query. This is a map join.

Using Buckets for Table Joins

These tables join on Customer_Number, so cluster them into
the same amount of buckets on the join key and speed
happens, because the matching rows are on the same node.

Hive tables can be organized into buckets, which will determine the way the underlying files are stored. Bucketing is excellent because joins perform faster when the cluster by uses the join columns and the same amount of buckets. Sampling is also more efficient because the data is already split up into smaller pieces. The hive formula is consistent because they hash the column value and then uses the modulo operator on the hash value.

Sort-Merge Bucket Join Needs No Shuffling or Duplication

Hadoop always moves data from disk into memory to process a query, but there will be no shuffling because the two tables already have the matching rows on the same node. This was done by using the bucket method from the previous page.

A Two-Table Join Using Traditional Join Syntax

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION is which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship, so this join will happen on matching Customer_Number columns.

A Two-Table Join Using ANSI Syntax

Traditional Join Using a Table Alias

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means from which column from each table establishes the relationship. Customer_Number is the joining column above.

ANSI Join Using a Table Alias

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship.

ANSI Join Using a Table Alias with Keyword AS

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship.

ANSI Join Using the Keyword JOIN Instead of INNER JOIN

A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship.

You Can Fully Qualify All Columns for Clarity

Whenever a column is in both tables, you must fully qualify it when doing a join. You don't have to fully qualify tables that are only in one of the tables because the system knows which table that particular column is in. You can choose to fully qualify every column if you like. This is a good practice for anyone else looking at your SQL because it is more apparent which columns belong to which tables.

A Two-Table Join in Action

The only rows that return on an INNER JOIN come back because both tables have a matching Customer_Number in both tables. Notice that we had to fully qualify the column Customer_Number because it is in both tables.

Quiz – Can You Finish the Join Syntax?

Finish this join by placing the missing SQL in the proper place!

Answer to Quiz – Can You Finish the Join Syntax?

This query is ready to run.

Another Way to Write a Join

This query is ready to run.

A Cartesian Product Join

This query is usually a mistake. There are five customers who have placed six orders and that is why 30 rows came back. A Cartesian Product join will join every row from both tables together so 30 rows returned (6 * 5 =30).

Quiz – Can You Find the Error?

This query has an error! Can you find it?

Answer to Quiz – Can You Find the Error?

If a column in the SELECT list is in both tables, you must fully qualify it.

Super Quiz – Can You Find the Difficult Error?

This query has an error! Can you find it?

Answer to Super Quiz – Can You Find the Difficult Error?

If a column in the SELECT list is in both tables, you must fully qualify it. Yet, once you alias a table you cannot use the full tablename again. The system thinks there are three tables now (E, D and Employee_Table) and it errors.

Quiz – Which rows from both tables Won’t Return?

An Inner Join returns matching rows. There are three rows in the tables above that will not return on an INNER JOIN. We can only get them to return in a join by using an Outer Join. You will understand soon!

Answer to Quiz – Which rows from both tables Won’t Return?

The bottom line is that the three rows excluded did not have a matching Dept_No.

LEFT OUTER JOIN

This is a LEFT OUTER JOIN. That means that all rows from the LEFT Table will appear in the report regardless if it finds a match on the right table.

LEFT OUTER JOIN Results

A LEFT Outer Join Returns all rows from the LEFT Table including all Matches. If a LEFT row can’t find a match, a NULL is placed on right columns not found!

A LEFT SEMI JOIN Replaces a Subquery

Have you ever seen a LEFT SEMI JOIN? This can be done to replace a subquery.

What is a LEFT SEMI JOIN?

The common key among both tables is Customer_Number. The LEFT SEMI JOIN in essence accomplishes what a subquery does, but in a join fashion. Hive does not support IN, EXISTS or subqueries in the WHERE clause. You need to write subqueries as a join. A LEFT SEMI JOIN is efficient because no data from the RIGHT table is returned in the answer set.

RIGHT OUTER JOIN

This is a RIGHT OUTER JOIN. That means that all rows from the RIGHT Table will appear in the report regardless if it finds a match with the LEFT Table.

RIGHT OUTER JOIN Example and Results

All rows from the Right Table were returned with matches, but since Dept_No 500 didn’t have a match, the system put a NULL Value for Left Column values.

FULL OUTER JOIN

The is a FULL OUTER JOIN. This means that all rows from both the RIGHT and LEFT Table will appear in the report regardless if it finds a match.

FULL OUTER JOIN Results

The FULL Outer Join Returns all rows from both Tables. NULLs show the flaws! Both tables are considered outer tables.

Which Tables are the Left and Which are the Right?

Can you list which tables above are left tables and which tables are right tables?

Answer - Which Tables are the Left and Which are the Right?

The first table is always the left table and the rest are right tables. The results from the first two tables being joined becomes the left table.

INNER JOIN with an Additional WHERE Clause

The additional WHERE is performed first in order to eliminate unwanted data, so the join is less intensive than joining everything first and then eliminating rows that don't qualify.

INNER JOIN with an Additional AND Clause

The additional AND is performed first in order to eliminate unwanted data, so the join is less intensive than joining rows first and then eliminating rows that don't qualify. An additional AND or WHERE clause is equivalent in inner joins.

OUTER JOIN with Additional WHERE Clause

The additional WHERE is performed last on Outer Joins. All rows will be joined first and then the additional WHERE clause filters after the join takes place.

OUTER JOIN with Additional AND Clause

The additional AND is performed in conjunction with the ON statement on Outer Joins. All rows will be evaluated with the ON clause and the AND combined.

OUTER JOIN with Additional AND Clause Results

The additional AND is performed in conjunction with the ON statement on Outer Joins. This can surprise you. Only Mandee is in Dept_No 100, so she showed up like expected, but an outer join returns non-matches also. Ouch!!!

Quiz – Why is this Considered an INNER JOIN?

This is considered an INNER JOIN because we are doing a LEFT OUTER JOIN on the Employee_Table and then filtering with the WHERE for a column in the right table!

Evaluation Order for Outer Queries

SELECT Cou.*, STU1.*

FROM COURSE_TABLE Cou

LEFT OUTER JOIN

STUDENT_COURSE_TABLE STU

ON Cou.Course_Id = STU.Course_Id

LEFT OUTER JOIN STUDENT_TABLESTU1

ON STU.Student_Id = STU1.Student_Id;

When you perform an inner join, Hadoop considers this to be both commutative and associative. That means that two tables being inner joined will easily come up with the intended answer. This allows the optimizer to select the best join order between tables. This is because the end result will be the same. Outer Joins are different. They will follow the above three rules for evaluation order by the optimizer.

Cartesian Product Join

Cartesian Product Joins are usually a mistake, but sometimes it is done to show all possible combinations for stress testing a system for performance tests.

The CROSS JOIN with a WHERE Clause

This query becomes a Product Join because a Cross Join is an ANSI Product Join. It will compare every row from the Customer_Table to Order_Number 123456 in the Order_Table. Check out the Answer Set on the next page.

The CROSS JOIN with a WHERE Clause Answer Set

This Cross Join produces information that quite often isn’t worth anything, but sometimes it is used for comparisons. For example, you might want to compare the flight costs at your local airport with flight costs from every other airline.

The CROSS JOIN with an ON Clause

The CROSS JOIN with an ON Clause Answer Set

The Self Join

A Self Join gives itself two different aliases, which is then seen as two different tables.

How would you Join these two tables?

How would you join these two tables together? You can’t do it. There is no matching column with like data. There is no Primary Key/Foreign Key relationship between these two tables. That is why you are about to be introduced to a bridge table. It is formally called an Associative table or a Lookup table.

An Associative Table is a Bridge that Joins Two Tables

The Associative Table is a bridge between the Course_Table and Student_Table.

Quiz – Can you Write the 3-Table Join?

SELECT ALL Columns from the Course_Table and Student_Table and Join them.

Answer – Can you Write the 3-Table Join?

Select S.*, C.*

From Student_Table S

INNER JOIN

Student_Course_Table SC

ON S.Student_ID = SC.Student_ID

INNER JOIN

Course_Table C

ON C.Course_ID = SC.Course_ID;

The above queries show both traditional and ANSI form for this three table join.

Quiz – Can you Place the ON Clauses at the End?

Please re-write the above query and place both ON Clauses at the end.

Answer – Can you Place the ON Clauses at the End?

This is tricky. The only way it works is to place the ON clauses backwards. The ON Clause represents the last INNER JOIN and then moves backwards. Then you add the AND clause. Wow!

The 5-Table Join – Logical Insurance Model

Above is the logical model for the insurance tables showing the Primary Key and Foreign Key relationships (PK/FK).

Quiz - Write a Five Table Join Using ANSI Syntax

Your mission is to write a five table join selecting all columns using ANSI syntax.

Answer - Write a Five Table Join Using ANSI Syntax

SELECT

cla1.*, sub1.*, add1.* ,pro1.*, ser1.*

FROMCLAIMS cla1

INNER JOIN

SUBSCRIBERS sub1

ONcla1.Subscriber_No = sub1.Subscriber_No

ANDcla1.Member_No = sub1.Member_No

INNER JOIN

ADDRESSES add1

ONsub1.Subscriber_No = add1.Subscriber_No

INNER JOIN

PROVIDERS pro1

ONcla1.Provider_No = pro1.Provider_Code

INNER JOIN

SERVICES ser1

ONcla1.Claim_Service = ser1.Service_Code ;

Above is the example writing this five table join using ANSI syntax.

Quiz –Re-Write this putting the ON clauses at the END

SELECT

cla1.*, sub1.*, add1.* ,pro1.*, ser1.*

FROMCLAIMS cla1

INNER JOIN

SUBSCRIBERS sub1

ONcla1.Subscriber_No = sub1.Subscriber_No

ANDcla1.Member_No = sub1.Member_No

INNER JOIN

ADDRESSES add1

ONsub1.Subscriber_No = add1.Subscriber_No

INNER JOIN

PROVIDERS pro1

ONcla1.Provider_No = pro1.Provider_Code

INNER JOIN

SERVICES ser1

ONcla1.Claim_Service = ser1.Service_Code ;

Above is the example writing this five table join using Non-ANSI syntax.

Answer –Re-Write this putting the ON clauses at the END

SELECT

cla1.*, sub1.*, add1.* ,pro1.*, ser1.*

FROMPROVIDERS pro1

INNER JOIN SERVICES ser1

INNER JOIN ADDRESSES add1

INNER JOIN SUBSCRIBERS sub1

INNER JOIN CLAIMS cla1

ON cla1.Subscriber_No = sub1.Subscriber_No

AND cla1.Member_No = sub1.Member_No

AND sub1.Subscriber_No = add1.Subscriber_No

AND cla1.Claim_Service = ser1.Service_Code

AND cla1.Provider_No = pro1.Provider_Code

Above is the example writing this five table join using ANSI syntax with the ON clauses at the end. Additionally, we had to move the tables around to make this happen. Notice that the first ON clause represents the last two tables being joined (Claims and Subscribers), and then the next AND clause works backwards to the next table (Addresses). The key is to remember that the first ON/AND clauses represent the last tables mentioned in the INNER JOIN statement.

The Nexus Query Chameleon Writes the SQL for Users.

Let Nexus show users the table relationships and then let Nexus build the SQL. Just load the ERwin logical model inside Nexus and then all users can point and click.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 6 – Join Functions

Create new playlist

Sign In

Sign Up

Chapter 6 – Join Functions

Hadoop Joins

The Shuffle Join (1 of 2)

The Shuffle Join (2 of 2)

Shuffle

Map Strategy

Duplication of the Smaller Table across All-AMPs

Using Buckets for Table Joins

Sort-Merge Bucket Join Needs No Shuffling or Duplication

A Two-Table Join Using Traditional Join Syntax

A Two-Table Join Using ANSI Syntax

Traditional Join Using a Table Alias

ANSI Join Using a Table Alias

ANSI Join Using a Table Alias with Keyword AS

ANSI Join Using the Keyword JOIN Instead of INNER JOIN

You Can Fully Qualify All Columns for Clarity

A Two-Table Join in Action

Quiz – Can You Finish the Join Syntax?

Answer to Quiz – Can You Finish the Join Syntax?

Another Way to Write a Join

A Cartesian Product Join

Quiz – Can You Find the Error?

Answer to Quiz – Can You Find the Error?

Super Quiz – Can You Find the Difficult Error?

Answer to Super Quiz – Can You Find the Difficult Error?

Quiz – Which rows from both tables Won’t Return?

Answer to Quiz – Which rows from both tables Won’t Return?

LEFT OUTER JOIN

LEFT OUTER JOIN Results

A LEFT SEMI JOIN Replaces a Subquery

What is a LEFT SEMI JOIN?

RIGHT OUTER JOIN

RIGHT OUTER JOIN Example and Results

FULL OUTER JOIN

FULL OUTER JOIN Results

Which Tables are the Left and Which are the Right?

Answer - Which Tables are the Left and Which are the Right?

INNER JOIN with an Additional WHERE Clause

INNER JOIN with an Additional AND Clause

OUTER JOIN with Additional WHERE Clause

OUTER JOIN with Additional AND Clause

OUTER JOIN with Additional AND Clause Results

Quiz – Why is this Considered an INNER JOIN?

Evaluation Order for Outer Queries

Cartesian Product Join

The CROSS JOIN with a WHERE Clause

The CROSS JOIN with a WHERE Clause Answer Set

The CROSS JOIN with an ON Clause

The CROSS JOIN with an ON Clause Answer Set

The Self Join

How would you Join these two tables?

An Associative Table is a Bridge that Joins Two Tables

Quiz – Can you Write the 3-Table Join?

Answer – Can you Write the 3-Table Join?

Quiz – Can you Place the ON Clauses at the End?

Answer – Can you Place the ON Clauses at the End?

The 5-Table Join – Logical Insurance Model

Quiz - Write a Five Table Join Using ANSI Syntax

Answer - Write a Five Table Join Using ANSI Syntax

Quiz –Re-Write this putting the ON clauses at the END

Answer –Re-Write this putting the ON clauses at the END

The Nexus Query Chameleon Writes the SQL for Users.

Table of Contents for
Chapter 6 – Join Functions