“When spider webs unite they can tie up a lion.”
- African Proverb
1.Each Table in Hadoop could hold a portion of a table.
2.For two rows to be Joined together, Hadoop insists that both rows are physically on the same reducer.
3.Hadoop will either Redistribute one or both of the tables or Duplicate the smaller table across all reducers to ensure matching rows are on the same reducer. This is done only for the life of the Join.
The three options are a Shuffle, Map or Sort Merge-Bucket Join
Two joining rows have to be in the same memory of a single reducer. The three options are a Shuffle join, Map join or a Sort Merge-Bucket Join.
Above four customers are in the Customer_Table. I have color coded them for you to show how joins work. There are 16 orders in the Order_Total that have been placed by the four customers. The red customer has placed four orders, the pink customer has placed two orders, the green customer has placed five orders and the blue customer has placed five orders. To join the Customer_Table to the Order_Table, Hadoop will have to provide a shuffle join. Watch what happens on the next page.
Hadoop will shuffle the matching rows to the same node's memory in order to join the data. Hadoop uses a math formula (called a Hash Formula) to do this. They just run the Hash Formula on the Customer_Number of each order. This is the most expensive technique for getting the matching rows in the same node's memory.
SELECT
C.Customer_Number, C.Phone_Number ,C.Customer_Name
,O.Customer_Number, O.Order_Number, O.Order_Date, Order_Total
FROM Customer_Table as C
INNER JOIN
Order_Table as O
ON C.Customer_Number = O.Customer_Number ;
Hadoop can shuffle (temporarily) by re-hashing Customer_Number from the Order_Table and Customer_Table. Now, all joining rows will be on the same node's memory. That is one way to get matching rows together.
Hadoop loves when a table is small enough to fit completely in memory. Turn the page and be prepared to be amazed!
Hadoop took the Department_Table and gathered up all 4-rows (temporarily) and in memory Duplicated the entire 4-row Table across all nodes. Now the joins can happen! This is the fastest way to get rows together. If one table is much bigger than the other, Hadoop will duplicate the smaller table on all nodes, just for the life of the query. This is a map join.
These tables join on Customer_Number, so cluster them into
the same amount of buckets on the join key and speed
happens, because the matching rows are on the same node.
Hive tables can be organized into buckets, which will determine the way the underlying files are stored. Bucketing is excellent because joins perform faster when the cluster by uses the join columns and the same amount of buckets. Sampling is also more efficient because the data is already split up into smaller pieces. The hive formula is consistent because they hash the column value and then uses the modulo operator on the hash value.
Hadoop always moves data from disk into memory to process a query, but there will be no shuffling because the two tables already have the matching rows on the same node. This was done by using the bucket method from the previous page.
A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION is which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship, so this join will happen on matching Customer_Number columns.
A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION is which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship, so this join will happen on matching Customer_Number columns. ANSI joins use the keyword INNER JOIN with an ON Clause.
A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means from which column from each table establishes the relationship. Customer_Number is the joining column above.
A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship.
A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship.
A Join combines columns on the report from more than one table. The example above joins the Customer_Table and the Order_Table together. The most complicated part of any join is the JOIN CONDITION. The JOIN CONDITION means which Column from each table is a match. In this case, Customer_Number is a match that establishes the relationship.
Whenever a column is in both tables, you must fully qualify it when doing a join. You don't have to fully qualify tables that are only in one of the tables because the system knows which table that particular column is in. You can choose to fully qualify every column if you like. This is a good practice for anyone else looking at your SQL because it is more apparent which columns belong to which tables.
The only rows that return on an INNER JOIN come back because both tables have a matching Customer_Number in both tables. Notice that we had to fully qualify the column Customer_Number because it is in both tables.
Finish this join by placing the missing SQL in the proper place!
This query is ready to run.
This query is ready to run.
This query is usually a mistake. There are five customers who have placed six orders and that is why 30 rows came back. A Cartesian Product join will join every row from both tables together so 30 rows returned (6 * 5 =30).
This query has an error! Can you find it?
If a column in the SELECT list is in both tables, you must fully qualify it.
This query has an error! Can you find it?
If a column in the SELECT list is in both tables, you must fully qualify it. Yet, once you alias a table you cannot use the full tablename again. The system thinks there are three tables now (E, D and Employee_Table) and it errors.
An Inner Join returns matching rows. There are three rows in the tables above that will not return on an INNER JOIN. We can only get them to return in a join by using an Outer Join. You will understand soon!
The bottom line is that the three rows excluded did not have a matching Dept_No.
This is a LEFT OUTER JOIN. That means that all rows from the LEFT Table will appear in the report regardless if it finds a match on the right table.
A LEFT Outer Join Returns all rows from the LEFT Table including all Matches. If a LEFT row can’t find a match, a NULL is placed on right columns not found!
Have you ever seen a LEFT SEMI JOIN? This can be done to replace a subquery.
The common key among both tables is Customer_Number. The LEFT SEMI JOIN in essence accomplishes what a subquery does, but in a join fashion. Hive does not support IN, EXISTS or subqueries in the WHERE clause. You need to write subqueries as a join. A LEFT SEMI JOIN is efficient because no data from the RIGHT table is returned in the answer set.
This is a RIGHT OUTER JOIN. That means that all rows from the RIGHT Table will appear in the report regardless if it finds a match with the LEFT Table.
All rows from the Right Table were returned with matches, but since Dept_No 500 didn’t have a match, the system put a NULL Value for Left Column values.
The is a FULL OUTER JOIN. This means that all rows from both the RIGHT and LEFT Table will appear in the report regardless if it finds a match.
The FULL Outer Join Returns all rows from both Tables. NULLs show the flaws! Both tables are considered outer tables.
Can you list which tables above are left tables and which tables are right tables?
The first table is always the left table and the rest are right tables. The results from the first two tables being joined becomes the left table.
The additional WHERE is performed first in order to eliminate unwanted data, so the join is less intensive than joining everything first and then eliminating rows that don't qualify.
The additional AND is performed first in order to eliminate unwanted data, so the join is less intensive than joining rows first and then eliminating rows that don't qualify. An additional AND or WHERE clause is equivalent in inner joins.
The additional WHERE is performed last on Outer Joins. All rows will be joined first and then the additional WHERE clause filters after the join takes place.
The additional AND is performed in conjunction with the ON statement on Outer Joins. All rows will be evaluated with the ON clause and the AND combined.
The additional AND is performed in conjunction with the ON statement on Outer Joins. This can surprise you. Only Mandee is in Dept_No 100, so she showed up like expected, but an outer join returns non-matches also. Ouch!!!
This is considered an INNER JOIN because we are doing a LEFT OUTER JOIN on the Employee_Table and then filtering with the WHERE for a column in the right table!
SELECT Cou.*, STU1.*
FROM COURSE_TABLE Cou
LEFT OUTER JOIN
STUDENT_COURSE_TABLE STU
ON Cou.Course_Id = STU.Course_Id
LEFT OUTER JOIN STUDENT_TABLESTU1
ON STU.Student_Id = STU1.Student_Id;
When you perform an inner join, Hadoop considers this to be both commutative and associative. That means that two tables being inner joined will easily come up with the intended answer. This allows the optimizer to select the best join order between tables. This is because the end result will be the same. Outer Joins are different. They will follow the above three rules for evaluation order by the optimizer.
Cartesian Product Joins are usually a mistake, but sometimes it is done to show all possible combinations for stress testing a system for performance tests.
This query becomes a Product Join because a Cross Join is an ANSI Product Join. It will compare every row from the Customer_Table to Order_Number 123456 in the Order_Table. Check out the Answer Set on the next page.
This Cross Join produces information that quite often isn’t worth anything, but sometimes it is used for comparisons. For example, you might want to compare the flight costs at your local airport with flight costs from every other airline.
This query becomes a Product Join because a Cross Join is an ANSI Product Join. It will compare every row from the Customer_Table to Order_Number 123456 in the Order_Table. Check out the Answer Set on the next page.
This Cross Join produces information that quite often isn’t worth anything, but sometimes it is used for comparisons. For example, you might want to compare the flight costs at your local airport with flight costs from every other airline.
A Self Join gives itself two different aliases, which is then seen as two different tables.
How would you join these two tables together? You can’t do it. There is no matching column with like data. There is no Primary Key/Foreign Key relationship between these two tables. That is why you are about to be introduced to a bridge table. It is formally called an Associative table or a Lookup table.
The Associative Table is a bridge between the Course_Table and Student_Table.
SELECT ALL Columns from the Course_Table and Student_Table and Join them.
Select S.*, C.*
From Student_Table S
INNER JOIN
Student_Course_Table SC
ON S.Student_ID = SC.Student_ID
INNER JOIN
Course_Table C
ON C.Course_ID = SC.Course_ID;
The above queries show both traditional and ANSI form for this three table join.
Please re-write the above query and place both ON Clauses at the end.
This is tricky. The only way it works is to place the ON clauses backwards. The ON Clause represents the last INNER JOIN and then moves backwards. Then you add the AND clause. Wow!
Above is the logical model for the insurance tables showing the Primary Key and Foreign Key relationships (PK/FK).
Your mission is to write a five table join selecting all columns using ANSI syntax.
SELECT
cla1.*, sub1.*, add1.* ,pro1.*, ser1.*
FROMCLAIMS cla1
INNER JOIN
SUBSCRIBERS sub1
ONcla1.Subscriber_No = sub1.Subscriber_No
ANDcla1.Member_No = sub1.Member_No
INNER JOIN
ADDRESSES add1
ONsub1.Subscriber_No = add1.Subscriber_No
INNER JOIN
PROVIDERS pro1
ONcla1.Provider_No = pro1.Provider_Code
INNER JOIN
SERVICES ser1
ONcla1.Claim_Service = ser1.Service_Code ;
Above is the example writing this five table join using ANSI syntax.
SELECT
cla1.*, sub1.*, add1.* ,pro1.*, ser1.*
FROMCLAIMS cla1
INNER JOIN
SUBSCRIBERS sub1
ONcla1.Subscriber_No = sub1.Subscriber_No
ANDcla1.Member_No = sub1.Member_No
INNER JOIN
ADDRESSES add1
ONsub1.Subscriber_No = add1.Subscriber_No
INNER JOIN
PROVIDERS pro1
ONcla1.Provider_No = pro1.Provider_Code
INNER JOIN
SERVICES ser1
ONcla1.Claim_Service = ser1.Service_Code ;
Above is the example writing this five table join using Non-ANSI syntax.
SELECT
cla1.*, sub1.*, add1.* ,pro1.*, ser1.*
FROMPROVIDERS pro1
INNER JOIN SERVICES ser1
INNER JOIN ADDRESSES add1
INNER JOIN SUBSCRIBERS sub1
INNER JOIN CLAIMS cla1
ON cla1.Subscriber_No = sub1.Subscriber_No
AND cla1.Member_No = sub1.Member_No
AND sub1.Subscriber_No = add1.Subscriber_No
AND cla1.Claim_Service = ser1.Service_Code
AND cla1.Provider_No = pro1.Provider_Code
Above is the example writing this five table join using ANSI syntax with the ON clauses at the end. Additionally, we had to move the tables around to make this happen. Notice that the first ON clause represents the last two tables being joined (Claims and Subscribers), and then the next AND clause works backwards to the next table (Addresses). The key is to remember that the first ON/AND clauses represent the last tables mentioned in the INNER JOIN statement.
Let Nexus show users the table relationships and then let Nexus build the SQL. Just load the ERwin logical model inside Nexus and then all users can point and click.