“An invasion of Armies can be resisted, but not an idea whose time has come.”
- Victor Hugo
This query is easy to understand. It uses an IN List to find all Employees who are in Dept_No 100 or Dept_No 200.
What is going on with this IN List? Why in the world are their duplicates in there? Will this query even work? What will the result set look like? Duplicate values are ignored here. We got the same rows back as before, and it is as if the system ignored the duplicate values in the IN List. That is exactly what happened.
The query above is a Subquery which means there are multiple queries in the same SQL. The bottom query runs first, and its purpose is to build a distinct list of values that it passes to the top query. The top query then returns the result set. This query solves the problem: Show all Employees in Valid Departments!
The bottom query runs first and builds a distinct IN list. Then the top query runs using the list.
Both queries above are the same. Query 2 has values in an IN list. Query 1 runs a subquery to build the values in the IN list.
SELECT *FROM Employee_Table
WHERE Dept_No IN (SELECTDept_No FROM Department_Table) ;
How are Subqueries similar to Joins between two tables?
A great question was asked above. Do you know the key to answering? Turn the page!
How are Subqueries similar to Joins between two tables?
A Subquery between two tables or a Join between two tables will each need a common key
that represents the relationship. This is called a Primary Key/Foreign Key relationship.
A Subquery will use a common key linking the two tables together very similar to a join! When subquerying between two tables, look for the common link between the two tables. Most of the time they both have a column with the same name, but this is not always true.
If you only want to see a report where the final result set has only columns from one table, use a Subquery. Obviously, if you need columns on the report where the final result set has columns from both tables, you have to do a Join
Write the Subquery
Select all columns in the Customer_Table if the customer has placed an order!
Here is your opportunity to show how smart you are. Write a Subquery that will bring back everything from the Customer_Table if the customer has placed an order in the Order_Table. Good luck! Advice: Look for the common key among both tables!
The common key among both tables is Customer_Number. The bottom query runs first and delivers a distinct list of Customer_Number values which the top query uses in the IN List!
Write the Subquery
Select all columns in the Customer_Table if the customer
has placed an order over $10,000.00 Dollars!
Here is your opportunity to show how smart you are. Write a Subquery that will bring back everything from the Customer_Table if the customer has placed an order in the Order_Table that is greater than $10,000.00.
Here is your answer!
Write the Subquery
Select all columns in the Employee_Table if the employee makes
a greater Salary than the AVERAGE Salary.
Another opportunity knocking! Would someone please answer the query door?
Hadoop won't allow you to use a subquery. However, you can do a LEFT SEMI JOIN. A LEFT SEMI JOIN will only place the results from the left table. Use the right table (TeraTom) as a qualifier.
Write the Correlated Subquery
Select all columns in the Employee_Table if the employee makes a
greater Salary than the AVERAGE Salary (within their own Department).
Another opportunity knocking! This is a tough one, and only the best get this written correctly.
Hadoop doesn't yet support a correlated subquery so you must use a derived table instead.
Write the Subquery
Select all columns in the Customer_Table if the
Customer has NOT placed an order.
Another opportunity knocking! Write the above query!
When a NOT IN subquery encounters a NULL value it returns nothing. Since the bottom query is passing up the Customer_Number to the top query if there are NULL values in any Customer_Number the top query returns nothing. That is why we utilized the IS NOT NULL in the bottom WHERE clause. That clause eliminates any NULL values from being passed to the top.
Write the Subquery
Select order_number, customer_number, and order_total
from the Customer_Table if an order was placed in the
Order_Table that by a customer with ‘Bill’ anywhere in
their name.
On the quiz above, take the opportunity to demonstrate what you have learned and show your brilliance!
Great job on writing your query just like the above.
How many rows return from the query now that a
NULL value is in a Customer_Number?
We really didn’t place a new row inside the Order_Table with a NULL value for the Customer_Number column, but in theory, if we had, how many rows would return?
How many rows return from the query now that a
NULL value is in a Customer_Number?
ZERO rows will return
The answer is no rows come back. This is because when you have a NULL value in a NOT IN list, the system doesn’t know the value of NULL, so it returns nothing.
You can utilize a WHERE clause that tests to make sure Customer_Number IS NOT NULL. This should be used when a NOT IN could encounter a NULL.
Use EXISTS to find which Customers have placed an Order?
SELECTCustomer_Number, Customer_Name
FROMCustomer_Table as Top1
WHEREEXISTS
(SELECT *FROM Order_Table as Bot1
Where Top1.Customer_Number = Bot1.Customer_Number ) ;
The EXISTS command will determine via a Boolean if something is True or False. If a customer placed an order, it EXISTS. Using the Correlated Exists statement will bring back customers who have placed an order. EXISTS is different than IN as it is less restrictive as you will soon understand.
Only customers who placed an order return with the above Correlated EXISTS. It is the final WHERE clause in the subquery that co-relates the Customer_Number from the top query to the Customer_Number of the bottom query. They are correlated.
The NOT EXISTS command determines via a Boolean if something is True or False. If a customer has not placed an order, it does not EXIST. Using the Correlated NOT Exists statement, only customers who have NOT placed an order will return in the answer set. NOT EXISTS is often used because it returns data even if NULL values are encountered. NOT EXISTS is different than NOT IN as it handles NULL values without any problems.