Now that we have seen how two or more tables can be combined in a join query, we turn our attention to another type of complex query known as a subquery. A subquery is a query expression that is nested within another query expression. Its purpose is to have the inner query produce a single value or multiple values that can then be passed into the outer query for processing. You achieve this by embedding a SELECT statement inside a WHERE clause of an outer query’s SELECT statement, INSERT statement, DELETE statement, or HAVING clause.
Note: | You should avoid nesting more than two subqueries deep because of the conceptual and processing complexities this introduces. |
The typical subquery consists of a (inner) query combined inside the predicate of another (outer or main) query. When processed, the inner query passes a Boolean value to the outer query consisting of either True if it returns a minimum of one row or False if no rows are returned by the subquery. The results of the inner query are stored in a temporary results table and used as input to the main query. Our exploration of subqueries will involve using them with comparison operators, the IN predicate, and the ANY and ALL keywords, and will conclude with a look at a special type of subquery called a correlated subquery.
A subquery is a very useful construct, especially when information from multiple tables needs to be interrelated. Unfortunately, a subquery is not always easy to construct and may even be more difficult to understand. So before constructing every table relation with a subquery, consider your options carefully.
When all the information is available in a single table, a simple query is probably all that needs to be constructed. Suppose you want to produce a report consisting of the invoice information for Global Comm Corp. Let’s further assume you know the specific manufacturer number for Global Comm Corp as well. Knowing this means that you don’t have to go into the MANUFACTURERS table to find it. In the next example, a simple query is constructed to retrieve all invoice information from the INVOICE table.
PROC SQL; SELECT * FROM INVOICE WHERE manunum = 210; QUIT;
Results
The SAS System Invoice Quantity Invoice Manufacturer Customer - Units Invoice Unit Product Number Number Number Sold Price Number ________________________________________________ __________________ 1003 210 101 7 $245.00 2101 |
But what if all the information is not in a single table? And what if the manufacturer number for Global Comm Corp is not known? As shown earlier, a join can be constructed just as easily as a subquery. Some users prefer joins to subqueries because they can be easier to understand as well as maintain. In fact, a join frequently performs better than a subquery. In the next example, the manufacturer number for Global Comm Corp is not known. Consequently, a simple inner join is needed to retrieve all related rows from the MANUFACTURERS and INVOICE tables for Global Comm Corp.
Simple Join
PROC SQL; SELECT M.manunum, M.manuname, I.invnum, I.invqty, I.invprice FROM MANUFACTURERS M, INVOICE I WHERE M.manunum = I.manunum AND M.manuname = 'Global Comm Corp'; QUIT;
The SAS System Invoice Quantity Manufacturer Invoice - Units Invoice Unit Number Manufacturer Name Number Sold Price _____________________________________________ _________________________ 210 Global Comm Corp 1003 7 $245.00 |
Now let’s see how a subquery could be constructed to provide the same results as with the join. As before, suppose you want to pull all the invoices for the manufacturer Global Comm Corp but only know the manufacturer name (or at least part of the name), but not the manufacturer number (MANUNUM). The following subquery uses an = (equal sign) in its outer query WHERE clause to accomplish this.
Since the manufacturer number is not known, a subquery is constructed to first search for it in the MANUFACTURERS table. Actually, the subquery approach is more versatile than the previous query approach, because it does not require a unique manufacturer number, which is often more difficult to remember than names. It also enables quick searches even if the manufacturer number changes for a given manufacturer.
When the entire query is executed, SQL first evaluates the inner query (or subquery) within the outer query’s WHERE clause. It executes the inner query the same way as if it were a stand-alone query. It searches the MANUFACTURERS table for any row where the manufacturer name equals the character string Global Comm Corp and then pulls the MANUNUM values for this row. SQL then substitutes the derived MANUNUM value of 210 from the inner query inside the predicate of the main query (outer query). As a result of this substitution, the SQL query looks identical to the query mentioned previously.
SQL Code
PROC SQL; SELECT invnum, INVOICE.manunum, custnum, invqty, invprice, prodnum FROM INVOICE WHERE manunum = (SELECT manunum 1 FROM MANUFACTURERS WHERE manuname = 'Global Comm Corp'), WHERE INVOICE.manunum=MANUFACTURERS.manunum; QUIT;
Result of Inner Query
PROC SQL;
SELECT *
FROM INVOICE
WHERE manunum = 210; 2
QUIT;
| PROC SQL evaluates the inner query within the outer query’s WHERE clause to search for the manufacturer number for manufacturer Global Comm Corp. |
| The resulting query after substituting the derived manufacturer number value from the inner query evaluates to a single value and is then executed as the main (outer) query. |
Results
The SAS System
Invoice
2 Quantity
Invoice
Invoice Manufacturer Customer - Units
Total Product
Number Number Number Sold
Price Number
__________________________________________________
_______________
1003 210 101 7
$245.00 2101 |
Let’s look at another subquery. Suppose you want to retrieve the invoice from the INVOICE table for the manufacturer that manufactures the Dream Machine workstation. The following subquery (inner query) extracts the product number (PRODNUM) associated with the Dream Machine and passes the single value to the outer query for processing.
SQL Code
PROC SQL; SELECT invnum, manunum, custnum, invqty, invprice, INVOICE.prodnum FROM INVOICE (SELECT prodnum 1 FROM PRODUCTS WHERE prodname LIKE 'Dream%') WHERE INVOICE.prodnum=PRODUCTS.prodnum; QUIT;
Result of Inner Query
PROC SQL;
SELECT *
FROM INVOICE
WHERE prodnum = 1110; 2
QUIT; |
| PROC SQL evaluates the inner query within the outer query’s WHERE clause to search for the product number for the product Dream Machine. |
| The resulting inner query after substituting the derived product number value evaluates to a single value and is then executed as the main (outer) query. |
The SAS System
Invoice
Quantity
2
Invoice Manufacturer Customer - Units
Invoice Unit Product
Number Number Number Sold
Price Number
________________________________________________
________________
1004 111 501 3 $9
,600.00 1110 |
It is fortunate that our subquery in the previous example passed only one row or value to the main (outer) query. Had it returned more than one value from the PRODUCTS table, it would have made it impossible for the SQL to evaluate the condition as true or false and would have produced an error in the outer query. Let’s look at another example where more than one value is returned by the subquery.
In the next example, more than one row is returned by the inner query making it impossible for the main query to evaluate as true or false. As a result, an error is produced and the subquery does not execute. In general, it is best to avoid using the = (equal sign) and other comparison operators (<, >, <=, >=, and <>) in a subquery expression, unless you know in advance that the result of the subquery is a table with a single row of data (although it may not always be possible to know this beforehand). In a later section (“Passing More Than One Row with a Subquery”), you will see this problem alleviated by using the IN predicate.
SQL Code
PROC SQL; SELECT * FROM INVOICE WHERE manunum = (SELECT manunum FROM MANUFACTURERS WHERE UPCASE(manucity) LIKE 'SAN DIEGO%'), QUIT;
PROC SQL; SELECT * FROM INVOICE WHERE manunum = (SELECT manunum FROM MANUFACTURERS WHERE UPCASE(manucity) LIKE 'SAN DIEGO%'), ERROR: Subquery evaluated to more than one row. QUIT; NOTE: The SAS System stopped processing this step because of errors. NOTE: PROCEDURE SQL used: real time 0.00 seconds |
Let’s look at another subquery example that uses the comparison operator < (less than). A summary function specified in an inner query forces a single row to result. In the next example, the subquery uses the AVG summary (aggregate) function to determine which products (based on their invoice quantities) were purchased in lower quantities than the average product purchase.
SQL Code
PROC SQL; SELECT prodnum, invnum, invqty, invprice FROM INVOICE WHERE invqty < (SELECT AVG(invqty) 1 FROM INVOICE); QUIT;
Result of Inner Query
PROC SQL;
SELECT prodnum, invnum, invqty, invprice
FROM INVOICE
WHERE invqty < 4.285714; 2
QUIT; |
Results
The SAS System
2
Invoice
Quantity
Product Invoice - Units Invoice Unit
Number Number Sold Price
____________________________________________
6001 1002 2 $1,598.00
1110 1004 3 $9,600.00
5002 1005 2 $798.00
6000 1006 4 $396.00 |
To prevent the problem associated with passing more than one value to the main (outer) query, you can specify the IN predicate in a subquery. Similar to the IN operator in the DATA step, the IN predicate permits the SQL procedure to pass multiple row values from the (inner) subquery to the main (outer) query without producing an error.
Note: | PROC SQL does not permit a subquery to select more than one column. The next example shows how multiple row values are passed from the subquery to the main (outer) query using the IN predicate for San Diego manufacturers. |
SQL Code
PROC SQL; SELECT * FROM INVOICE WHERE manunum IN 1 (SELECT manunum FROM MANUFACTURERS WHERE UPCASE(manucity) LIKE 'SAN DIEGO%'), 2 QUIT;
Result of Inner Query
PROC SQL;
SELECT prodnum, invnum, invqty, invprice
FROM INVOICE
WHERE manunum IN (210, 500, 700); 3
QUIT; |
| PROC SQL’s IN predicate is specified in the outer query to process a list of values that are passed from the inner query. |
| PROC SQL evaluates the inner query within the outer query’s WHERE clause to produce a list of manufacturer numbers for San Diego manufacturers. |
| The resulting inner query passes multiple row values to the main (outer) query for execution. |
The SAS System
Invoice
3 Quantity
Invoice
Invoice Manufacturer Customer - Units
Total Product
Number Number Number Sold
Price Number
________________________________________________
_________________
1001 500 201 5 $1
,495.00 5001
1003 210 101 7
$245.00 2101
1005 500 801 2
$798.00 5002
1006 500 901 4
$396.00 6000
1007 500 401 7 $23
,100.00 1200 |
A subquery can have multiple values returned for a single column to the outer query. But there are special keywords that permit comparison operators to be used in subqueries to process multiple values. The special keywords ANY and ALL can be used to compare a set of values returned by a subquery. Let’s see how these keywords work.
Suppose you want to view the products whose inventory quantity is greater than or equal to the lowest average inventory quantity. The following example illustrates a subquery with the ANY keyword specified in the WHERE clause of the main query expression. When ANY is specified, the entire WHERE clause is true if the subquery returns at least one value.
SQL Code
PROC SQL; SELECT manunum, prodnum, invqty, invprice FROM INVOICE WHERE invprice GE ANY 1 (SELECT invprice FROM INVOICE WHERE prodnum IN (5001,5002)); 2 QUIT;
Result of Inner Query
PROC SQL; SELECT manunum, prodnum, invqty, invprice FROM INVOICE WHERE invprice > ANY ($1,495.,$798.); 3 QUIT; |
| PROC SQL retrieves any invoices from the outer query where the invoice price is greater than or equal to the row values passed from the inner query. |
| The WHERE clause of the inner query retrieves any invoice prices for product numbers 5001 and 5002 and passes them to the outer query. |
| The resulting inner query passes multiple row values to the main (outer) query for execution. |
Results
The SAS System Invoice Quantity Manufacturer Product - Units Invoice Unit Number Number Sold Price _____________________________________________ _____ 500 5001 5 $1 ,495.00 600 6001 2 $1 ,598.00 111 1110 3 $9 ,600.00 500 5002 2 $798.00 500 1200 7 $23 ,100.00 |
The ALL keyword works very differently from the ANY keyword. When you specify ALL before a subquery expression, the subquery is true only if the comparison is true for values returned by the subquery. For example, suppose you want to view the products whose inventory quantity is less than the average inventory quantity.
SQL Code
PROC SQL; SELECT manunum, prodnum, invqty, invprice FROM INVOICE WHERE invprice < ALL 1 (SELECT invprice FROM INVOICE WHERE prodnum IN (5001,5002)); 2 QUIT;
Result of Inner Query
PROC SQL; SELECT manunum, prodnum, invqty, invprice FROM INVOICE WHERE invprice < ALL ($1,495.,$798.); 3 QUIT; |
| PROC SQL retrieves all invoices from the outer query where the invoice price is less than the row values passed from the inner query. |
| The WHERE clause of the inner query retrieves all invoice prices for product numbers 5001 and 5002 and passes them to the outer query. |
| The resulting inner query passes multiple row values to the main (outer) query for execution. |
Results
The SAS System Invoice Quantity Manufacturer Product - Units Invoice Unit Number Number Sold Price ________________________________________ ________ 210 2101 7 $245.00 500 6000 4 $396.00 |
In the subquery examples shown so far, the subquery (inner query) operates independently from the main (outer) query. Essentially the subquery’s results are evaluated and used as input to the main (outer) query. Although this is a common way subqueries execute, it is not the only way. SQL also permits a subquery to accept one or more values from its outer query. Once the subquery executes, the results are then passed to the outer query. Subqueries of this variety are called correlated subqueries. The ability to construct subqueries in this manner provides a powerful extension to SQL.
The difference between the subqueries discussed earlier and correlated subqueries is in the way the WHERE clause is constructed. Correlated subqueries relate a column in the subquery with a column in the outer query to determine the rows that match or in certain cases don’t match the expression. Suppose, for example, that we want to view products in the PRODUCTS table that do not appear in the INVOICE table. One way to do this is to construct a correlated subquery.
In the next example, the subquery compares the product number column in the PRODUCTS table with the product number column in the INVOICE table. If at least one match is found (the product appears in both the PRODUCTS and INVOICE tables) then the resulting table from the subquery will not be empty, and the NOT EXISTS condition will be false. However, if no matches are found, then the subquery returns an empty table resulting in the NOT EXISTS condition being true, causing the product number, product name, and product type of the current row in the main (outer) query to be selected.
SQL Code
PROC SQL; SELECT prodnum, prodname, prodtype FROM PRODUCTS WHERE NOT EXISTS 1 (SELECT * FROM INVOICE WHERE PRODUCTS.prodnum = INVOICE.prodnum); 2 QUIT;
| The (inner) subquery receives its value(s) from the main (outer) query. With the value(s), the subquery runs and passes the results back to the main query where the WHERE clause and the NOT EXISTS condition are processed. |
| The inner query selects matching product and invoice information and passes it to the outer query. |
The SAS System Product Number Product Name Product Type ________________________________________________ ______ 1700 Travel Laptop Laptop 2102 Digital Cell Phone Phone 2200 Office Phone Phone 5003 Wordprocessor Software Software 5004 Graphics Software Software |
Correlated subqueries are useful for placing restrictions on the results of an entire query with a HAVING clause (or, when combined with a GROUP BY clause, of an entire group). Suppose you want to know which manufacturers have more than one invoiced product.
In the next example, the subquery compares the manufacturer number in the PRODUCTS table with the manufacturer number in the INVOICE table. A HAVING clause and a COUNT function are specified to select all manufacturers with two or more invoices. Because an aggregate (summary) function is used in an optional HAVING clause, a GROUP BY clause is not needed to select the manufacturers with two or more invoices. An EXISTS condition is specified in the outer query’s WHERE clause to capture only those manufacturers matching the subquery.
SQL Code
PROC SQL; SELECT prodnum, prodname, prodtype FROM PRODUCTS WHERE EXISTS 1 (SELECT * FROM INVOICE WHERE PRODUCTS.manunum = INVOICE.manunum HAVING COUNT(*) > 1); 2 QUIT;
Results
The SAS System Product Number Product Name Product Type _________________________________________ ___________ 5001 Spreadsheet Software Software 5002 Database Software Software 5003 Wordprocessor Software Software 5004 Graphics Software Software |