So far, you've learned how to use aggregates to get overall totals (and averages, and so on) for a set of rows. If you wanted to know the total dollar sales for product 14, you could execute the query we wrote earlier:
select sum(Quantity * UnitPrice * (1.0 – Discount)) as 'Total Dollar Sales' from [Order Details] where ProductID = 14
What if you were also interested in products 11, 12, and 13? You could reexecute the query several times, each time with a different product ID.
It would be better to write a single query and get a result set that looked like this:
ProductID Total Dollar Sales ----------- --------------------------------------- 11 12901.770042419434 12 12257.660041809082 13 4960.4400224685669 14 7991.4900035858154
In the rest of today's session, you will learn how to use GROUP BY clauses to define groupings of aggregates for subtotalling.
If you look closely at the last result set, you will notice that it combines both aggregate and nonaggregate data. The first column, ProductID, is a simple column value. The second column, Total Dollar Sales, is the result of a SUM() function.
If you execute the following query in which you combine aggregate and nonaggregate columns in the select list, you get the following error:
select ProductID, sum(Quantity * UnitPrice * (1.0 - Discount)) as 'Total Dollar Sales' from [Order Details] where ProductID between 11 and 14
Results:
Server: Msg 8118, Level 16, State 1, Line 1 Column 'Order Details.ProductID' is invalid in the select list because it is not contained in an aggregate function and there is no GROUP BY clause.
You need to instruct the server to group (aggregate) the data by the product ID:
select ProductID, sum(Quantity * UnitPrice * (1.0 - Discount)) as 'Total Dollar Sales' from [Order Details] where ProductID between 11 and 14 group by ProductID
The GROUP BY clause instructs the server to build a worktable whose key is ProductID. The other column in the worktable is the aggregate. Each time SQL Server retrieves a row, it looks up the ProductID in the worktable. If the value for ProductID already exists, the server adds the Quantity * UnitPrice product to the running total. If the ProductID is not found in the worktable, the server adds a row with the new value, setting the running total equal to the Dollar Sales for the current row.
When SQL Server exhausts the rows in the source table, it presents the contents of the worktable to the user as the result set.
Grouping works with all the aggregate functions. Here is an example where we find the most recent order date for each customer:
select CustomerID, max(OrderDate) from Orders group by CustomerID
Results:
CustomerID ---------- --------------------------- TOMSP 1998-03-23 00:00:00.000 LILAS 1998-05-05 00:00:00.000 GREAL 1998-04-30 00:00:00.000 HUNGO 1998-04-30 00:00:00.000 … COMMI 1998-04-22 00:00:00.000 BONAP 1998-05-06 00:00:00.000
Each customer is listed along with the latest order date for that customer. Notice that the CustomerID values are not sorted.
Caution
One significant change to SQL Server 2000 occurred with the introduction of new methods of grouping data. In previous versions, grouped output would be sorted by the grouping column. SQL Server 2000 does not guarantee any specific order in grouped data.
Users should include an ORDER BY clause in every GROUP BY query in which the order of the result set matters. Existing systems that rely on the automatic sorting of grouped output must be fixed as part of the upgrade to SQL Server 2000 so that they do not rely on that sorting:
select CustomerID, max(OrderDate) from Orders group by CustomerID order by CustomerID
Transact-SQL allows the keyword all to force the result set to report rows for all possible grouping values. It makes sense to use all when the query includes a WHERE clause.
This query reports the number of employees hired since 1994. The report is organized by title.
select Title, count(*) 'Count' from Employees where HireDate >= '1/1/1994' group by Title
Results:
Title Count ------------------------------ ----------- Inside Sales Coordinator 1 Sales Representative 2
The report includes a row for a title only where an employee matches the search criterion. To get a complete report that includes a row for every job title, add the all keyword to the GROUP BY expression:
select Title, count(*) 'Count' from Employees where HireDate >= '1/1/1994' group by all Title
Results:
Title Count ------------------------------ ----------- Inside Sales Coordinator 1 Sales Manager 0 Sales Representative 2 Vice President, Sales 0
This is a complete list of job titles. Two rows show that no employees having those titles matched the search condition.
You can group on multiple columns or expressions as well. In this example, we see the number of products for each supplier and category:
select SupplierID, CategoryID, count(*) as 'Num of Products' from Products where SupplierID in (1, 2, 4) group by SupplierID, CategoryID
Results:
SupplierID CategoryID Num of Products ----------- ----------- --------------- 1 1 2 1 2 1 2 2 4 4 6 1 4 7 1 4 8 1
With two levels of grouping, the server reports a count for every unique combination of supplier and category.
So far, all the examples have used GROUP BY with a single column. You can also use a calculation or function for grouping. This query uses the DATEPART() function to extract the year from the date, and then reports the number of orders for each year:
select datepart(yy, OrderDate) as 'Year', count(*) as 'Orders' from orders group by datepart(yy, OrderDate) order by datepart(yy, OrderDate)
Results:
Year Orders ----------- ----------- 1996 152 1997 408 1998 270
The key point in this example is that the expressions in the grouping clause and in the select list match exactly. If there were a mismatch between these expressions, the results would be skewed and difficult to understand.
Let's extend the year-by-year order breakdown we just worked on. In addition to grouping on the year, let's add the month. To do this, we'll use two levels of grouping: We'll group first on year, and then on month.
It's a simple extension of what we've done so far, add another expression for month to the select list along with the same expression in the GROUP BY clause. (Use the DATENAME() function with the month so that you can get the name of the date instead of its ordinal number.) Sort the data by year, and then by month, as shown in Listing 4.3.
When you run the query, you should get this result set:
Year Month Orders ----------- ------------------------------ ----------- 1996 August 25 1996 December 31 1996 July 22 1996 November 25 1996 October 26 1996 September 23 1997 April 31 1997 August 33 1997 December 48 1997 February 29 1997 January 33 1997 July 33 1997 June 30 1997 March 30 1997 May 32 1997 November 34 1997 October 38 1997 September 37 1998 April 74 1998 February 54 1998 January 55 1998 March 73 1998 May 14
You can see that the results are broken out by year, and then by month. However, there is a problem: The months aren't sorted chronologically, they are sorted alphabetically by name.
Why didn't SQL Server sort the months properly? From the server's standpoint, the month names are just strings that must be put in order. You sort strings by using an alphabet, not a calendar.
To correct this output, you must display the name of the month, but sort by the number of the month. Listing 4.4 shows the corrected query.
The query fails with this error message:
Server: Msg 8127, Level 16, State 1, Line 2 Column name 'Orders.OrderDate'is invalid in the ORDER BY clause because it is not contained in either an aggregate function or the GROUP BY clause.
The message warns that the OrderDate column is referenced in the ORDER BY clause but has not been properly referenced in the GROUP BY clause. Of course, the OrderDate column is actually listed twice in the GROUP BY clause: once for the year and once for the date. But the new sorting expression based on the ordinal date number, DATEPART(mm, OrderDate), contains a reference to OrderDate. OrderDate is not grouped.
Why must the server group by a value in order to sort on it? The server performs grouping activities first, and then sorts the grouped output. If DATEPART(mm, OrderDate) is not included in the temporary grouped output, SQL Server cannot sort on it.
Finally, to make this work, we must force the month DATEPART() expression into the grouped output. We can do that without displaying it in the ultimate result set, as shown in Listing 4.5.
Here is the final output:
Year Month Orders ----------- ------------------------------ ----------- 1996 July 22 1996 August 25 1996 September 23 … 1998 February 54 1998 March 73 1998 April 74 1998 May 14
By adding line 10 to the GROUP BY clause, the query worked. The data is sorted properly by year, and then sorted chronologically by month. The results are properly grouped. A little later today, you will learn how to turn this into a pivot table with months on the left and years along the top.
On Day 2, "Filtering and Sorting Data," we looked at the possibility of sorting on a column that is not included in the results. This is one time when it's appropriate not to display the ordering column (the month number). The name of the month enables a user to understand the sorting of the data without displaying the month number.
In the first half of this lesson, you have seen how a WHERE clause removes rows from a result set, but you could not use aggregates in WHERE clauses. A HAVING clause enables you to filter out rows after grouping has occurred, and you can refer to the results of aggregates in a HAVING clause.
Let's enhance what we did in Listing 4.4 by finding only months where there were 70 orders or more. To do that, add a HAVING clause that includes the condition
having count(*) >= 70
Notice that the HAVING clause appears between the GROUP BY and the ORDER BY clauses. To see this, check out line 11 in Listing 4.6.
The results include only two months.
Year Month Orders ----------- ------------------------------ ----------- 1998 March 73 1998 April 74
You often need to display both subtotals and totals in the same result set. Transact-SQL provides the ROLLUP and CUBE operators to provide this extension to the grouping options. Let's examine a result set to understand the effect of these operators.
Here's a result set from an earlier example. It includes an additional row now whose ProductID is null. That row contains the sum of all the TotalDollarSales for all ProductIDs.
ProductID Total Dollar Sales ----------- --------------------------------------- 11 12901.770042419434 12 12257.660041809082 13 4960.4400224685669 14 7991.4900035858154 NULL 38111.360110282898
To get that additional row, add the ROLLUP operator to the GROUP BY clause:
select ProductID, sum(Quantity * UnitPrice) as 'Total Dollar Sales' from [Order Details] where ProductID between 11 and 14 group by ProductID with rollup
Here is another example that uses the MAX() function. The last entry with a null customer ID reports the latest date reported for a customer.
select CustomerID, max(OrderDate) from Orders group by CustomerID with rollup
Results:
CustomerID ---------- --------------------------- ALFKI 1998-04-09 00:00:00.000 ANATR 1998-03-04 00:00:00.000 ANTON 1998-01-28 00:00:00.000 AROUT 1998-04-10 00:00:00.000 BERGS 1998-03-04 00:00:00.000 … WOLZA 1998-04-23 00:00:00.000 NULL 1998-05-06 00:00:00.000
Notice that the data is sorted by CustomerID even though we did not include an ORDER BY expression. When you use ROLLUP, it's important not to include ordering. Otherwise, the ROLLUP lines will be sorted to the top of the list.
So far, we've limited the results to queries with a single grouping expression. When there are multiple grouping expressions, the ROLLUP operator provides subtotals at each grouping level and an overall total at the highest level.
select SupplierID, CategoryID, count(*) as 'Num of Products' from Products where SupplierID in (1, 2, 4) group by SupplierID, CategoryID with rollup
Results:
SupplierID CategoryID Num of Products ----------- ----------- --------------- 1 1 2 1 2 1 1 NULL 3 2 2 4 2 NULL 4 4 6 1 4 7 1 4 8 1 4 NULL 3 NULL NULL 10
The ROLLUP operator has added a number of rows to this result set. The third row (SupplierID 1 and CategoryID null) provides an overall count of products for supplier 1 (all categories). There are similar rows for suppliers 2 and 4. The last row is an overall total for all suppliers and all categories.
Whereas the ROLLUP operator provides subtotals and totals that follow the hierarchy of your grouping expressions, the CUBE operator provides subtotals and totals for all possible combinations of your grouping expressions.
With a single grouping expression, CUBE and ROLLUP provide identical results. Here is the first ROLLUP example, presented this time with CUBE:
select ProductID, sum(Quantity * UnitPrice * (1 - Discount)) as 'Total Dollar Sales' from [Order Details] where ProductID between 11 and 14 group by ProductID with cube
Results:
ProductID Total Dollar Sales ----------- --------------------------------------- 11 12901.770042419434 12 12257.660041809082 13 4960.4400224685669 14 7991.4900035858154 NULL 38111.360110282898
The CUBE result set differs from ROLLUP when there are multiple grouping expressions. Here is the multi-level ROLLUP we looked at a moment ago, now implemented with CUBE:
select SupplierID, CategoryID, count(*) as 'Num of Products' from Products where SupplierID in (1, 2, 4) group by SupplierID, CategoryID with cube
Results:
SupplierID CategoryID Num of Products ----------- ----------- --------------- 1 1 2 1 2 1 1 NULL 3 2 2 4 2 NULL 4 4 6 1 4 7 1 4 8 1 4 NULL 3 NULL NULL 10 NULL 1 2 NULL 2 5 NULL 6 1 NULL 7 1 NULL 8 1
The first 10 rows of results match exactly the results from the ROLLUP query. They include the summaries for suppliers 1, 2, and 4, as well as an overall summary for all suppliers and categories. The last five rows provide category summaries across all suppliers. Grouped output includes all possible combinations of supplier and category, as well as totals across each supplier and each category.
There might be instances when you need to know whether a row was created as part of a ROLLUP or CUBE operation. Use the grouping function to display a 1 or 0 in the result set.
select SupplierID, CategoryID, count(*) as 'NumProducts', grouping(CategoryID) 'CategoryGrp', grouping(SupplierID) 'SupplierGrp' from Products where SupplierID in (1, 2, 4) group by SupplierID, CategoryID with cube
Results:
SupplierID CategoryID NumProducts CategoryGrp SupplierGrp ----------- ----------- ----------- ----------- ----------- 1 1 2 0 0 1 2 1 0 0 1 NULL 3 1 0 2 2 4 0 0 2 NULL 4 1 0 4 6 1 0 0 4 7 1 0 0 4 8 1 0 0 4 NULL 3 1 0 NULL NULL 10 1 1 NULL 1 2 0 1 NULL 2 5 0 1 NULL 6 1 0 1 NULL 7 1 0 1 NULL 8 1 0 1
Each row is tagged to indicate whether there was a rollup on that row for the category or supplier column. If an actual null SupplierID or CategoryID were included in the result set, this would be the only way to distinguish between CUBE or ROLLUP rows and actual data. This enables you to identify the column on which the data is actually being aggregated.
Adding the ROLLUP operator to the year and month orders report will expose some real-life application issues. Listing 4.7 lists the new query, now including ROLLUP.
Notice that the ORDER BY clause has been removed from the query now that we've added ROLLUP in line 10. ROLLUP does its own sorting. The grouping order matters now, so make sure that the grouping is first by year, and then by numeric month (datepart), and last by alphabetic month (datename). Otherwise, ROLLUP will sort the data incorrectly.
Here are the results. ROLLUP added summary rows for each year as well as a grand total at the end. Unfortunately, it also appears to have added a duplicate row for each month with a month name of NULL.
Year Month Orders ----------- ------------------------------ ----------- 1996 July 22 1996 NULL 22 1996 August 25 1996 NULL 25 … 1996 December 31 1996 NULL 31 1996 NULL 152 1997 January 33 1997 NULL 33 … 1998 May 14 1998 NULL 14 1998 NULL 270 NULL NULL 830
We need to find a way to remove those extra rows. Let's display the DATEPART expression that finds the numeric month in the result set. Listing 4.8 provides the revised query with the additional expression included in the select list.
Results:
Year Month Month Number Orders ----------- ------------- ------------ ----------- 1996 July 7 22 1996 NULL 7 22 1996 August 8 25 1996 NULL 8 25 … 1996 December 12 31 1996 NULL 12 31 1996 NULL NULL 152 1997 January 1 33 1997 NULL 1 33 … 1998 May 5 14 1998 NULL 5 14 1998 NULL NULL 270 NULL NULL NULL 830
The query provides three levels of grouping, so ROLLUP provides three levels of output. The second and third levels of grouping (month name and month number) represent the same data. We need to suppress the extra grouped rows in the output.
To filter grouped data, we'll add a HAVING clause. (Rows created by ROLLUP aren't available at the time the WHERE clause is evaluated.) The extra rows have a non-null value for the month number and a null value for the month name. Listing 4.9 includes the HAVING clause to suppress the extra rows.
The HAVING clause eliminates rows where the numeric datepart is not null and the alphabetic datename is null. Those are the rows to be eliminated. The result set is now correct and complete:
Year Month Orders ----------- ------------------------------ ----------- 1996 July 22 1996 August 25 … 1996 December 31 1996 NULL 152 1997 January 33 … 1998 May 14 1998 NULL 270 NULL NULL 830