Grouping Data

So far, you've learned how to use aggregates to get overall totals (and averages, and so on) for a set of rows. If you wanted to know the total dollar sales for product 14, you could execute the query we wrote earlier:

select sum(Quantity * UnitPrice * (1.0 – Discount))
                     as 'Total Dollar Sales'
  from [Order Details]
 where ProductID = 14

What if you were also interested in products 11, 12, and 13? You could reexecute the query several times, each time with a different product ID.

It would be better to write a single query and get a result set that looked like this:

ProductID   Total Dollar Sales
----------- ---------------------------------------
         11                      12901.770042419434
         12                      12257.660041809082
         13                      4960.4400224685669
         14                      7991.4900035858154

In the rest of today's session, you will learn how to use GROUP BY clauses to define groupings of aggregates for subtotalling.

GROUP BY Clauses

If you look closely at the last result set, you will notice that it combines both aggregate and nonaggregate data. The first column, ProductID, is a simple column value. The second column, Total Dollar Sales, is the result of a SUM() function.

If you execute the following query in which you combine aggregate and nonaggregate columns in the select list, you get the following error:

select ProductID,
       sum(Quantity * UnitPrice * (1.0 - Discount))
                     as 'Total Dollar Sales'
  from [Order Details]
 where ProductID between 11 and 14

Results:

Server: Msg 8118, Level 16, State 1, Line 1
Column 'Order Details.ProductID' is invalid in the select
list because it is not contained in an aggregate function
and there is no GROUP BY clause.

You need to instruct the server to group (aggregate) the data by the product ID:

select ProductID,
       sum(Quantity * UnitPrice * (1.0 - Discount))
                     as 'Total Dollar Sales'
  from [Order Details]
 where ProductID between 11 and 14
 group by ProductID

The GROUP BY clause instructs the server to build a worktable whose key is ProductID. The other column in the worktable is the aggregate. Each time SQL Server retrieves a row, it looks up the ProductID in the worktable. If the value for ProductID already exists, the server adds the Quantity * UnitPrice product to the running total. If the ProductID is not found in the worktable, the server adds a row with the new value, setting the running total equal to the Dollar Sales for the current row.

When SQL Server exhausts the rows in the source table, it presents the contents of the worktable to the user as the result set.

Grouping works with all the aggregate functions. Here is an example where we find the most recent order date for each customer:

select CustomerID, max(OrderDate)
from Orders
group by CustomerID

Results:

CustomerID
---------- ---------------------------
TOMSP      1998-03-23 00:00:00.000
LILAS      1998-05-05 00:00:00.000
GREAL      1998-04-30 00:00:00.000
HUNGO      1998-04-30 00:00:00.000
…
COMMI      1998-04-22 00:00:00.000
BONAP      1998-05-06 00:00:00.000

Each customer is listed along with the latest order date for that customer. Notice that the CustomerID values are not sorted.

Caution

One significant change to SQL Server 2000 occurred with the introduction of new methods of grouping data. In previous versions, grouped output would be sorted by the grouping column. SQL Server 2000 does not guarantee any specific order in grouped data.

Users should include an ORDER BY clause in every GROUP BY query in which the order of the result set matters. Existing systems that rely on the automatic sorting of grouped output must be fixed as part of the upgrade to SQL Server 2000 so that they do not rely on that sorting:

select CustomerID, max(OrderDate)
from Orders
group by CustomerID
order by CustomerID


Grouping with all

Transact-SQL allows the keyword all to force the result set to report rows for all possible grouping values. It makes sense to use all when the query includes a WHERE clause.

This query reports the number of employees hired since 1994. The report is organized by title.

select Title, count(*) 'Count'
  from Employees
 where HireDate >= '1/1/1994'
 group by Title

Results:

Title                          Count
------------------------------ -----------
Inside Sales Coordinator                 1
Sales Representative                     2

The report includes a row for a title only where an employee matches the search criterion. To get a complete report that includes a row for every job title, add the all keyword to the GROUP BY expression:

select Title, count(*) 'Count'
  from Employees
 where HireDate >= '1/1/1994'
 group by all Title

Results:

Title                          Count
------------------------------ -----------
Inside Sales Coordinator                 1
Sales Manager                            0
Sales Representative                     2
Vice President, Sales                    0

This is a complete list of job titles. Two rows show that no employees having those titles matched the search condition.

Using Multi-Level Grouping

You can group on multiple columns or expressions as well. In this example, we see the number of products for each supplier and category:

select SupplierID,
       CategoryID,
       count(*) as 'Num of Products'
  from Products
 where SupplierID in (1, 2, 4)
 group by
       SupplierID,
       CategoryID

Results:

SupplierID  CategoryID  Num of Products
----------- ----------- ---------------
          1           1               2
          1           2               1
          2           2               4
          4           6               1
          4           7               1
          4           8               1

With two levels of grouping, the server reports a count for every unique combination of supplier and category.

Grouping by Complex Expressions

So far, all the examples have used GROUP BY with a single column. You can also use a calculation or function for grouping. This query uses the DATEPART() function to extract the year from the date, and then reports the number of orders for each year:

select
       datepart(yy, OrderDate) as 'Year',
       count(*) as 'Orders'
  from orders
 group by
       datepart(yy, OrderDate)
 order by
       datepart(yy, OrderDate)

Results:

Year        Orders
----------- -----------
       1996         152
       1997         408
       1998         270

The key point in this example is that the expressions in the grouping clause and in the select list match exactly. If there were a mismatch between these expressions, the results would be skewed and difficult to understand.

Let's extend the year-by-year order breakdown we just worked on. In addition to grouping on the year, let's add the month. To do this, we'll use two levels of grouping: We'll group first on year, and then on month.

It's a simple extension of what we've done so far, add another expression for month to the select list along with the same expression in the GROUP BY clause. (Use the DATENAME() function with the month so that you can get the name of the date instead of its ordinal number.) Sort the data by year, and then by month, as shown in Listing 4.3.

Code Listing 4.3. A Select Statement with Multi-Level Grouping
 1: select
 2:        datepart(yy, OrderDate) as 'Year',
 3:        datename(mm, OrderDate) as 'Month',
 4:        count(*) as 'Orders'
 5:   from orders
 6:  group by
 7:        datepart(yy, OrderDate),
 8:        datename(mm, OrderDate)
 9:  order by
10:        datepart(yy, OrderDate),
11:        datename(mm, OrderDate)

When you run the query, you should get this result set:

Year        Month                          Orders
----------- ------------------------------ -----------
       1996 August                                  25
       1996 December                                31
       1996 July                                    22
       1996 November                                25
       1996 October                                 26
       1996 September                               23
       1997 April                                   31
       1997 August                                  33
       1997 December                                48
       1997 February                                29
       1997 January                                 33
       1997 July                                    33
       1997 June                                    30
       1997 March                                   30
       1997 May                                     32
       1997 November                                34
       1997 October                                 38
       1997 September                               37
       1998 April                                   74
       1998 February                                54
       1998 January                                 55
       1998 March                                   73
       1998 May                                     14

You can see that the results are broken out by year, and then by month. However, there is a problem: The months aren't sorted chronologically, they are sorted alphabetically by name.

Why didn't SQL Server sort the months properly? From the server's standpoint, the month names are just strings that must be put in order. You sort strings by using an alphabet, not a calendar.

To correct this output, you must display the name of the month, but sort by the number of the month. Listing 4.4 shows the corrected query.

Code Listing 4.4. This Query Attempts to Sort on the Month Number but Displays the Month Name
 1: select
 2:        datepart(yy, OrderDate) as 'Year',
 3:        datename(mm, OrderDate) as 'Month',
 4:        count(*) as 'Orders'
 5:   from Orders
 6:  where OrderDate < '1/1/2000'
 7:  group by
 8:        datepart(yy, OrderDate),
 9:        datename(mm, OrderDate)
10: order by
11:        datepart(yy, OrderDate),
12:        datepart(mm, OrderDate)

The query fails with this error message:

Server: Msg 8127, Level 16, State 1, Line 2
Column name 'Orders.OrderDate'is invalid in the ORDER BY
clause because it is not contained in either an aggregate
function or the GROUP BY clause.

The message warns that the OrderDate column is referenced in the ORDER BY clause but has not been properly referenced in the GROUP BY clause. Of course, the OrderDate column is actually listed twice in the GROUP BY clause: once for the year and once for the date. But the new sorting expression based on the ordinal date number, DATEPART(mm, OrderDate), contains a reference to OrderDate. OrderDate is not grouped.

Why must the server group by a value in order to sort on it? The server performs grouping activities first, and then sorts the grouped output. If DATEPART(mm, OrderDate) is not included in the temporary grouped output, SQL Server cannot sort on it.

Finally, to make this work, we must force the month DATEPART() expression into the grouped output. We can do that without displaying it in the ultimate result set, as shown in Listing 4.5.

Code Listing 4.5. This Query Sorts on the Month Number but Displays the Month Name
 1: select
 2:        datepart(yy, OrderDate) as 'Year',
 3:        datename(mm, OrderDate) as 'Month',
 4:        count(*) as 'Orders'
 5:   from Orders
 6:  where OrderDate < '1/1/2000'
 7:  group by
 8:        datepart(yy, OrderDate),
 9:        datename(mm, OrderDate),
10:        datepart(mm, OrderDate)
11:  order by
12:        datepart(yy, OrderDate),
13:        datepart(mm, OrderDate)

Here is the final output:

Year        Month                          Orders
----------- ------------------------------ -----------
       1996 July                                    22
       1996 August                                  25
       1996 September                               23
…
       1998 February                                54
       1998 March                                   73
       1998 April                                   74
       1998 May                                     14

By adding line 10 to the GROUP BY clause, the query worked. The data is sorted properly by year, and then sorted chronologically by month. The results are properly grouped. A little later today, you will learn how to turn this into a pivot table with months on the left and years along the top.

On Day 2, "Filtering and Sorting Data," we looked at the possibility of sorting on a column that is not included in the results. This is one time when it's appropriate not to display the ordering column (the month number). The name of the month enables a user to understand the sorting of the data without displaying the month number.

Using HAVING to Remove Grouped Rows

In the first half of this lesson, you have seen how a WHERE clause removes rows from a result set, but you could not use aggregates in WHERE clauses. A HAVING clause enables you to filter out rows after grouping has occurred, and you can refer to the results of aggregates in a HAVING clause.

Let's enhance what we did in Listing 4.4 by finding only months where there were 70 orders or more. To do that, add a HAVING clause that includes the condition

having count(*) >= 70

Notice that the HAVING clause appears between the GROUP BY and the ORDER BY clauses. To see this, check out line 11 in Listing 4.6.

Code Listing 4.6. The HAVING Clause Removes Months with Fewer than 70 Orders
 1: select
 2:        datepart(yy, OrderDate) as 'Year',
 3:        datename(mm, OrderDate) as 'Month',
 4:        count(*) as 'Orders'
 5:   from Orders
 6:  where OrderDate < '1/1/2000'
 7:  group by
 8:        datepart(yy, OrderDate),
 9:        datename(mm, OrderDate),
10:        datepart(mm, OrderDate)
11: having count(*) >= 70
12:  order by
13:        datepart(yy, OrderDate),
14:        datepart(mm, OrderDate)

The results include only two months.

Year        Month                          Orders
----------- ------------------------------ -----------
       1998 March                                   73
       1998 April                                   74
						

Combining Subtotals and Totals with ROLLUP and CUBE

You often need to display both subtotals and totals in the same result set. Transact-SQL provides the ROLLUP and CUBE operators to provide this extension to the grouping options. Let's examine a result set to understand the effect of these operators.

Grouping with ROLLUP

Here's a result set from an earlier example. It includes an additional row now whose ProductID is null. That row contains the sum of all the TotalDollarSales for all ProductIDs.

ProductID   Total Dollar Sales
----------- ---------------------------------------
         11                      12901.770042419434
         12                      12257.660041809082
         13                      4960.4400224685669
         14                      7991.4900035858154
NULL                             38111.360110282898

To get that additional row, add the ROLLUP operator to the GROUP BY clause:

select ProductID,
       sum(Quantity * UnitPrice) as 'Total Dollar Sales'
  from [Order Details]
 where ProductID between 11 and 14
 group by ProductID with rollup

Here is another example that uses the MAX() function. The last entry with a null customer ID reports the latest date reported for a customer.

select CustomerID, max(OrderDate)
from Orders
group by CustomerID with rollup

Results:

CustomerID
---------- ---------------------------
ALFKI      1998-04-09 00:00:00.000
ANATR      1998-03-04 00:00:00.000
ANTON      1998-01-28 00:00:00.000
AROUT      1998-04-10 00:00:00.000
BERGS      1998-03-04 00:00:00.000
…
WOLZA      1998-04-23 00:00:00.000
NULL       1998-05-06 00:00:00.000

Notice that the data is sorted by CustomerID even though we did not include an ORDER BY expression. When you use ROLLUP, it's important not to include ordering. Otherwise, the ROLLUP lines will be sorted to the top of the list.

So far, we've limited the results to queries with a single grouping expression. When there are multiple grouping expressions, the ROLLUP operator provides subtotals at each grouping level and an overall total at the highest level.

select SupplierID,
       CategoryID,
       count(*) as 'Num of Products'
  from Products
 where SupplierID in (1, 2, 4)
 group by
       SupplierID,
       CategoryID with rollup

Results:

SupplierID  CategoryID  Num of Products
----------- ----------- ---------------
          1           1               2
          1           2               1
          1 NULL                      3
          2           2               4
          2 NULL                      4
          4           6               1
          4           7               1
          4           8               1
          4 NULL                      3
NULL        NULL                     10

The ROLLUP operator has added a number of rows to this result set. The third row (SupplierID 1 and CategoryID null) provides an overall count of products for supplier 1 (all categories). There are similar rows for suppliers 2 and 4. The last row is an overall total for all suppliers and all categories.

Grouping with CUBE

Whereas the ROLLUP operator provides subtotals and totals that follow the hierarchy of your grouping expressions, the CUBE operator provides subtotals and totals for all possible combinations of your grouping expressions.

With a single grouping expression, CUBE and ROLLUP provide identical results. Here is the first ROLLUP example, presented this time with CUBE:

select ProductID,
       sum(Quantity * UnitPrice * (1 - Discount))
               as 'Total Dollar Sales'
  from [Order Details]
 where ProductID between 11 and 14
 group by ProductID with cube

Results:

ProductID   Total Dollar Sales
----------- ---------------------------------------
         11                      12901.770042419434
         12                      12257.660041809082
         13                      4960.4400224685669
         14                      7991.4900035858154
NULL                             38111.360110282898

The CUBE result set differs from ROLLUP when there are multiple grouping expressions. Here is the multi-level ROLLUP we looked at a moment ago, now implemented with CUBE:

select SupplierID,
       CategoryID,
       count(*) as 'Num of Products'
  from Products
 where SupplierID in (1, 2, 4)
 group by
       SupplierID,
       CategoryID with cube

Results:

SupplierID  CategoryID  Num of Products
----------- ----------- ---------------
          1           1               2
          1           2               1
          1 NULL                      3
          2           2               4
          2 NULL                      4
          4           6               1
          4           7               1
          4           8               1
          4 NULL                      3




NULL        NULL                     10
NULL                  1               2
NULL                  2               5
NULL                  6               1
NULL                  7               1
NULL                  8               1

The first 10 rows of results match exactly the results from the ROLLUP query. They include the summaries for suppliers 1, 2, and 4, as well as an overall summary for all suppliers and categories. The last five rows provide category summaries across all suppliers. Grouped output includes all possible combinations of supplier and category, as well as totals across each supplier and each category.

Identifying ROLLUP and CUBE Rows with grouping

There might be instances when you need to know whether a row was created as part of a ROLLUP or CUBE operation. Use the grouping function to display a 1 or 0 in the result set.

select SupplierID,
       CategoryID,
       count(*) as 'NumProducts',
       grouping(CategoryID) 'CategoryGrp',
       grouping(SupplierID) 'SupplierGrp'
  from Products
 where SupplierID in (1, 2, 4)
 group by
       SupplierID,
       CategoryID with cube

Results:

SupplierID  CategoryID  NumProducts CategoryGrp SupplierGrp
----------- ----------- ----------- ----------- -----------
          1           1           2           0           0
          1           2           1           0           0
          1 NULL                  3           1           0
          2           2           4           0           0
          2 NULL                  4           1           0
          4           6           1           0           0
          4           7           1           0           0
          4           8           1           0           0
          4 NULL                  3           1           0
NULL        NULL                 10           1           1
NULL                  1           2           0           1
NULL                  2           5           0           1
NULL                  6           1           0           1
NULL                  7           1           0           1
NULL                  8           1           0           1

Each row is tagged to indicate whether there was a rollup on that row for the category or supplier column. If an actual null SupplierID or CategoryID were included in the result set, this would be the only way to distinguish between CUBE or ROLLUP rows and actual data. This enables you to identify the column on which the data is actually being aggregated.

Applying ROLLUP

Adding the ROLLUP operator to the year and month orders report will expose some real-life application issues. Listing 4.7 lists the new query, now including ROLLUP.

Code Listing 4.7. Breakdowns by Year and Month with ROLLUP (First Try)
 1: select
 2:        datepart(yy, OrderDate) as 'Year',
 3:        datename(mm, OrderDate) as 'Month',
 4:        count(*) as 'Orders'
 5:   from Orders
 6:  where OrderDate < '1/1/2000'
 7:  group by
 8:        datepart(yy, OrderDate),
 9:        datepart(mm, OrderDate),
10:        datename(mm, OrderDate) with rollup
							

Notice that the ORDER BY clause has been removed from the query now that we've added ROLLUP in line 10. ROLLUP does its own sorting. The grouping order matters now, so make sure that the grouping is first by year, and then by numeric month (datepart), and last by alphabetic month (datename). Otherwise, ROLLUP will sort the data incorrectly.

Here are the results. ROLLUP added summary rows for each year as well as a grand total at the end. Unfortunately, it also appears to have added a duplicate row for each month with a month name of NULL.

Year        Month                          Orders
----------- ------------------------------ -----------
       1996 July                                    22
       1996 NULL                                    22
       1996 August                                  25
       1996 NULL                                    25
…
       1996 December                                31
       1996 NULL                                    31
       1996 NULL                                   152
       1997 January                                 33
       1997 NULL                                    33
 …
       1998 May                                     14
       1998 NULL                                    14
       1998 NULL                                   270
NULL        NULL                                   830

We need to find a way to remove those extra rows. Let's display the DATEPART expression that finds the numeric month in the result set. Listing 4.8 provides the revised query with the additional expression included in the select list.

Code Listing 4.8. Breakdowns by Year and Month with ROLLUP (Second Try). Displays the Value of the Month datepart Expression
 1: select
 2:        datepart(yy, OrderDate) as 'Year',
 3:        datename(mm, OrderDate) as 'Month',
 4:        datepart(mm, OrderDate) as 'Month Number',
 5:        count(*) as 'Orders'
 6:   from Orders
 7:  where OrderDate < '1/1/2000'
 8: group by
 9:        datepart(yy, OrderDate),
10:        datepart(mm, OrderDate),
11:        datename(mm, OrderDate) with rollup

Results:

Year        Month         Month Number Orders
----------- ------------- ------------ -----------
       1996 July                     7          22
       1996 NULL                     7          22
       1996 August                   8          25
       1996 NULL                     8          25
…
       1996 December                12          31
       1996 NULL                    12          31
       1996 NULL          NULL                 152
       1997 January                  1          33
       1997 NULL                     1          33
…
       1998 May                      5          14
       1998 NULL                     5          14
       1998 NULL          NULL                 270
NULL        NULL          NULL                 830
							

The query provides three levels of grouping, so ROLLUP provides three levels of output. The second and third levels of grouping (month name and month number) represent the same data. We need to suppress the extra grouped rows in the output.

To filter grouped data, we'll add a HAVING clause. (Rows created by ROLLUP aren't available at the time the WHERE clause is evaluated.) The extra rows have a non-null value for the month number and a null value for the month name. Listing 4.9 includes the HAVING clause to suppress the extra rows.

Code Listing 4.9. Breakdowns by Year and Month with ROLLUP (Final). The HAVING Clause Suppresses Unnecessary ROLLUP Data Generated by the Extra Grouping Expression
 1: select
 2:        datepart(yy, OrderDate) as 'Year',
 3:        datename(mm, OrderDate) as 'Month',
 4:        count(*) as 'Orders'
 5:   from Orders
 6:  where OrderDate < '1/1/2000'
 7: group by
 8:        datepart(yy, OrderDate),
 9:        datepart(mm, OrderDate),
10:        datename(mm, OrderDate) with rollup
11: having
12:        datepart(mm, OrderDate) is null
13:     or datename(mm, OrderDate) is not null
							

The HAVING clause eliminates rows where the numeric datepart is not null and the alphabetic datename is null. Those are the rows to be eliminated. The result set is now correct and complete:

Year        Month                          Orders
----------- ------------------------------ -----------
       1996 July                                    22
       1996 August                                  25
…
       1996 December                                31
       1996 NULL                                   152
       1997 January                                 33
…
       1998 May                                     14
       1998 NULL                                   270
NULL        NULL                                   830
							

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset