Working with Aggregate Functions

All the functions you learned about yesterday are row-based: The function acts only on columns in the current row. Aggregate functions act on a set of rows. What are aggregate functions? You probably know them better as the summary, average, minimum, maximum, and count functions. "To aggregate" means to collect or to bring together. So, aggregate functions gather together data from many rows and report it in a single row (see Figure 4.1).

Figure 4.1. Aggregate functions gather values from many rows in a table into a single row in the result set.


SQL Server supports six aggregate functions:

  • SUM()—Total of all values; works with numeric data types

  • AVG()—Average of all values; works with numeric data types

  • MIN()—Minimum value; works with all data types

  • MAX()—Maximum value; works with all data types

  • COUNT()—Number of values; works with all data types

  • COUNT_BIG()—Number of values; works with all data types; returns a value of type bigint

In addition, SQL Server supports four additional numeric functions for statistical aggregation:

  • STDEV()—Standard deviation for the matching rows

  • STDEVP()—Standard deviation for the entire population

  • VAR()—Variance for the matching rows

  • VARP()—Variance for the entire population

Note

The statistical functions are used the same way you would use any aggregate functions.


We'll go through a number of examples, and then we'll look at how aggregates interact with DISTINCT, null values, and a variety of data types.

Using SUM() and AVG()

The SUM() and AVG() functions work with numeric expressions. As a quick review, here is the list of rows in the Order Details table for the product tofu (ProductID 14).

OrderID     ProductID   UnitPrice             Quantity Discount
----------- ----------- --------------------- -------- ------------------------
10249       14          18.6000               9        0.0
10325       14          18.6000               9        0.0
10333       14          18.6000               10       0.0
10375       14          18.6000               15       0.0
10393       14          18.6000               42       0.25
10409       14          18.6000               12       0.0
10412       14          18.6000               20       0.1
.
.
.
10794       14          23.2500               15       0.2
11076       14          23.2500               20       0.25
11077       14          23.2500               1        2.9999999E-2

(22 row(s) affected)

As you can see, there are many orders for tofu. Now, here's the total quantity of tofu (ProductID 14) sales in the Order Details table:

select sum(Quantity)
  from [Order Details]
 where ProductID = 14

Results:

-----------
        404

For each row found that matches the criteria (ProductID = 14), SQL Server evaluates the expression within the SUM() function (Quantity) and adds that result to the running total. When all the rows have been evaluated, the server reports the value in the running total. In this case, SQL Server totaled 404 tofu sales.

You can have multiple aggregate requests in a single query. Each aggregate operates over the same set of rows. I've also added a column alias for each function:

select sum(Quantity) as 'Qty Sold',
       avg(Quantity) as 'Avg Qty'
  from [Order Details]
 where ProductID = 14

Results:

Qty Sold    Avg Qty
----------- -----------
        404          18

Note

SQL Server returns an integer result for both the sum and average because the original Quantity column is stored as a small integer. You won't lose accuracy with a sum, but if you need the average of an integer data type to provide greater accuracy, you should change the expression to a floating point or numeric data type by using the cast or convert function. (See the section "Using Functions on Column Data" in Day 3, "Working with Columns." ) You can also multiply the Quantity column by 1.0, as in this example:

select sum(Quantity) as 'Qty Sold',
       avg(Quantity * 1.0) as 'Avg Qty'
  from [Order Details]
 where ProductID = 14

Results:

Qty Sold    Avg Qty
----------- ----------------------------------------
         404                                    18.363636


If no rows are found in the result set, aggregates return null. This query looks for sales of product 1400, which doesn't exist:

select sum(Quantity) as 'Qty Sold',
       avg(Quantity) as 'Avg Qty'
  from [Order Details]
 where ProductID = 1400

Results:

Qty Sold    Avg Qty
----------- -----------
NULL        NULL
						

Using Aggregates with Expressions

You can also use complex expressions as arguments in an aggregate. In the next set of examples, we'll compute an extended price for a detail line in Order Details, and then sum those extended prices. To determine the total extended price, you need to multiply the UnitPrice by the Quantity sold, and then apply the discount. This detailed result set shows how ExtendedPrice is derived for each row in the Order Details table (please note that to save space on the page, I removed a number of rows from the result set):

select Quantity,
       UnitPrice,
       Discount,
       Quantity * UnitPrice * (1.0 – Discount)
                    as 'ExtendedPrice'
  from [Order Details]
 where ProductID = 14

Results:

Quantity UnitPrice      Discount       ExtendedPrice
-------- -------------- -------------- -------------
       9        18.6000            0.0     167.39999
       9        18.6000            0.0     167.39999
      10        18.6000            0.0         186.0
      15        18.6000            0.0         279.0
      42        18.6000           0.25     585.90002
…
      20        23.2500           0.25        348.75
       1        23.2500   2.9999999E-2       22.5525

Tip

This result set is a critical step in developing complex queries with aggregates in Transact-SQL. The preceding result set shows the way a computation works. It details all the components of the computation and enables me to examine individual rows. That gives me a chance to make sure that the results are correct. After the data is rolled into an aggregate value, it's much harder to tell whether your results make sense.

Take the time to write queries from the bottom up. Work with small sets of output. Examine a few rows. Ask yourself, "Does this result make sense?" Then you can go ahead and apply aggregates.


In the first several rows, the unit price is $18.60; in the last two, the unit price is $23.25. Discounts vary from no discount to 30% off (the last value). The extended price expression seems to work correctly.

To find out the total dollar value of the sales of this product, I need to sum the ExtendedPrice expression:

select sum(Quantity * UnitPrice * (1.0 - Discount))
                     as 'Total Dollar Sales'
  from [Order Details]
 where ProductID = 14

Results:

Total Dollar Sales
-------------------------------------
                   7991.4900035858154

For each row matching the criteria, the server found the quantity and unit price, multiplied them together, applied the discount, and then added that result to the running total for Total Dollar Sales. After all the rows were processed, SQL Server reported the running total.

Note

Why is the Total Dollars Sales column displayed so strangely? When SQL Server performs a mathematical calculation involving different data types, it follows specific rules on the target data type for the result. Quantity is a smallint data type, UnitPrice is money (four decimal places, range up to 900 trillion), and Discount is a float column (15 significant digits, range up to 10^308 power).

When you multiply an integer value and a money value, you get a money result. When you multiply anything by a float, you get a float. The rules seek to avoid arithmetic overflow (exceeding the maximum size of the value allowed in a data type) by always using the largest type of those involved.


Using Aggregates in Complex Expressions

You can use aggregate functions in expressions after they have been calculated, but those expressions should contain other aggregates and constants only. To determine the average unit price for all sales of product 14, divide the Total Dollar Sales (the aggregate we created in the preceding example) by the Qty Sold:

select sum(Quantity) as 'Qty Sold',
       sum(Quantity * UnitPrice * (1.0 - Discount))
                     as 'Total Dollar Sales',
       sum(Quantity * UnitPrice * (1.0 - Discount)) /
                sum(Quantity) as 'Average Unit Price'
  from [Order Details]
 where ProductID = 14

Results:

Qty Sold    Total Dollar Sales       Average Unit Price
----------- ------------------------ -----------------------
        404       7991.4900035858154      19.780915850459941

First, the server determines the value for Total Dollar Sales and Qty Sold. As a last step before reporting the results, SQL Server performs the division to derive the Average Unit Price.

Using MIN() and MAX()

The functions MIN() and MAX() look for the lowest and highest values for a column or expression. Let's get the lowest and highest sale price for a product:

select min(UnitPrice) as 'Min Price',
       max(UnitPrice) as 'Max Price'
  from [Order Details]
 where ProductID = 14

Results:

Min Price             Max Price
--------------------- ---------------------
              18.6000               23.2500

Unlike SUM() and AVG(), MIN(), and MAX() work with any data type. This query reports the earliest recorded order date:

select min(OrderDate) as 'Earliest Order'
  from Orders

Results:

Earliest Order
---------------------------
1996-07-04 00:00:00.000

To find the earliest order date in February 1997, use this:

select min(OrderDate) as 'Earliest Order'
  from Orders
 where OrderDate >= '2/1/97'
   and OrderDate < '3/1/97'

Results:

Earliest Order
---------------------------
1997-02-03 00:00:00.000

MIN() and MAX() also work with binary data and with strings. This final example of MIN() and MAX() finds the last product name in the price list:

select max(ProductName)
from Products

Results:

----------------------------------------
Zaanse koeken

Using Aggregates in a WHERE Clause

You have just used an aggregate function to find the earliest order date, but after you find that date, you probably want to know more details about what happened on the date. It's tempting to insert the aggregate function into a WHERE clause like this:

/* WARNING BAD EXAMPLE DOESN'T WORK */
/* DANGER DANGER DANGER */
select OrderID, CustomerID, EmployeeID
from Orders
where OrderDate = min(OrderDate)

Go ahead and try it. Doing so will give you the chance to provoke another stern error message from the server:

Server: Msg 147, Level 15, State 1, Line 5
An aggregate may not appear in the WHERE clause unless
it is in a subquery contained in a HAVING clause or a
select list, and the column being aggregated is an
outer reference.

You probably have no idea what the error message is trying to tell you. So, let's simplify the message:

An aggregate may not appear in the WHERE clause unless it is in a subquery.

Note

What's a subquery? It's a query used within another query. Normally the server will get the value of the subquery first, and then use that value in the outer query. Day 6, "Using Subqueries," will go into detail on subqueries. For now, you need to learn just one way to use a subquery.


If you want to use the result of an aggregate function in the WHERE clause, you need to tell the server to retrieve the value of the aggregate first, and then use that value in the search condition in the main query. The subquery is enclosed in parentheses, as shown here:

/* This example works just fine */
select OrderID, CustomerID, EmployeeID
from Orders
where OrderDate =
      (
      select min(OrderDate)
        from Orders
      )

Results:

OrderID     CustomerID EmployeeID
----------- ---------- -----------
      10248 VINET                5

Armed with the earliest date, the server was able to find an order from that date. Notice that the minimum Order Date was never returned as part of the result set. The server uses results from subqueries internally; they are not displayed to the user. If you want to see the data being used, you need to add the Order date to the main SELECT statement as shown:

/* This example works just fine */
select OrderID, OrderDate, CustomerID, EmployeeID
from Orders
where OrderDate =
      (
      select min(OrderDate)
        from Orders
      )

Results:

OrderID     OrderDate   CustomerID EmployeeID
----------- ----------- ---------- -----------
10248     1996-07-04 00:00:00.000   VINET      5

I can also add the February 1997 condition to the subquery statement to find out information about the earliest order in that month.

/* This example works just fine */
select OrderID, CustomerID, EmployeeID
from Orders
where OrderDate =
      (
      select min(OrderDate)
        from Orders
       where OrderDate >= '2/1/97'
         and OrderDate <  '3/1/97'
      )

Results:

OrderID     CustomerID EmployeeID
----------- ---------- -----------
      10433 PRINI                3
      10434 FOLKO                3

The result set includes two rows, not one. Even though the aggregate function returned only a single value for the minimum date in February, there were two rows that matched the condition in the main query. After the subquery has derived the earliest date (February 3, 1997), that value is substituted for the entire subquery and this outer query runs:

select OrderID, CustomerID, EmployeeID
from Orders
where OrderDate = '1997-02-03 00:00:00.000'

Using COUNT()

The COUNT() function enables you to find the number of values or rows that match the search conditions. How many products are on the price list? How many employees are there? How many times was tofu purchased in February 1997? When you are interested in how many times a value appears in a list, you should use the COUNT() aggregate.

How many orders in February 1997?

select count(OrderID) as 'Num of Orders'
  from Orders
 where OrderDate >= '2/1/97'
   and OrderDate < '3/1/97'

Results:

Num of Orders
-------------
           29

How many employees?

select count(EmployeeID)
from Employees

Results:

-----------
          9
						

Using COUNT(*)

You might be wondering how to choose the column for the COUNT() function. The real question you asked in the previous queries was "How many rows are there?" The COUNT() aggregate enables you to use the asterisk as an argument to ask for a count of rows:

select count(*)
from Employees

Results:

-----------
          9

COUNT(*) is faster than COUNT(EmployeeID). SQL Server doesn't need to open up the row, and then find and parse the EmployeeID. It simply checks whether the row matches the search conditions, and then increments its internal counter.

When should you use COUNT() with a column name? When the column might contain nulls. In that case, you can count how many rows contain non-null values for that column. This query determines how many customers have a non-null value in the Region column:

select count(Region) "Customers with Region"
  from Customers

Results:

Customers with Region
---------------------
                   31

(1 row(s) affected)

Warning: Null value eliminated by aggregate or other set
operation.

The server warned me that the aggregate does not include some rows because the COUNT(Region) function skipped them.

Using COUNT_BIG()

The COUNT_BIG() aggregate function was introduced as part of the SQL Server 2000 release. It is intended to help in applications where row counts exceed the 2.1 billion limit imposed by the integer (int) data type.

The only difference between the COUNT_BIG() and COUNT() functions is their return type. The COUNT() function returns an int, so it can go as high as 2.1 billion (2^31). COUNT_BIG() returns a bigint, so it can count up to 9,223,372,036,854,775,808 (2^63 or 9 quintillion!) rows.

Implementing Error Handling for Aggregates

The most common error with aggregates is arithmetic overflow during a sum or count operation. Arithmetic overflow occurs when the number assigned to a variable or column is out of the range of possible values for that data type. For example, a smallint data type is limited to +/- 32,677. Inserting a value of 40,000 into a smallint column will cause the insert statement to fail and report an arithmetic overflow.

SQL Server tries to avoid overflows by automatically changing numeric data types to the highest related data type. For example, if you are summing smallint values, the server automatically returns an integer (int). The greater range of the integer data type will usually prevent an overflow.

Numeric data types allow you to define both a precision (the number of digits in the number) and a scale (the number of decimal positions). When you sum numeric data types, the server automatically returns a sum with the highest number of digits that the data type can hold (38).

You will learn all about data types on Day 8, "Defining Data," when you learn how to define tables and columns.

Understanding Aggregates and Null

Aggregates act as if nulls don't exist. Nulls are not counted as zeros for sums or averages. They are ignored for mins and maxes. And they will be ignored for COUNT().

To clarify this point, let's use a very simple temporary table called #Samples. Listing 4.1 provides the script to create and populate this table.

Code Listing 4.1. Script to Create and Populate a #Samples Temporary Table
 1: set nocount on
 2: go
 3: create table #Samples
 4: (
 5:     id int primary key,
 6:     value int null
 7: )
 8: go
 9: insert #Samples (id, value) values (1, 10)
10: insert #Samples (id, value) values (2, 15)
11: insert #Samples (id, value) values (3, 20)
12: insert #Samples (id, value) values (4, null)
13: go
14: set nocount off
15: go
16: select * from #Samples

Here's the result returned by the last SELECT statement:

      id          value
----------- -----------
          1          10
          2          15
          3          20
          4 NULL
						

Now we can work with the #Samples table to experiment with aggregates and null. Run the query in Listing 4.2 to see how aggregates treat the null in the value column:

Code Listing 4.2. How Aggregates Treat Null Values
1: select
2:       sum(value) as 'Sum',
3:       avg(value) as 'Avg',
4:       min(value) as 'Min',
5:       max(value) as 'Max',
6:       count(value) as 'Count',
7:       count(*) as 'Count(*)'
8: from #Samples

Results:

Sum      Avg      Min      Max      Count    Count(*)
-------- -------- -------- -------- -------- ---------
      45       15       10       20        3         4

(1 row(s) affected)

Warning: Null value eliminated by aggregate or other set
operation.

The null value in the fourth row is not included in the aggregate results. Sum is 45 because the null in row 4 is ignored. Notice that null is neither the max value nor the min (null is not an implied zero). COUNT() returns 3 because only three rows have non-null values in them. Notice that the average is computed by deriving SUM(value) and COUNT(value), and then finding the quotient (SUM/COUNT).

Why does COUNT(*) return 4 when one of the rows is null? count(*)counts only rows, not columns or other expressions.

Don't forget to drop the temporary table when you are finished using it.

drop table #Samples

Computing Aggregates with DISTINCT

You can use DISTINCT to compute aggregates of unique values for a column. DISTINCT appears most commonly with COUNT(), although it works the same way with the other aggregate functions.

The expression COUNT(DISTINCT UnitPrice) asks SQL Server to determine the number of unique values that are in a set of rows for the UnitPrice column. Use COUNT(DISTINCT <column name>) to answer questions such as "How many customers ordered in November 1997? How many dates are there in 1998 when we had sales? How many users visited our Web site?" Use DISTINCT when you are looking for the number of unique instances in a place where duplicates are permitted.

How many customers ordered in November 1997?

select count(distinct CustomerID) 'Customers'
  from Orders
 where OrderDate >= '11/1/1997'
   and OrderDate < '12/1/1997'

Results:

Customers
-----------
         27

How many dates are there in 1998 when we had sales?

select count(distinct OrderDate) as 'Number of Dates in 1998'
  from Orders
 where OrderDate >= '1/1/1998'
   and OrderDate < '1/1/1999'

Results:

Number of Dates in 1998
-----------------------
                     90
						

Tip

Remember that the sample data in the Northwind database is not representative of real life. The dates have no time component, which probably won't be true for all the dates in the databases you will build and support.

To find out the number of unique dates in a table where the dates include a non-zero time component, you need to change the argument for the COUNT() function to strip off the time component. Here's the query in real life:

select count(distinct convert(varchar(12), OrderDate, 101))
as 'Number of Dates in 1998'
  from Orders
 where OrderDate >= '1/1/1998'
   and OrderDate < '1/1/1999'

The convert function uses format 101 to return the date as a string in the form MM/DD/YYYY. If there were a time component, this conversion process would strip off the time, leaving only the date.

As it turns out, the formatted strings are never converted back to a date data type, but that doesn't matter. The query is interested only in how many different values appear in the table, not in the content of the data.


Avoid using DISTINCT with MIN() and MAX(). There is a performance penalty for using DISTINCT. The server must perform a sorting and elimination step before it can do the aggregate processing. That extra step is expensive, and MIN() and MAX() always return the same result with or without DISTINCT.

Reviewing Aggregates

In the next section, you will learn to generate subtotals and other sub-values by using aggregates while grouping the data. As you continue to work with these functions, remember these key points about aggregate functions:

  • Aggregate functions return one value representing one or many rows.

  • Aggregates never include nulls.

  • Aggregates are not permitted in the WHERE clause.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset