The recipes in this book take full advantage of the window functions added to the ISO SQL standard in 2003, as well as vendor-specific window functions. This appendix is meant to serve as a brief overview of how window functions work. Window functions make many typically difficult tasks (difficult to solve using standard SQL, that is) quite easy. For a complete list of window functions available, full syntax, and in-depth coverage of how they work, please consult your vendor’s documentation.
Before moving on to window functions, it is crucial that you understand how grouping works in SQL -the concept of grouping results in SQL can be difficult to master . The problems stem from not fully understanding how the GROUP BY clause works and why certain queries return certain results when using GROUP BY.
Simply stated, grouping is a way to organize like rows together. When you use GROUP BY in a query, each row in the result set is a group and represents one or more rows with the same values in one or more columns that you specify. That’s the gist of it.
If a group is simply a unique instance of a row that represents one or more rows with the same value for a particular column (or columns), then practical examples of groups from table EMP include all employees in department 10 (the common value for these employees that enable them to be in the same group is DEPTNO=10) or all clerks (the common value for these employees that enable them to be in the same group is JOB=CLERK). Consider the following queries. The first shows all employees in department 10; the second query groups the employees in department 10 and returns the following information about the group: the number of rows (members) in the group, the highest salary, and the lowest salary:
select deptno,ename
from emp
where deptno=10
DEPTNO ENAME ------ ---------- 10 CLARK 10 KING 10 MILLERselect deptno,
count(*) as cnt,
max(sal) as hi_sal,
min(sal) as lo_sal
from emp
where deptno=10
group by deptno
DEPTNO CNT HI_SAL LO_SAL ------ ---------- ---------- ---------- 10 3 5000 1300
If you were not able to group the employees in department 10 together, to get the information in the second query above you would have to manually inspect the rows for that department (trivial if there are only three rows, but what if there were three million rows?). So, why would anyone want to group? Reasons for doing so vary; perhaps you want to see how many different groups exist or how many members (rows) are in each group. As you can see from the simple example above, grouping allows you to get information about many rows in a table without having to inspect them one by one.
In mathematics, a group is defined, for the most part, as (G, •, e), where G is a set, • is a binary operation in G, and e is a member of G. We will use this definition as the foundation for what a SQL group is. A SQL group will be defined as (G, e), where G is a result set of a single or self-contained query that uses GROUP BY, e is a member of G, and the following axioms are satisfied:
For each e in G, e is distinct and represents one or more instances of e.
For each e in G, the aggregate function COUNT returns a value > 0.
The result set is included in the definition of a SQL group to reinforce the fact that we are defining what groups are when working with queries only. Thus, it would be accurate to replace “e” in each axiom with the word “row” because the rows in the result set are technically the groups.
Because these properties are fundamental to what we consider a group, it is important that we prove they are true (and we will proceed to do so through the use of some example SQL queries).
By its very definition, a group must have at least one member (or row). If we accept this as a truth, then it can be said that a group cannot be created from an empty table. To prove that proposition true, simply try to prove it is false. The following example creates an empty table, and then attempts to create groups via three different queries against that empty table:
create table fruits (name varchar(10))
select name
from fruits
group by name
(no rows selected)select count(*) as cnt
from fruits
group by name
(no rows selected)select name, count(*) as cnt
from fruits
group by name
(no rows selected)
As you can see from these queries, it is impossible to create what SQL considers a group from an empty table.
Now let’s prove that the groups created via queries with a GROUP BY clause are distinct. The following example inserts five rows into table FRUITS, and then creates groups from those rows:
insert into fruits values ('Oranges')
insert into fruits values ('Oranges')
insert into fruits values ('Oranges')
insert into fruits values ('Apple')
insert into fruits values ('Peach')
select *
from fruits
NAME -------- Oranges Oranges Oranges Apple Peachselect name
from fruits
group by name
NAME ------- Apple Oranges Peachselect name, count(*) as cnt
from fruits
group by name
NAME CNT ------- -------- Apple 1 Oranges 3 Peach 1
The first query shows that “Oranges” occurs three times in table FRUITS. However, the second and third queries (using GROUP BY) return only one instance of “Oranges.” Taken together, these queries prove that the rows in the result set (e in G, from our definition) are distinct, and each value of NAME represents one or more instances of itself in table FRUITS.
Knowing that groups are distinct is important because it means, typically, you would not use the DISTINCT keyword in your SELECT list when using a GROUP BY in your queries.
We don’t pretend GROUP BY and DISTINCT are the same. They represent two completely different concepts. We do state that the items listed in the GROUP BY clause will be distinct in the result set and that using DISTINCT as well as GROUP BY is redundant.
The queries and results in the preceding section also prove the final axiom that the aggregate function COUNT will never return zero when used in a query with GROUP BY on a nonempty table. It should not be surprising that you cannot return a count of zero for a group. We have already proved that a group cannot be created from an empty table, thus a group must have at least one row. If at least one row exists, then the count will always be at least 1.
Remember, we are talking about using COUNT with GROUP BY, not COUNT by itself. A query using COUNT without a GROUP BY on an empty table will of course return zero.
“Hardly anything more unfortunate can befall a scientific writer than to have one of the foundations of his edifice shaken after the work is finished…. This was the position I was placed in by a letter of Mr. Bertrand Russell, just when the printing of this volume was nearing its completion.”
The preceding quote is from Gottlob Frege in response to Bertrand Russell’s discovery of a contradiction to Frege’s axiom of abstraction in set theory.
Paradoxes many times provide scenarios that would seem to contradict established theories or ideas. In many cases these contradictions are localized and can be “worked around,” or they are applicable to such small test cases that they can be safely ignored.
You may have guessed by now that the point to all this discussion of paradoxes is that there exists a paradox concerning our definition of an SQL group, and that paradox must be addressed. Although our focus right now is on groups, ultimately we are discussing SQL queries. In its GROUP BY clause, a query may have a wide range of values such as constants, expressions, or, most commonly, columns from a table. We pay a price for this flexibility, because NULL is a valid “value” in SQL. NULLs present problems because they are effectively ignored by aggregate functions. With that said, if a table consists of a single row and its value is NULL, what would the aggregate function COUNT return when used in a GROUP BY query? By our very definition, when using GROUP BY and the aggregate function COUNT, a value >= 1 must be returned. What happens, then, in the case of values ignored by functions such as COUNT, and what does this mean to our definition of a GROUP? Consider the following example, which reveals the NULL group paradox (using the function COALESCE when necessary for readability):
select *
from fruits
NAME ------- Oranges Oranges Oranges Apple Peachinsert into fruits values (null)
insert into fruits values (null)
insert into fruits values (null)
insert into fruits values (null)
insert into fruits values (null)
select coalesce(name,'NULL') as name
from fruits
NAME -------- Oranges Oranges Oranges Apple Peach NULL NULL NULL NULL NULLselect coalesce(name,'NULL') as name,
count(name) as cnt
from fruits
group by name
NAME CNT -------- ---------- Apple 1 NULL 0 Oranges 3 Peach 1
It would seem that the presence of NULL values in our table introduces a contradiction, or paradox, to our definition of a SQL group. Fortunately, this contradiction is not a real cause for concern, because the paradox has more to do with the implementation of aggregate functions than our definition. Consider the final query in the preceding set; a general problem statement for that query would be:
Count the number of times each name occurs in table FRUITS or count the number of members in each group.
Examining the INSERT statements above, it’s clear that there are five rows with NULL values, which means there exists a NULL group with five members.
While NULL certainly has properties that differentiate it from other values, it is nevertheless a value, and can in fact be a group.
How, then, can we write the query to return a count of 5 instead of 0, thus returning the information we are looking for while conforming to our definition of a group? The example below shows a workaround to deal with the NULL group paradox:
select coalesce(name,'NULL') as name,
count(*) as cnt
from fruits
group by name
NAME CNT --------- -------- Apple 1 Oranges 3 Peach 1 NULL 5
The workaround is to use COUNT() rather than COUNT(NAME) to avoid the NULL group paradox. Aggregate functions will ignore NULL values if any exist in the column passed to them. Thus, to avoid a zero when using COUNT do not pass the column name; instead, pass in an asterisk (). The * causes the COUNT function to count rows rather than the actual column values, so whether or not the actual values are NULL or not NULL is irrelevant.
One more paradox has to do with the axiom that each group in a result set (for each e in G) is distinct. Because of the nature of SQL result sets and tables, which are more accurately defined as multisets or “bags,” not sets (because duplicate rows are allowed), it is possible to return a result set with duplicate groups. Consider the following queries:
select coalesce(name,'
NULL') as name,
count(*) as cnt
from fruits
group by name
union all
select coalesce(name,'NULL') as name,
count(*) as cnt
from fruits
group by name
NAME CNT ---------- --------- Apple 1 Oranges 3 Peach 1 NULL 5 Apple 1 Oranges 3 Peach 1 NULL 5select x.*
from (
select coalesce(name,'NULL') as name,
count(*) as cnt
from fruits
group by name
) x,
(select deptno from dept) y
NAME CNT ---------- ---------- Apple 1 Apple 1 Apple 1 Apple 1 Oranges 3 Oranges 3 Oranges 3 Oranges 3 Peach 1 Peach 1 Peach 1 Peach 1 NULL 5 NULL 5 NULL 5 NULL 5
As you can see in these queries, the groups are in fact repeated in the final results. Fortunately, this is not much to worry about because it represents only a partial paradox. The first property of a group states that for (G, e), G is a result set from a single or self-contained query that uses GROUP BY. Simply put, the result set from any GROUP BY query itself conforms to our definition of a group. It is only when you combine the result sets from two GROUP BY queries to create a multiset that groups may repeat. The first query in the preceding example uses UNION ALL, which is not a set operation but a multiset operation, and invokes GROUP BY twice, effectively executing two queries.
If you use UNION, which is a set operation, you will not see repeating groups.
The second query in the preceding set uses a Cartesian product, which only works if you materialize the group first and then perform the Cartesian. Thus the GROUP BY query when self-contained conforms to our definition. Neither of the two examples takes anything away from the definition of a SQL group. They are shown for completeness, and so that you can be aware that almost anything is possible in SQL.
With the concept of a group defined and proved, it is now time to move on to more practical matters concerning queries using GROUP BY. It is important to understand the relationship between the SELECT clause and the GROUP BY clause when grouping in SQL. It is important to keep in mind when using aggregate functions such as COUNT that any item in your SELECT list that is not used as an argument to an aggregate function must be part of your group. For example, if you write a SELECT clause such as:
select deptno, count(*) as cnt from emp
then you must list DEPTNO in your GROUP BY clause:
select deptno, count(*) as cnt
from emp
group by deptno
DEPTNO CNT ------- ---- 10 3 20 5 30 6
Constants, scalar values returned by user-defined functions, window functions, and non-correlated scalar subqueries are exceptions to this rule. Since the SELECT clause is evaluated after the GROUP BY clause, these constructs are allowed in the SELECT list and do not have to (and in some cases cannot) be specified in the GROUP BY clause. For example:
select 'hello' as msg,
1 as num,
deptno,
(
select count(*) from emp) as total,
count(*) as cnt
from emp
group by deptno
MSG NUM DEPTNO TOTAL CNT ----- --- ------ ----- --- hello 1 10 14 3 hello 1 20 14 5 hello 1 30 14 6
Don’t let this query confuse you. The items in the SELECT list not listed in the GROUP BY clause do not change the value of CNT for each DEPTNO, nor do the values for DEPTNO change. Based on the results of the preceding query, we can define the rule about matching items in the SELECT list and the GROUP BY clause when using aggregates a bit more precisely:
Items in a SELECT list that can potentially change the group or change the value returned by an aggregate function must be included in the GROUP BY clause.
The additional items in the preceding SELECT list did not change the value of CNT for any group (each DEPTNO), nor did they change the groups themselves.
Now it’s fair to ask: exactly what items in a SELECT list can change a grouping or the value returned by an aggregate function? The answer is simple: other columns from the table(s) you are selecting from. Consider the prospect of adding the JOB column to the query we’ve been looking at:
select deptno, job, count(*) as cnt
from emp
group by deptno, job
DEPTNO JOB CNT ------ ---------- ---- 10 CLERK 1 10 MANAGER 1 10 PRESIDENT 1 20 CLERK 2 20 ANALYST 2 20 MANAGER 1 30 CLERK 1 30 MANAGER 1 30 SALESMAN 4
By listing another column, JOB, from table EMP, we are changing the group and changing the result set; thus we must now include JOB in the GROUP BY clause along with DEPTNO, otherwise the query will fail. The inclusion of JOB in the SELECT/GROUP BY clauses changes the query from “How many employees are in each department?” to “How many different types of employees are in each department?” Notice again that the groups are distinct; the values for DEPTNO and JOB individually are not distinct, but the combination of the two (which is what is in the GROUP BY and SELECT list, and thus is the group) are distinct (e.g., 10 and CLERK appear only once).
If you choose not to put items other than aggregate functions in the SELECT list, then you may list any valid column you wish, in the GROUP BY clause. Consider the following two queries, which highlight this fact:
select count(*)
from emp
group by deptno
COUNT(*) --------- 3 5 6select count(*)
from emp
group by deptno,job
COUNT(*) ---------- 1 1 1 2 2 1 1 1 4
Including items other than aggregate functions in the SELECT list is not mandatory, but often improves readability and usability of the results.
As a rule, when using GROUP BY and aggregate functions, any items in the SELECT list [from the table(s) in the FROM clause] not used as an argument to an aggregate function must be included in the GROUP BY clause. However, MySQL has a “feature” that allows you to deviate from this rule, allowing you to place items in your SELECT list [that are columns in the table(s) you are selecting from] that are not used as arguments to an aggregate function and that are not present in your GROUP BY clause. We use the term “feature” very loosely here as its use is a bug waiting to happen - to avoid it. As a matter of fact, if you use MySQL and care at all about the accuracy of your queries we suggest you urge them to remove this, ahem, “feature.”
Once you understand the concept of grouping and using aggregates in SQL, understanding window functions is easy. Window functions, like aggregate functions, perform an aggregation on a defined set (a group) of rows, but rather than returning one value per group, window functions can return multiple values for each group. The group of rows to perform the aggregation on is the window (hence the name “window functions”). DB2 actually calls such functions online analytic processing (OLAP) functions, and Oracle calls them analytic functions, but the ISO SQL standard calls them window functions, so that’s the term used in this book.
Let’s say that you wish to count the total number of employees across all departments. The traditional method for doing that is to issue a COUNT(*) query against the entire EMP table:
select count(*) as cnt
from emp
CNT ----- 14
This is easy enough, but often you will find yourself wanting to access such aggregate data from rows that do not represent an aggregation, or that represent a different aggregation. Window functions make light work of such problems. For example, the following query shows how you can use a window function to access aggregate data (the total count of employees) from detail rows (one per employee):
select ename,
deptno,
count(*) over() as cnt
from emp
order by 2
ENAME DEPTNO CNT ---------- ------ ------ CLARK 10 14 KING 10 14 MILLER 10 14 SMITH 20 14 ADAMS 20 14 FORD 20 14 SCOTT 20 14 JONES 20 14 ALLEN 30 14 BLAKE 30 14 MARTIN 30 14 JAMES 30 14 TURNER 30 14 WARD 30 14
The window function invocation in this example is COUNT(*) OVER(). The presence of the OVER keyword indicates that the invocation of COUNT will be treated as a window function, not as an aggregate function. In general, the SQL standard allows for all aggregate functions to also be window functions, and the keyword OVER is how the language distinguishes between the two uses.
So, what did the window function COUNT(*) OVER () do exactly? For every row being returned in the query, it returned the count of all the rows in the table. As the empty parentheses suggest, the OVER keyword accepts additional clauses to affect the range of rows that a given window function considers. Absent any such clauses, the window function looks at all rows in the result set, which is why you see the value 14 repeated in each row of output.
Hopefully you begin to see the great utility of window functions, which is that they allow you to work with multiple levels of aggregation in one row. As you continue through this appendix, you’ll begin to see even more just how incredibly useful that ability can be.
Before digging deeper into the OVER clause, it is important to note that window functions are performed as the last step in SQL processing prior to the ORDER BY clause. As an example of how window functions are processed last, let’s take the query from the preceding section and use a WHERE clause to filter out employees from DEPTNO 20 and 30:
select ename,
deptno,
count(*) over() as cnt
from emp
where deptno = 10
order by 2
ENAME DEPTNO CNT ---------- ------ ------ CLARK 10 3 KING 10 3 MILLER 10 3
The value for CNT for each row is no longer 14, it is now 3. In this example, it is the WHERE clause that restricts the result set to three rows, hence the window function will count only three rows (there are only three rows available to the window function by the time processing reaches the SELECT portion of the query). From this example you can see that window functions perform their computations after clauses such as WHERE and GROUP BY are evaluated.
Use the PARTITION BY clause to define a partition or group of rows to perform an aggregation over. As we’ve seen already, if you use empty parentheses then the entire result set is the partition that a window function aggregation will be computed over. You can think of the PARTITION BY clause as a “moving GROUP BY” because unlike a traditional GROUP BY, a group created by PARTITION BY is not distinct in a result set. You can use PARTITION BY to compute an aggregation over a defined group of rows (resetting when a new group is encountered) and rather than having one group represent all instances of that value in the table, each value (each member in each group) is returned. Consider the following query:
select ename,
deptno,
count(*) over(
partition by deptno) as cnt
from emp
order by 2
ENAME DEPTNO CNT ---------- ------ ------ CLARK 10 3 KING 10 3 MILLER 10 3 SMITH 20 5 ADAMS 20 5 FORD 20 5 SCOTT 20 5 JONES 20 5 ALLEN 30 6 BLAKE 30 6 MARTIN 30 6 JAMES 30 6 TURNER 30 6 WARD 30 6
This query still returns 14 rows, but now the COUNT is performed for each department as a result of the PARTITION BY DEPTNO clause. Each employee in the same department (in the same partition) will have the same value for CNT, because the aggregation will not reset (recompute) until a new department is encountered. Also note that you are returning information about each group, along with the members of each group. You can think of the preceding query as a more efficient version of the following:
select e.ename,
e.deptno,
(select count(*) from emp d
where e.deptno=d.deptno) as cnt
from emp e
order by 2
ENAME DEPTNO CNT ---------- ------ ------ CLARK 10 3 KING 10 3 MILLER 10 3 SMITH 20 5 ADAMS 20 5 FORD 20 5 SCOTT 20 5 JONES 20 5 ALLEN 30 6 BLAKE 30 6 MARTIN 30 6 JAMES 30 6 TURNER 30 6 WARD 30 6
Additionally, what’s nice about the PARTITION BY clause is that it performs its computations independently of other window functions, partitioning by different columns in the same SELECT statement. Consider the following query, which returns each employee, her department, the number of employees in her respective department, her job, and the number of employees with the same job:
select ename,
deptno,
count(*) over(partition by deptno) as dept_cnt,
job,
count(*) over(partition by job) as job_cnt
from emp
order by 2
ENAME DEPTNO DEPT_CNT JOB JOB_CNT ---------- ------ -------- --------- ------- MILLER 10 3 CLERK 4 CLARK 10 3 MANAGER 3 KING 10 3 PRESIDENT 1 SCOTT 20 5 ANALYST 2 FORD 20 5 ANALYST 2 SMITH 20 5 CLERK 4 JONES 20 5 MANAGER 3 ADAMS 20 5 CLERK 4 JAMES 30 6 CLERK 4 MARTIN 30 6 SALESMAN 4 TURNER 30 6 SALESMAN 4 WARD 30 6 SALESMAN 4 ALLEN 30 6 SALESMAN 4 BLAKE 30 6 MANAGER 3
In this result set, you can see that employees in the same department have the same value for DEPT_CNT, and that employees who have the same job position have the same value for JOB_CNT.
By now it should be clear that the PARTITION BY clause works like a GROUP BY clause, but it does so without being affected by the other items in the SELECT clause and without requiring you to write a GROUP BY clause.
Like the GROUP BY clause, the PARTITION BY clause lumps all the NULLs into one group or partition. Thus, the effect from NULLs when using PARTITION BY is similar to that from using GROUP BY. The following query uses a window function to count the number of employees with each distinct commission (returning–1 in place of NULL for readability):
select coalesce(comm,-1) as comm,
count(*)over(
partition by comm) as cnt
from emp
COMM CNT ------ ---------- 0 1 300 1 500 1 1400 1 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10
Because COUNT(*) is used, the function counts rows. You can see that there are 10 employees having NULL commissions. Use COMM instead of *, however, and you get quite different results:
select coalesce(comm,-1) as comm,
count(comm)over(partition by comm) as cnt
from emp
COMM CNT ---- ---------- 0 1 300 1 500 1 1400 1 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0
This query uses COUNT(COMM), which means that only the non-NULL values in the COMM column are counted. There is one employee with a commission of 0, one employee with a commission of 300, and so forth. But notice the counts for those with NULL commissions! Those counts are 0. Why? Because aggregate functions ignore NULL values, or more accurately, aggregate functions count only non-NULL values.
When using COUNT, consider whether you wish to include NULLs. Use COUNT(column) to avoid counting NULLs. Use COUNT(*) if you do wish to include NULLs (since you are no longer counting actual column values, you are counting rows).
Sometimes the order in which rows are treated by a window function is material to the results that you wish to obtain from a query. For this reason, window function syntax includes an ORDER BY subclause that you can place within an OVER clause. Use the ORDER BY clause to specify how the rows are ordered with a partition (remember, “partition” in the absence of a PARTITION BY clause means the entire result set).
Some window functions require you to impose order on the partitions of rows being affected. Thus, for some window functions an ORDER BY clause is mandatory. At the time of this writing, SQL Server does not allow ORDER BY in the OVER clause when used with aggregate window functions. SQL Server does permit ORDER BY in the OVER clause when used with window ranking functions.
When you use an ORDER BY clause in the OVER clause of a window function you are specifying two things:
How the rows in the partition are ordered
What rows are included in the computation
Consider the following query, which sums and computes a running total of salaries for employees in DEPTNO 10:
select deptno,
ename,
hiredate,
sal,
sum(sal)over(partition by deptno) as total1,
sum(sal)over() as total2,
sum(sal)over(order by hiredate) as running_total
from emp
where deptno=10
DEPTNO ENAME HIREDATE SAL TOTAL1 TOTAL2 RUNNING_TOTAL ------ ------ ----------- ----- ------ ------ ------------- 10 CLARK 09-JUN-1981 2450 8750 8750 2450 10 KING 17-NOV-1981 5000 8750 8750 7450 10 MILLER 23-JAN-1982 1300 8750 8750 8750
Just to keep you on your toes, I’ve included a sum with empty parentheses. Notice how TOTAL1 and TOTAL2 have the same values. Why? Once again, the order in which window functions are evaluated answers the question. The WHERE clause filters the result set such that only salaries from DEPTNO 10 are considered for summation. In this case there is only one partition—the entire result set, which consists of only salaries from DEPTNO 10. Thus TOTAL1 and TOTAL2 are the same.
Looking at the values returned by column SAL, you can easily see where the values for RUNNING_TOTAL come from. You can eyeball the values and add them yourself to compute the running total. But more importantly, why did including an ORDER BY in the OVER clause create a running total in the first place? The reason is, when you use ORDER BY in the OVER clause you are specify a default “moving” or “sliding” window within the partition even though you don’t see it. The ORDER BY HIREDATE clause terminates summation at the HIREDATE in the current row.
The following query is the same as the previous one, but uses the RANGE BETWEEN clause (which you’ll learn more about later) to explicitly specify the default behavior that results from ORDER BY HIREDATE:
select deptno,
ename,
hiredate,
sal,
sum(sal)over(partition by deptno) as total1,
sum(sal)over() as total2,
sum(sal)over(order by hiredate
range between unbounded preceding
and current row) as running_total
from emp
where deptno=10
DEPTNO ENAME HIREDATE SAL TOTAL1 TOTAL2 RUNNING_TOTAL ------ ------ ----------- ----- ------ ------ ------------- 10 CLARK 09-JUN-1981 2450 8750 8750 2450 10 KING 17-NOV-1981 5000 8750 8750 7450 10 MILLER 23-JAN-1982 1300 8750 8750 8750
The RANGE BETWEEN clause that you see in this query is termed the framing clause by ANSI and we’ll use that term here. Now, it should be easy to see why specifying an ORDER BY in the OVER clause created a running total; we’ve (by default) told the query to sum all rows starting from the current row and include all prior rows (“prior” as defined in the ORDER BY, in this case ordering the rows by HIREDATE).
Let’s apply the framing clause from the preceding query to the result set, starting with the first employee hired, who is named CLARK.
Starting with CLARK’s salary, 2450, and including all employees hired before CLARK, compute a sum. Since CLARK was the first employee hired in DEPTNO 10, the sum is simply CLARK’s salary, 2450, which is the first value returned by RUNNING_TOTAL.
Let’s move to the next employee based on HIREDATE, named KING, and apply the framing clause once again. Compute a sum on SAL starting with the current row, 5000 (KING’s salary), and include all prior rows (all employees hired before KING). CLARK is the only one hired before KING so the sum is 5000 + 2450, which is 7450, the second value returned by RUNNING_TOTAL.
Moving on to MILLER, the last employee in the partition based on HIREDATE, let’s one more time apply the framing clause. Compute a sum on SAL starting with the current row, 1300 (MILLER’s salary), and include all prior rows (all employees hired before MILLER). CLARK and KING were both hired before MILLER, and thus their salaries are included in MILLER’s RUNNING_TOTAL: 2450 + 5000 + 1300 is 8750, which is the value for RUNNING_TOTAL for MILLER.
As you can see, it is really the framing clause that produces the running total. The ORDER BY defines the order of evaluation and happens to also imply a default framing.
In general, the framing clause allows you to define different “sub-windows” of data to include in your computations. There are many ways to specify such sub-windows. Consider the following query:
select deptno,
ename,
sal,
sum(sal)over(order by hiredate
range between unbounded preceding
and current row) as run_total1,
sum(sal)over(order by hiredate
rows between 1 preceding
and current row) as run_total2,
sum(sal)over(order by hiredate
range between current row
and unbounded following) as run_total3,
sum(sal)over(order by hiredate
rows between current row
and 1 following) as run_total4
from emp
where deptno=10
DEPTNO ENAME SAL RUN_TOTAL1 RUN_TOTAL2 RUN_TOTAL3 RUN_TOTAL4 ------ ------ ----- ---------- ---------- ---------- ---------- 10 CLARK 2450 2450 2450 8750 7450 10 KING 5000 7450 7450 6300 6300 10 MILLER 1300 8750 6300 1300 1300
Don’t be intimidated here; this query is not as bad as it looks. You’ve already seen RUN_TOTAL1 and the effects of the framing clause “UNBOUNDED PRECEDING AND CURRENT ROW”. Here’s a quick description of what’s happening in the other examples:
Rather than the keyword RANGE, this framing clause specifies ROWS, which means the frame, or window, is going to be constructed by counting some number of rows. The 1 PRECEDING means that the frame will begin with the row immediately preceding the current row. The range continues through the CUR-RENT ROW. So what you get in RUN_TOTAL2 is the sum of the current employee’s salary and that of the preceding employee, based on HIREDATE.
It so happens that RUN_TOTAL1 and RUN_TOTAL2 are the same for both CLARK and KING. Why? Think about which values are being summed for each of those employees, for each of the two window functions. Think carefully, and you’ll get the answer.
The window function for RUN_TOTAL3 works just the opposite of that for RUN_TOTAL1; rather than starting with the current row and including all prior rows in the summation, summation begins with the current row and includes all subsequent rows in the summation.
Is inverse of RUN_TOTAL2; rather than starting from the current row and including one prior row in the summation, start with the current row and include one subsequent row in the summation.
If you can understand what’s been explained thus far, you will have no problem with any of the recipes in this book. If you’re not catching on, though, try practicing with your own examples and your own data. It’s usually easier to learn by coding new features rather than just reading about them.
As a final example of the effect of the framing clause on query output, consider the following query:
select ename,
sal,
min(sal)over(order by sal) min1,
max(sal)over(order by sal) max1,
min(sal)over(order by sal
range between unbounded preceding
and unbounded following) min2,
max(sal)over(order by sal
range between unbounded preceding
and unbounded following) max2,
min(sal)over(order by sal
range between current row
and current row) min3,
max(sal)over(order by sal
range between current row
and current row) max3,
max(sal)over(order by sal
rows between 3 preceding
and 3 following) max4
from emp
ENAME SAL MIN1 MAX1 MIN2 MAX2 MIN3 MAX3 MAX4 ------ ----- ------ ------ ------ ------ ------ ------ ------ SMITH 800 800 800 800 5000 800 800 1250 JAMES 950 800 950 800 5000 950 950 1250 ADAMS 1100 800 1100 800 5000 1100 1100 1300 WARD 1250 800 1250 800 5000 1250 1250 1500 MARTIN 1250 800 1250 800 5000 1250 1250 1600 MILLER 1300 800 1300 800 5000 1300 1300 2450 TURNER 1500 800 1500 800 5000 1500 1500 2850 ALLEN 1600 800 1600 800 5000 1600 1600 2975 CLARK 2450 800 2450 800 5000 2450 2450 3000 BLAKE 2850 800 2850 800 5000 2850 2850 3000 JONES 2975 800 2975 800 5000 2975 2975 5000 SCOTT 3000 800 3000 800 5000 3000 3000 5000 FORD 3000 800 3000 800 5000 3000 3000 5000 KING 5000 800 5000 800 5000 5000 5000 5000
OK, let’s break this query down:
The window function generating this column does not specify a framing clause, so the default framing clause of UNBOUNDED PRECEDING AND CURRENT ROW kicks in. Why is MIN1 800 for all rows? It’s because the lowest salary comes first (ORDER BY SAL), and it remains the lowest, or minimum, salary forever after.
The values for MAX1 are much different from those for MIN1. Why? The answer (again) is the default framing clause UNBOUNDED PRECEDING AND CURRENT ROW. In conjunction with ORDER BY SAL, this framing clause ensures that the maximum salary will also correspond to that of the current row.
Consider the first row, for SMITH. When evaluating SMITH’s salary and all prior salaries, MAX1 for SMITH is SMITH’s salary, because there are no prior salaries. Moving on to the next row, JAMES, when comparing JAMES’ salary to all prior salaries, in this case comparing to the salary of SMITH, JAMES’ salary is the higher of the two, and thus it is the maximum. If you apply this logic to all rows, you will see that the value of MAX1 for each row is the current employee’s salary.
The framing clause given for these is UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, which is the same as specifying empty parentheses. Thus, all rows in the result set are considered when computing MIN and MAX. As you might expect, the MIN and MAX values for the entire result set are constant, and thus the value of these columns is constant as well.
The framing clause for these is CURRENT ROW AND CURRENT ROW, which simply means use only the current employee’s salary when looking for the MIN and MAX salary. Thus both MIN3 and MAX3 are the same as SAL for each row. That was easy, wasn’t it?
The framing clause defined for MAX4 is 3 PRECEDING AND 3 FOLLOWING, which means, for every row, consider the three rows prior and the three rows after the current row, as well as the current row itself. This particular invocation of MAX(SAL) will return from those rows the highest salary value.
If you look at the value of MAX4 for employee MARTIN you can see how the framing clause is applied. MARTIN’s salary is 1250 and the three employee salaries prior to MARTIN’s are WARD’s (1250), ADAMS’ (1100) and JAMES’ (950). The three employee salaries after MARTIN’s are MILLER’s (1300), TURNER’s (1500), and ALLEN’s (1600). Out of all those salaries, including MARTIN’s, the highest is ALLEN’s, and thus the value of MAX4 for MARTIN is 1600.
As you can see, window functions are extremely powerful as they allow you to write queries that contain both detailed and aggregate information. Using window functions allows you to write smaller, more efficient queries as compared to using multiple self join and/or scalar subqueries. Consider the following query, which easily answers all of the following questions: “What is the number of employees in each department? How many different types of employees are in each department (e.g., how many clerks are in department 10)? How many total employees are in table EMP?”
select deptno,
job,
count(*) over (partition by deptno) as emp_cnt,
count(job) over (partition by deptno,job) as job_cnt,
count(*) over () as total
from emp
DEPTNO JOB EMP_CNT JOB_CNT TOTAL ------ --------- ---------- ---------- ---------- 10 CLERK 3 1 14 10 MANAGER 3 1 14 10 PRESIDENT 3 1 14 20 ANALYST 5 2 14 20 ANALYST 5 2 14 20 CLERK 5 2 14 20 CLERK 5 2 14 20 MANAGER 5 1 14 30 CLERK 6 1 14 30 MANAGER 6 1 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14
To return the same result set without using window functions would require a bit more work:
select a.deptno, a.job,
(select count(*) from emp b
where b.deptno = a.deptno) as emp_cnt,
(select count(*) from emp b
where b.deptno = a.deptno and b.job = a.job) as job_cnt,
(select count(*) from emp) as total
from emp a
order by 1,2
DEPTNO JOB EMP_CNT JOB_CNT TOTAL ------ --------- ---------- ---------- ---------- 10 CLERK 3 1 14 10 MANAGER 3 1 14 10 PRESIDENT 3 1 14 20 ANALYST 5 2 14 20 ANALYST 5 2 14 20 CLERK 5 2 14 20 CLERK 5 2 14 20 MANAGER 5 1 14 30 CLERK 6 1 14 30 MANAGER 6 1 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14
The non-window solution is obviously not difficult to write, yet it certainly is not as clean or efficient (you won’t see performance differences with a 14-row table, but try these queries with, say, a 1,000- or 10,000-row table and then you’ll see the benefit of using window functions over multiple self joins and scalar subqueries).
Besides readability and performance, window functions are useful for providing a “base” for more complex “report style” queries. For example, consider the following “report style” query that uses window functions in an inline view and then aggregates the results in an outer query. Using window functions allows you to return detailed as well as aggregate data, which is useful for reports. The query below uses window functions to find counts using different partitions. Because the aggregation is applied to multiple rows, the inline view returns all rows from EMP, which the outer CASE expressions can use to transpose and create a formatted report:
select deptno,
emp_cnt as dept_total,
total,
max(case when job = 'CLERK'
then job_cnt else 0 end) as clerks,
max(case when job = 'MANAGER'
then job_cnt else 0 end) as mgrs,
max(case when job = 'PRESIDENT'
then job_cnt else 0 end) as prez,
max(case when job = 'ANALYST'
then job_cnt else 0 end) as anals,
max(case when job = 'SALESMAN'
then job_cnt else 0 end) as smen
from (
select deptno,
job,
count(*) over (partition by deptno) as emp_cnt,
count(job) over (partition by deptno,job) as job_cnt,
count(*) over () as total
from emp
) x
group by deptno, emp_cnt, total
DEPTNO DEPT_TOTAL TOTAL CLERKS MGRS PREZ ANALS SMEN ------ ---------- ----- ------ ---- ---- ----- ---- 10 3 14 1 1 1 0 0 20 5 14 2 1 0 2 0 30 6 14 1 1 0 0 4
The query above returns each department, the total number of employees in each department, the total number of employees in table EMP, and a breakdown of the number of different job types in each department. All this is done in one query, without additional joins or temp tables!
As a final example of how easily multiple questions can be answered using window functions, consider the following query:
select ename as name,
sal,
max(sal)over(partition by deptno) as hiDpt,
min(sal)over(partition by deptno) as loDpt,
max(sal)over(partition by job) as hiJob,
min(sal)over(partition by job) as loJob,
max(sal)over() as hi,
min(sal)over() as lo,
sum(sal)over(partition by deptno
order by sal,empno) as dptRT,
sum(sal)over(partition by deptno) as dptSum,
sum(sal)over() as ttl
from emp
order by deptno,dptRT
NAME SAL HIDPT LODPT HIJOB LOJOB HI LO DPTRT DPTSUM TTL ------ ----- ----- ----- ----- ----- ----- ---- ------ ------ ------ MILLER 1300 5000 1300 1300 800 5000 800 1300 8750 29025 CLARK 2450 5000 1300 2975 2450 5000 800 3750 8750 29025 KING 5000 5000 1300 5000 5000 5000 800 8750 8750 29025 SMITH 800 3000 800 1300 800 5000 800 800 10875 29025 ADAMS 1100 3000 800 1300 800 5000 800 1900 10875 29025 JONES 2975 3000 800 2975 2450 5000 800 4875 10875 29025 SCOTT 3000 3000 800 3000 3000 5000 800 7875 10875 29025 FORD 3000 3000 800 3000 3000 5000 800 10875 10875 29025 JAMES 950 2850 950 1300 800 5000 800 950 9400 29025 WARD 1250 2850 950 1600 1250 5000 800 2200 9400 29025 MARTIN 1250 2850 950 1600 1250 5000 800 3450 9400 29025 TURNER 1500 2850 950 1600 1250 5000 800 4950 9400 29025 ALLEN 1600 2850 950 1600 1250 5000 800 6550 9400 29025 BLAKE 2850 2850 950 2975 2450 5000 800 9400 9400 29025
This query answers the following questions easily, efficiently, and readably (and without additional joins to EMP!). Simply match the employee and her salary with the different rows in the result set to determine:
who makes the highest salary of all employees (HI)
who makes the lowest salary of all employees (LO)
who makes the highest salary in her department (HIDPT)
who makes the lowest salary in her department (LODPT)
who makes the highest salary in her job (HIJOB)
who makes the lowest salary in her job (LOJOB)
what is the sum of all salaries (TTL)
what is the sum of salaries per department (DPTSUM)
what is the running total of all salaries per department (DPTRT)