This chapter is about “everyday” queries that involve ranges. Ranges are common in everyday life. For example, projects that we work on range over consecutive periods of time. In SQL, it’s often necessary to search for ranges, or to generate ranges, or to otherwise manipulate range-based data. The queries you’ll read about here are slightly more involved than the queries found in the preceding chapters, but they are just as common, and they’ll begin to give you a sense of what SQL can really do for you when you learn to take full advantage of it.
You want to determine which rows represent a range of consecutive projects. Consider the following result set from view V, which contains data about a project and its start and end dates:
select * from V PROJ_ID PROJ_START PROJ_END ------- ----------- ----------- 1 01-JAN-2020 02-JAN-2020 2 02-JAN-2020 03-JAN-2020 3 03-JAN-2020 04-JAN-2020 4 04-JAN-2020 05-JAN-2020 5 06-JAN-2020 07-JAN-2020 6 16-JAN-2020 17-JAN-2020 7 17-JAN-2020 18-JAN-2020 8 18-JAN-2020 19-JAN-2020 9 19-JAN-2020 20-JAN-2020 10 21-JAN-2020 22-JAN-2020 11 26-JAN-2020 27-JAN-2020 12 27-JAN-2020 28-JAN-2020 13 28-JAN-2020 29-JAN-2020 14 29-JAN-2020 30-JAN-2020
Excluding the first row, each row’s PROJ_START should equal the PROJ_END of the row before it (“before” is defined as PROJ_ID–1 for the current row). Examining the first five rows from view V, PROJ_IDs 1 through 3 are part of the same “group” as each PROJ_END equals the PROJ_START of the row after it. Because you want to find the range of dates for consecutive projects, you would like to return all rows where the current PROJ_END equals the next row’s PROJ_START. If the first five rows comprised the entire result set, you would like to return only the first three rows. The final result set (using all 14 rows from view V) should be:
PROJ_ID PROJ_START PROJ_END ------- ----------- ----------- 1 01-JAN-2020 02-JAN-2020 2 02-JAN-2020 03-JAN-2020 3 03-JAN-2020 04-JAN-2020 6 16-JAN-2020 17-JAN-2020 7 17-JAN-2020 18-JAN-2020 8 18-JAN-2020 19-JAN-2020 11 26-JAN-2020 27-JAN-2020 12 27-JAN-2020 28-JAN-2020 13 28-JAN-2020 29-JAN-2020
The rows with PROJ_IDs 4,5,9,10, and 14 are excluded from this result set because the PROJ_END of each of these rows does not match the PROJ_START of the row following it.
This solution takes best advantage of the window function LEAD OVER to look at the “next” row’s BEGIN_DATE, thus avoiding the need to self join, which was necessary before window functions were widely introduced:
1 select proj_id, proj_start, proj_end 2 from ( 3 select proj_id, proj_start, proj_end, 4 lead(proj_start)over(order by proj_id) next_proj_start 5 from V 6 ) alias 7 where next_proj_start = proj_end
Although it is possible to develop a solution using a self-join, the window function LEAD OVER is perfect for this type of problem, and more intuitive. The function LEAD OVER allows you to examine other rows without performing a self join (though the function must impose order on the result set to do so). Consider the results of the inline view (lines 3–5) for IDs 1 and 4:
select *
from (
select proj_id, proj_start, proj_end,
lead(proj_start)over(order by proj_id) next_proj_start
from v
)
where proj_id in ( 1, 4 )
PROJ_ID PROJ_START PROJ_END NEXT_PROJ_START ------- ----------- ----------- --------------- 1 01-JAN-2020 02-JAN-2020 02-JAN-2020 4 04-JAN-2020 05-JAN-2020 06-JAN-2020
Examining the above snippet of code and its result set, it is particularly easy to see why PROJ_ID 4 is excluded from the final result set of the complete solution. It’s excluded because its PROJ_END date of 05-JAN-2005 does not match the “next” project’s start date of 06-JAN-2005.
The function LEAD OVER is extremely handy when it comes to problems such as this one, particularly when examining partial results. When working with window functions, keep in mind that they are evaluated after the FROM and WHERE clauses, so the LEAD OVER function in the preceding query must be embedded within an inline view. Otherwise the LEAD OVER function is applied to the result set after the WHERE clause has filtered out all rows except for PROJ_ID’s 1 and 4.
Now, depending on how you view the data, you may very well want to include PROJ_ID 4 in the final result set. Consider the first five rows from view V:
select *
from V
where proj_id <= 5
PROJ_ID PROJ_START PROJ_END ------- ----------- ----------- 1 01-JAN-2020 02-JAN-2020 2 02-JAN-2020 03-JAN-2020 3 03-JAN-2020 04-JAN-2020 4 04-JAN-2020 05-JAN-2020 5 06-JAN-2020 07-JAN-2020
If your requirement is such that PROJ_ID 4 is in fact contiguous (because PROJ_ START for PROJ_ID 4 matches PROJ_END for PROJ_ID 3), and that only PROJ_ ID 5 should be discarded, the proposed solution for this recipe is incorrect (!), or at the very least, incomplete:
select proj_id, proj_start, proj_end
from (
select proj_id, proj_start, proj_end,
lead(proj_start)over(order by proj_id) next_start
from V
where proj_id <= 5
)
where proj_end = next_start
PROJ_ID PROJ_START PROJ_END ------- ----------- ----------- 1 01-JAN-2020 02-JAN-2020 2 02-JAN-2020 03-JAN-2020 3 03-JAN-2020 04-JAN-2020
If you believe PROJ_ID 4 should be included, simply add LAG OVER to the query and use an additional filter in the WHERE clause:
select proj_id, proj_start, proj_end
from (
select proj_id, proj_start, proj_end,
lead(proj_start)over(order by proj_id) next_start,
lag(proj_end)over(order by proj_id) last_end
from V
where proj_id <= 5
)
where proj_end = next_start
or proj_start = last_end
PROJ_ID PROJ_START PROJ_END ------- ----------- ----------- 1 01-JAN-2005 02-JAN-2005 2 02-JAN-2005 03-JAN-2005 3 03-JAN-2005 04-JAN-2005 4 04-JAN-2005 05-JAN-2005
Now PROJ_ID 4 is included in the final result set, and only the evil PROJ_ID 5 is excluded. Please consider your exact requirements when applying these recipes to your code.
You want to return the DEPTNO, ENAME, and SAL of each employee along with the difference in SAL between employees in the same department (i.e., having the same value for DEPTNO). The difference should be between each current employee and the employee hired immediately afterwards (you want to see if there is a correlation between seniority and salary on a “per department” basis). For each employee hired last in his department, return “N/A” for the difference. The result set should look like this:
DEPTNO ENAME SAL HIREDATE DIFF ------ ---------- ---------- ----------- ---------- 10 CLARK 2450 09-JUN-1981 -2550 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 N/A 20 SMITH 800 17-DEC-1980 -2175 20 JONES 2975 02-APR-1981 -25 20 FORD 3000 03-DEC-1981 0 20 SCOTT 3000 09-DEC-1982 1900 20 ADAMS 1100 12-JAN-1983 N/A 30 ALLEN 1600 20-FEB-1981 350 30 WARD 1250 22-FEB-1981 -1600 30 BLAKE 2850 01-MAY-1981 1350 30 TURNER 1500 08-SEP-1981 250 30 MARTIN 1250 28-SEP-1981 300 30 JAMES 950 03-DEC-1981 N/A
The is another example of where the window functions LEAD OVER and LAG OVER come in handy. You can easily access next and prior rows without additional joins. Alternative methods, such as subqueries or self-joins are possible but awkward.
1 with next_sal_tab (deptno,ename,sal,hiredate,next_sal) 2 as 3 (select deptno, ename, sal, hiredate, 4 lead(sal)over(partition by deptno 5 order by hiredate) as next_sal 6 from emp ) 7 8 select deptno, ename, sal, hiredate 9 , coalesce(cast(sal-next_sal as char), 'N/A') as diff 10 from next_sal_tab
In this case, for the sake of variety, we have used a common table expression rather than a subquery - both will work across most RDBMS’s these days, with the preference usually relating to readability.
The first step is to use the LEAD OVER window function to find the “next” salary for each employee within her department. The employees hired last in each department will have a NULL value for NEXT_SAL:
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno order by hiredate) as next_sal
from emp
DEPTNO ENAME SAL HIREDATE NEXT_SAL ------ ---------- ---------- ----------- ---------- 10 CLARK 2450 09-JUN-1981 5000 10 KING 5000 17-NOV-1981 1300 10 MILLER 1300 23-JAN-1982 20 SMITH 800 17-DEC-1980 2975 20 JONES 2975 02-APR-1981 3000 20 FORD 3000 03-DEC-1981 3000 20 SCOTT 3000 09-DEC-1982 1100 20 ADAMS 1100 12-JAN-1983 30 ALLEN 1600 20-FEB-1981 1250 30 WARD 1250 22-FEB-1981 2850 30 BLAKE 2850 01-MAY-1981 1500 30 TURNER 1500 08-SEP-1981 1250 30 MARTIN 1250 28-SEP-1981 950 30 JAMES 950 03-DEC-1981
The next step is to take the difference between each employee’s salary and the salary of the employee hired immediately after her in the same department:
select deptno,ename,sal,hiredate, sal-next_sal diff
from (
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno order by hiredate) next_sal
from emp
)
DEPTNO ENAME SAL HIREDATE DIFF ------ ---------- ---------- ----------- ---------- 10 CLARK 2450 09-JUN-1981 -2550 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 20 SMITH 800 17-DEC-1980 -2175 20 JONES 2975 02-APR-1981 -25 20 FORD 3000 03-DEC-1981 0 20 SCOTT 3000 09-DEC-1982 1900 20 ADAMS 1100 12-JAN-1983 30 ALLEN 1600 20-FEB-1981 350 30 WARD 1250 22-FEB-1981 -1600 30 BLAKE 2850 01-MAY-1981 1350 30 TURNER 1500 08-SEP-1981 250 30 MARTIN 1250 28-SEP-1981 300 30 JAMES 950 03-DEC-1981
The next step is to use the coalesce function to insert “N/A” when there is no next salary. To be able to return “N/A” you must cast the value of DIFF to a string:
select deptno,ename,sal,hiredate,
nvl(to_char(sal-next_sal),'N/A') diff
from (
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno order by hiredate) next_sal
from emp
)
DEPTNO ENAME SAL HIREDATE DIFF ------ ---------- ---------- ----------- --------------- 10 CLARK 2450 09-JUN-1981 -2550 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 N/A 20 SMITH 800 17-DEC-1980 -2175 20 JONES 2975 02-APR-1981 -25 20 FORD 3000 03-DEC-1981 0 20 SCOTT 3000 09-DEC-1982 1900 20 ADAMS 1100 12-JAN-1983 N/A 30 ALLEN 1600 20-FEB-1981 350 30 WARD 1250 22-FEB-1981 -1600 30 BLAKE 2850 01-MAY-1981 1350 30 TURNER 1500 08-SEP-1981 250 30 MARTIN 1250 28-SEP-1981 300 30 JAMES 950 03-DEC-1981 N/A
While the majority of the solutions provided in this book do not deal with “what if” scenarios (for the sake of readability and the author’s sanity), the scenario involving duplicates when using the LEAD OVER function in this manner must be discussed. In the simple sample data in table EMP, no employees have duplicate HIREDATEs, yet this is a very likely situation. Normally, I would not discuss a “what if” situation such as duplicates (since there aren’t any in table EMP), but the workaround involving LEAD may not be immediately obvious. Consider the following query, which returns the difference in SAL between the employees in DEPTNO 10 (the difference is performed in the order in which they were hired):
select deptno,ename,sal,hiredate,
lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff
from (
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno
order by hiredate) next_sal
from emp
where deptno=10 and empno > 10
)
DEPTNO ENAME SAL HIREDATE DIFF ------ ------ ----- ----------- ---------- 10 CLARK 2450 09-JUN-1981 -2550 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 N/A
This solution is correct considering the data in table EMP but, if there were duplicate rows, the solution would fail. Consider the example below, showing four more employees hired on the same day as KING:
insert into emp (empno,ename,deptno,sal,hiredate)
values (1,'ant',10,1000,to_date('17-NOV-1981'))
insert into emp (empno,ename,deptno,sal,hiredate)
values (2,'joe',10,1500,to_date('17-NOV-1981'))
insert into emp (empno,ename,deptno,sal,hiredate)
values (3,'jim',10,1600,to_date('17-NOV-1981'))
insert into emp (empno,ename,deptno,sal,hiredate)
values (4,'jon',10,1700,to_date('17-NOV-1981'))
select deptno,ename,sal,hiredate,
lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff
from (
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno
order by hiredate) next_sal
from emp
where deptno=10
)
DEPTNO ENAME SAL HIREDATE DIFF ------ ------ ----- ----------- ---------- 10 CLARK 2450 09-JUN-1981 1450 10 ant 1000 17-NOV-1981 -500 10 joe 1500 17-NOV-1981 -3500 10 KING 5000 17-NOV-1981 3400 10 jim 1600 17-NOV-1981 -100 10 jon 1700 17-NOV-1981 400 10 MILLER 1300 23-JAN-1982 N/A
You’ll notice that with the exception of employee JON, all employees hired on the same date (November 17) evaluate their salary against another employee hired on the same date! This is incorrect. All employees hired on November 17 should have the difference of salary computed against MILLER’s salary, not another employee hired on November 17. Take, for example, employee ANT. The value for DIFF for ANT is–500 because ANT’s SAL is compared with JOE’s SAL and is 500 less than JOE’s SAL, hence the value of–500. The correct value for DIFF for employee ANT should be–300 because ANT makes 300 less than MILLER, who is the next employee hired by HIREDATE. The reason the solution seems to not work is due to the default behavior of Oracle’s LEAD OVER function. By default, LEAD OVER only looks ahead one row. So, for employee ANT, the next SAL based on HIREDATE is JOE’s SAL, because LEAD OVER simply looks one row ahead and doesn’t skip duplicates. Fortunately, Oracle planned for such a situation and allows you to pass an additional parameter to LEAD OVER to determine how far ahead it should look. In the example above, the solution is simply a matter of counting: find the distance from each employee hired on November 17 to January 23 (MILLER’s HIREDATE). The solution below shows how to accomplish this:
select deptno,ename,sal,hiredate,
lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff
from (
select deptno,ename,sal,hiredate,
lead(sal,cnt-rn+1)over(partition by deptno
order by hiredate) next_sal
from (
select deptno,ename,sal,hiredate,
count(*)over(partition by deptno,hiredate) cnt,
row_number()over(partition by deptno,hiredate order by sal) rn
from emp
where deptno=10
)
)
DEPTNO ENAME SAL HIREDATE DIFF ------ ------ ----- ----------- ---------- 10 CLARK 2450 09-JUN-1981 1450 10 ant 1000 17-NOV-1981 -300 10 joe 1500 17-NOV-1981 200 10 jim 1600 17-NOV-1981 300 10 jon 1700 17-NOV-1981 400 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 N/A
Now the solution is correct. As you can see, all the employees hired on November 17 now have their salaries compared with MILLER’s salary. Inspecting the results, employee ANT now has a value of–300 for DIFF, which is what we were hoping for. If it isn’t immediately obvious, the expression passed to LEAD OVER; CNT-RN+1 is simply the distance from each employee hired on November 17 to MILLER. Consider the inline view below, which shows the values for CNT and RN:
select deptno,ename,sal,hiredate,
count(*)over(partition by deptno,hiredate) cnt,
row_number()over(partition by deptno,hiredate order by sal) rn
from emp
where deptno=10
DEPTNO ENAME SAL HIREDATE CNT RN ------ ------ ----- ----------- ---------- ---------- 10 CLARK 2450 09-JUN-1981 1 1 10 ant 1000 17-NOV-1981 5 1 10 joe 1500 17-NOV-1981 5 2 10 jim 1600 17-NOV-1981 5 3 10 jon 1700 17-NOV-1981 5 4 10 KING 5000 17-NOV-1981 5 5 10 MILLER 1300 23-JAN-1982 1 1
The value for CNT represents, for each employee with a duplicate HIREDATE, how many duplicates there are in total for their HIREDATE. The value for RN represents a ranking for the employees in DEPTNO 10. The rank is partitioned by DEPTNO and HIREDATE so only employees with a HIREDATE that another employee has will have a value greater than one. The ranking is sorted by SAL (this is arbitrary; SAL is convenient, but we could have just as easily chosen EMPNO). Now that you know how many total duplicates there are and you have a ranking of each duplicate, the distance to MILLER is simply the total number of duplicates minus the current rank plus one (CNT-RN+1). The results of the distance calculation and its effect on LEAD OVER are shown below:
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno
order by hiredate) incorrect,
cnt-rn+1 distance,
lead(sal,cnt-rn+1)over(partition by deptno
order by hiredate) correct
from (
select deptno,ename,sal,hiredate,
count(*)over(partition by deptno,hiredate) cnt,
row_number()over(partition by deptno,hiredate
order by sal) rn
from emp
where deptno=10
)
DEPTNO ENAME SAL HIREDATE INCORRECT DISTANCE CORRECT ------ ------ ----- ----------- ---------- ---------- ---------- 10 CLARK 2450 09-JUN-1981 1000 1 1000 10 ant 1000 17-NOV-1981 1500 5 1300 10 joe 1500 17-NOV-1981 1600 4 1300 10 jim 1600 17-NOV-1981 1700 3 1300 10 jon 1700 17-NOV-1981 5000 2 1300 10 KING 5000 17-NOV-1981 1300 1 1300 10 MILLER 1300 23-JAN-1982 1
Now you can clearly see the effect that you have when you pass the correct distance to LEAD OVER. The rows for INCORRECT represent the values returned by LEAD OVER using a default distance of one. The rows for CORRECT represent the values returned by LEAD OVER using the proper distance for each employee with a duplicate HIREDATE to MILLER. At this point, all that is left is to find the difference between CORRECT and SAL for each row, which has already been shown.
This recipe is an extension of the prior recipe , and it uses the same view V from the prior recipe. Now that you’ve located the ranges of consecutive values, you want to find just their start and end points. Unlike the prior recipe, if a row is not part of a set of consecutive values, you still want to return it. Why? Because such a row represents both the beginning and end of its range. Using the data from view V:
select *
from V
PROJ_ID PROJ_START PROJ_END ------- ----------- ----------- 1 01-JAN-2020 02-JAN-2020 2 02-JAN-2020 03-JAN-2020 3 03-JAN-2020 04-JAN-2020 4 04-JAN-2020 05-JAN-2020 5 06-JAN-2020 07-JAN-2020 6 16-JAN-2020 17-JAN-2020 7 17-JAN-2020 18-JAN-2020 8 18-JAN-2020 19-JAN-2020 9 19-JAN-2020 20-JAN-2020 10 21-JAN-2020 22-JAN-2020 11 26-JAN-2020 27-JAN-2020 12 27-JAN-2020 28-JAN-2020 13 28-JAN-2020 29-JAN-2020 14 29-JAN-2020 30-JAN-2020
you want the final result set to be:
PROJ_GRP PROJ_START PROJ_END -------- ----------- ----------- 1 01-JAN-2020 05-JAN-2020 2 06-JAN-2020 07-JAN-2020 3 16-JAN-2020 20-JAN-2020 4 21-JAN-2020 22-JAN-2020 5 26-JAN-2020 30-JAN-2020
This problem is a bit more involved than its predecessor. First, you must identify what the ranges are. A range of rows is defined by the values for PROJ_START and PROJ_END. For a row to be considered “consecutive” or part of a group, its PROJ_ START value must equal the PROJ_END value of the row before it. In the case where a row’s PROJ_START value does not equal the prior row’s PROJ_END value and its PROJ_END value does not equal the next row’s PROJ_START value, this is an instance of a single row group. Once you have identify the ranges, you need to be able to group the rows in these ranges together (into groups) and return only their start and end points.
Examine the first row of the desired result set. The PROJ_START is the PROJ_ START for PROJ_ID 1 from view V and the PROJ_END is the PROJ_END for PROJ_ID 4 from view V. Despite the fact that PROJ_ID 4 does not have a consecutive value following it, it is the last of a range of consecutive values, and thus it is included in the first group.
The most straight forward approach for this problem is to use the LAG OVER window function. Use LAG OVER to determine whether or not each prior row’s PROJ_END equals the current row’s PROJ_START to help place the rows into groups. Once they are grouped, use the aggregate functions MIN and MAX to find their start and end points:
1 select proj_grp, min(proj_start), max(proj_end) 2 from ( 3 select proj_id,proj_start,proj_end, 4 sum(flag)over(order by proj_id) proj_grp 5 from ( 6 select proj_id,proj_start,proj_end, 7 case when 8 lag(proj_end)over(order by proj_id) = proj_start 9 then 0 else 1 10 end flag 11 from V 12 ) alias1 13 ) alias2 14 group by proj_grp
==== Discussion
The window function LAG OVER is extremely useful in this situation. You can examine each prior row’s PROJ_END value without a self join, without a scalar sub-query, and without a view. The results of the LAG OVER function without the CASE expression are as follows:
select proj_id,proj_start,proj_end,
lag(proj_end)over(order by proj_id) prior_proj_end
from V
PROJ_ID PROJ_START PROJ_END PRIOR_PROJ_END ------- ----------- ----------- -------------- 1 01-JAN-2020 02-JAN-2020 2 02-JAN-2020 03-JAN-2020 02-JAN-2020 3 03-JAN-2020 04-JAN-2020 03-JAN-2020 4 04-JAN-2020 05-JAN-2020 04-JAN-2020 5 06-JAN-2020 07-JAN-2020 05-JAN-2020 6 16-JAN-2020 17-JAN-2020 07-JAN-2020 7 17-JAN-2020 18-JAN-2020 17-JAN-2020 8 18-JAN-2020 19-JAN-2020 18-JAN-2020 9 19-JAN-2020 20-JAN-2020 19-JAN-2020 10 21-JAN-2020 22-JAN-2020 20-JAN-2020 11 26-JAN-2020 27-JAN-2020 22-JAN-2020 12 27-JAN-2020 28-JAN-2020 27-JAN-2020 13 28-JAN-2020 29-JAN-2020 28-JAN-2020 14 29-JAN-2020 30-JAN-2020 29-JAN-2020
The CASE expression in the complete solution simply compares the value returned by LAG OVER to the current row’s PROJ_START value; if they are the same, return 0, else return 1. The next step is to create a running total on the 0’s and 1’s returned by the CASE expression to put each row into a group. The results of the running total can be seen below:
select proj_id,proj_start,proj_end,
sum(flag)over(order by proj_id) proj_grp
from (
select proj_id,proj_start,proj_end,
case when
lag(proj_end)over(order by proj_id) = proj_start
then 0 else 1
end flag
from V
)
PROJ_ID PROJ_START PROJ_END PROJ_GRP ------- ----------- ----------- ---------- 1 01-JAN-2020 02-JAN-2020 1 2 02-JAN-2020 03-JAN-2020 1 3 03-JAN-2020 04-JAN-2020 1 4 04-JAN-2020 05-JAN-2020 1 5 06-JAN-2020 07-JAN-2020 2 6 16-JAN-2020 17-JAN-2020 3 7 17-JAN-2020 18-JAN-2020 3 8 18-JAN-2020 19-JAN-2020 3 9 19-JAN-2020 20-JAN-2020 3 10 21-JAN-2020 22-JAN-2020 4 11 26-JAN-2020 27-JAN-2020 5 12 27-JAN-2020 28-JAN-2020 5 13 28-JAN-2020 29-JAN-2020 5 14 29-JAN-2020 30-JAN-2020 5
Now that each row has been placed into a group, simply use the aggregate functions MIN and MAX on PROJ_START and PROJ_END respectively, and group by the values created in the PROJ_GRP running total column.
You want to return the number of employees hired each year for the entire decade of the 1980s, but there are some years in which no employees were hired. You would like to return the following result set:
YR CNT ---- ---------- 1980 1 1981 10 1982 2 1983 1 1984 0 1985 0 1986 0 1987 0 1988 0 1989 0
The trick to this solution is returning zeros for years that saw no employees hired. If no employee was hired in a given year, then no rows for that year will exist in table EMP. If the year does not exist in the table, how can you return a count, any count, even zero? The solution requires you to outer join. You must supply a result set that returns all the years you want to see, and then perform a count against table EMP to see if there were any employees hired in each of those years.
Use table EMP as a pivot table (because it has 14 rows) and the built-in function YEAR to generate one row for each year in the decade of 1980. Outer join to table EMP and count how many employees were hired each year:
1 select x.yr, coalesce(y.cnt,0) cnt 2 from ( 3 select year(min(hiredate)over()) - 4 mod(year(min(hiredate)over()),10) + 5 row_number()over()-1 yr 6 from emp fetch first 10 rows only 7 ) x 8 left join 9 ( 10 select year(hiredate) yr1, count(*) cnt 11 from emp 12 group by year(hiredate) 13 ) y 14 on ( x.yr = y.yr1 )
1 select x.yr, coalesce(cnt,0) cnt 2 from ( 3 select extract(year from min(hiredate)over()) - 4 mod(extract(year from min(hiredate)over()),10) + 5 rownum-1 yr 6 from emp 7 where rownum <= 10 8 ) x 9 left join 10 ( 11 select to_number(to_char(hiredate,'YYYY')) yr, count(*) cnt 12 from emp 13 group by to_number(to_char(hiredate,'YYYY')) 14 ) y 15 on ( x.yr = y.yr )
Use table T10 as a pivot table (because it has 10 rows) and the built-in function EXTRACT to generate one row for each year in the decade of 1980. Outer join to table EMP and count how many employees were hired each year:
1 select y.yr, coalesce(x.cnt,0) as cnt 2 from ( 3 selectmin_year-mod(cast(min_year as int),10)+rn as yr 4 from ( 5 select (select min(extract(year from hiredate)) 6 from emp) as min_year, 7 id-1 as rn 8 from t10 9 ) a 10 ) y 11 left join 12 ( 13 select extract(year from hiredate) as yr, count(*) as cnt 14 from emp 15 group by extract(year from hiredate) 16 ) x 17 on ( y.yr = x.yr )
Use table EMP as a pivot table (because it has 14 rows) and the built-in function YEAR to generate one row for each year in the decade of 1980. Outer join to table EMP and count how many employees were hired each year:
1 select x.yr, coalesce(y.cnt,0) cnt 2 from ( 3 select top (10) 4 (year(min(hiredate)over()) - 5 year(min(hiredate)over())%10)+ 6 row_number()over(order by hiredate)-1 yr 7 from emp 8 ) x 9 left join 10 ( 11 select year(hiredate) yr, count(*) cnt 12 from emp 13 group by year(hiredate) 14 ) y 15 on ( x.yr = y.yr )
Despite the difference in syntax, the approach is the same for all solutions. Inline view X returns each year in the decade of the ’80s by first finding the year of the earliest HIREDATE. The next step is to add RN–1 to the difference between the earliest year and the earliest year modulus ten. To see how this works, simply execute inline view X and return each of the values involved separately. Listed below is the result set for inline view X using the window function MIN OVER (DB2, Oracle, SQL Server) and a scalar subquery (MySQL, PostgreSQL):
select year(min(hiredate)over()) -
mod(year(min(hiredate)over()),10) +
row_number()over()-1 yr,
year(min(hiredate)over()) min_year,
mod(year(min(hiredate)over()),10) mod_yr,
row_number()over()-1 rn
from emp fetch first 10 rows only
YR MIN_YEAR MOD_YR RN ---- ---------- ---------- ---------- 1980 1980 0 0 1981 1980 0 1 1982 1980 0 2 1983 1980 0 3 1984 1980 0 4 1985 1980 0 5 1986 1980 0 6 1987 1980 0 7 1988 1980 0 8 1989 1980 0 9select min_year-mod(min_year,10)+rn as yr,
min_year,
mod(min_year,10) as mod_yr
rn
from (
select (select min(extract(year from hiredate))
from emp) as min_year,
id-1 as rn
from t10
) x
YR MIN_YEAR MOD_YR RN ---- ---------- ---------- ---------- 1980 1980 0 0 1981 1980 0 1 1982 1980 0 2 1983 1980 0 3 1984 1980 0 4 1985 1980 0 5 1986 1980 0 6 1987 1980 0 7 1988 1980 0 8 1989 1980 0 9
Inline view Y returns the year for each HIREDATE and the number of employees hired during that year:
select year(hiredate) yr, count(*) cnt
from emp
group by year(hiredate)
YR CNT ----- ---------- 1980 1 1981 10 1982 2 1983 1
Finally, outer join inline view Y to inline view X so that every year is returned even if there are no employees hired.
You would like to have a “row source generator” available to you in your queries. Row source generators are useful for queries that require pivoting. For example, you want to return a result set such as the following, up to any number of rows that you specify:
ID --- 1 2 3 4 5 6 7 8 9 10 …
If your RDBMS provides built-in functions for returning rows dynamically, you do not need to create a pivot table in advance with a fixed number of rows. That’s why a dynamic row generator can be so handy. Otherwise, you must use a traditional pivot table with a fixed number of rows (that may not always be enough) to generate rows when needed.
This solution shows how to return 10 rows of increasing numbers starting from 1. You can easily adapt the solution to return any number of rows.
The ability to return increasing values from 1 opens the door to many other solutions. For example, you can generate numbers to add to dates in order to generate sequences of days. You can also use such numbers to parse through strings.
Use the recursive WITH clause to generate a sequence of rows with incrementing values. Using a recursive CTE will in fact work with the majority of RDBMS’s today.
1 with x (id) 2 as ( 3 select 1 4 union all 5 select id+1 6 from x 7 where id+1 <= 10 8 ) 9 select * from x
In Oracle Database you can generate rows using the MODEL clause:
1 select array id 2 from dual 3 model 4 dimension by (0 idx) 5 measures(1 array) 6 rules iterate (10) ( 7 array[iteration_number] = iteration_number+1 8 )
Use the very handy function GENERATE_SERIES, which is designed for the express purpose of generating rows:
1 select id 2 from generate_series (1, 10) x(id)
The recursive WITH clause increments ID (which starts at 1) until the WHERE clause is satisfied. To kick things off you must generate one row having the value 1. You can do this by selecting 1 from a one-row table or, in the case of DB2, by using the VALUES clause to create a one-row result set.
In the MODEL clause solution, there is an explicit ITERATE command that allows you to generate multiple rows. Without the ITERATE clause, only one row will be returned, since DUAL has only one row. For example:
select array id
from dual
model
dimension by (0 idx)
measures(1 array)
rules ()
ID -- 1
The MODEL clause not only allows you array access to rows, it allows you to easily “create” or return rows that are not in the table you are selecting against. In this solution, IDX is the array index (location of a specific value in the array) and ARRAY (aliased ID) is the “array” of rows. The first row defaults to 1 and can be referenced with ARRAY[0]. Oracle provides the function ITERATION_NUMBER so you can track the number of times you’ve iterated. The solution iterates 10 times, causing ITERATION_NUMBER to go from 0 to 9. Adding 1 to each of those values yields the results 1 through 10.
It may be easier to visualize what’s happening with the model clause if you execute the following query:
select 'array['||idx||'] = '||array as output
from dual
model
dimension by (0 idx)
measures(1 array)
rules iterate (10) (
array[iteration_number] = iteration_number+1
)
OUTPUT ------------------ array[0] = 1 array[1] = 2 array[2] = 3 array[3] = 4 array[4] = 5 array[5] = 6 array[6] = 7 array[7] = 8 array[8] = 9 array[9] = 10
All the work is done by the function GENERATE_SERIES. The function accepts three parameters, all numeric values. The first parameter is the start value, the second parameter is the ending value, and the third parameter is an optional “step” value (how much each value is incremented by). If you do not pass a third parameter, the increment defaults to 1.
The GENERATE_SERIES function is flexible enough so that you do not have to hardcode parameters. For example, if you wanted to return five rows starting from value 10 and ending with value 30, incrementing by 5 such that the result set is the following:
ID --- 10 15 20 25 30
you can be creative and do something like this:
select id from generate_series( (select min(deptno) from emp), (select max(deptno) from emp), 5 ) x(id)
Notice here that the actual values passed to GENERATE_SERIES are not known when the query is written. Instead, they are generated by subqueries when the main query executes.
Queries that take into account ranges are one of the most common requests from business users - they are a natural consquence of the way that businesses operate. At least some of the time, however, a degree of dexterity is needed to apply the range correctly - the recipes in this chapter should demonstrate how to apply that dexterity.