Advanced aggregation

There are several aggregating functions that are executed in a special way.

The first group of such aggregating functions are called ordered-set aggregates. They take into account not just the values of the argument expressions, but also their order. They are related to statistics and calculate percentile values.

The percentile is the value of a group in which the given percentage of other values is less than that. For example, if a value is at the 95th percentile, it means that it is higher than 95 percent of the other values. In PostgreSQL, one can calculate a continuous or discrete percentile using the functions percentile_cont or percentile_disc respectively. A discrete percentile is one of the actual values of a group, while a continuous percentile is an interpolated value between two actual values. It is possible to calculate the percentile for a given fraction, or several percentile values for a given array of fractions.

For example, this is the query regarding the distribution of the number of advertisements per car:

car_portal=> SELECT percentile_disc(ARRAY[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY cnt)
  FROM (SELECT count(*) cnt FROM car_portal_app.advertisement GROUP BY car_id) subq;
 percentile_disc
-----------------
 {2,3,5}
(1 row)

The result means that there are, at most, two advertisements for 25 percent of the cars, three advertisements for 50 percent of the cars, and five advertisements for 75 percent of the cars in the database.

The syntax of the ordered-set aggregating functions differs from the normal aggregates and uses a special construct, WITHIN GROUP (ORDER BY expression). The expression here is actually an argument of a function. Here, not just the order of the rows but the values of the expressions as well affect the result of the function. In contrast to the ORDER BY clause of the SELECT query, only one expression is possible here and no references to the output column numbers are allowed.

Another ordered-set aggregating function is mode. It returns the most frequently appearing value of the group. If two values appear the same number of times, then the first of them will be returned.

For example, the following query gets the ID of the most frequent car model in the database:

car_portal=> SELECT mode() WITHIN GROUP (ORDER BY car_model_id) FROM car_portal_app.car;
 mode
------
   64
(1 row)

To get the same result without this function will require self-join or ordering and limiting the result. Both are more expensive operations.

Another group of aggregates that use the same syntax are the hypothetical-set aggregating functions. They are rank, dense_rank, percent_rank, and cume_dist. There are window functions with the same names. Window functions take no argument and they return the result for the current row. Aggregate functions have no current row because they are evaluated for a group of rows. However, they take an argument: the value for the hypothetical current row.

For example, the aggregate function rank returns the rank of a given value in the ordered set as if that value existed in the set:

car_portal=> SELECT rank(2) WITHIN GROUP (ORDER BY a) FROM generate_series(1,10,3) a;
 rank
------
    2
(1 row)

In the preceding query, the value 2 does not exist in the output of generate_series (it returns 1..4..7..10). If it existed, it would take the second position in the output.

Another topic worth mentioning about aggregating functions is the FILTER clause.

The FILTER clause filters the rows that are passed to the particular aggregating function based on a given condition. For example, suppose it is required to count the number of cars in the database for each car model separately, for each number of doors. If one groups the records by these two fields, the result will be correct but not very convenient to use in reporting:

car_portal=> SELECT car_model_id, number_of_doors, count(*)
  FROM car_portal_app.car
  GROUP BY car_model_id, number_of_doors;
 car_model_id | number_of_doors | count
--------------+-----------------+-------
           47 |               4 |     1
           42 |               3 |     2
           76 |               5 |     1
           52 |               5 |     2
...

The FILTER clause makes the output much clearer:

car_portal=> SELECT car_model_id,
    count(*) FILTER (WHERE number_of_doors = 2) doors2,
    count(*) FILTER (WHERE number_of_doors = 3) doors3,
    count(*) FILTER (WHERE number_of_doors = 4) doors4,
    count(*) FILTER (WHERE number_of_doors = 5) doors5
  FROM car_portal_app.car GROUP BY car_model_id;
 car_model_id | doors2 | doors3 | doors4 | doors5
--------------+--------+--------+--------+--------
           43 |      0 |      0 |      0 |      2
            8 |      0 |      0 |      1 |      0
           11 |      0 |      2 |      1 |      0
           80 |      0 |      1 |      0 |      0
...

Note that the cars with a number of doors other than from 2 to 5 will not be counted by the query.

The same result can be achieved by calling functions, as follows:

count(CASE WHEN number_of_doors = 2 THEN 1 END) doors2

However, the FILTER clause is easier and shorter.

Table of Contents for Advanced aggregation

Create new playlist

Sign In

Sign Up

Table of Contents for
Advanced aggregation