Chapter 6. Advanced Query Writing

In this chapter, we will discuss some SQL features supported by PostgreSQL, that were not covered in the previous chapters. Some advanced techniques will also be described.

The same sample database as was used in the previous chapters is used in the code examples. It is recommended to recreate the sample database in order to get the same results as shown in the examples.

In this chapter, we are going to cover the following topics:

  • Common table expression
  • Window functions
  • Advanced SQL
  • Transaction isolation and Multiversion Concurrency Control

The code examples in this chapter are based on the car portal database described in the previous chapters. The scripts for creating the database and filling it with data (schema.sql and data.sql) can be found in the code bundle in the Chapter 6 folder.

All the code examples of this chapter can be found in the file examples.sql.

Common table expressions

Although SQL is a declarative language, it provides a way of implementing the logic of sequential execution of code or of reusing code. Common table expressions (CTE) is a feature that makes it possible to define a subquery, give it a name, and then use it at several places in the main query.

The simplified syntax diagram for CTE is as follows:

WITH <subquery name> AS (
<subquery code>)
[, ...]
SELECT <Select list> FROM <subquery name>;

In the preceding syntax diagram, subquery code is a query that returns results of which will be used in the primary query as if it was a real table. The subquery in parenthesis, after the keyword AS, is a common table expression. It can also be called a substatement or an auxiliary statement. The query after the WITH block is the primary or main query. The whole statement itself is called the WITH query. It is possible to use not only the SELECT statements in CTE but also the INSERT, UPDATE, or DELETE statements.

It is possible to use several CTEs in one WITH query. Every CTE has its name defined before the keyword AS. The main query can reference a CTE by its name. A CTE can also refer to another CTE by name. A CTE can refer only to the CTEs that were written before the referencing one.

The references to CTEs in the primary query can be treated as table names. In fact, PostgreSQL executes the CTEs only once, stores the results, and reuses them instead of executing subqueries each time they occur in the main query. This makes them similar to tables.

CTEs may help developers in organizing SQL code, as follows:

WITH pre_select AS
(
  SELECT car_id, number_of_owners, car_model_id
    FROM car_portal_app.car
    WHERE manufacture_year >= 2010
),
joined_data AS
(
  SELECT car_id, marke, model, number_of_owners
    FROM pre_select
      INNER JOIN car_portal_app.car_model
        ON  pre_select.car_model_id= car_model.car_model_id
),
minimal_owners AS
(
  SELECT min(number_of_owners) AS min_number_of_owners
    FROM pre_select
)
SELECT car_id, marke, model, number_of_owners
  FROM joined_data
    INNER JOIN minimal_owners
      ON joined_data.number_of_owners =
        minimal_owners.min_number_of_owners;
car_id |     marke     |   model    | number_of_owners
--------+---------------+------------+------------------
      2 | Opel          | Corsa      |                1
      3 | Citroen       | C3         |                1
     11 | Nissan        | GT-R       |                1
     36 | KIA           | Magentis   |                1
…
(25 rows)

In the preceding example, the logic part of the query is presented as a sequence of actions—filtering in pre_select and then joining in joined_data. The other part, that is, calculating the minimal number of owners, is executed in a dedicated subquery, minimal_owners. This makes the implementation of the query logic similar to that of an imperative programming language.

The use of CTEs in the preceding example does not make the whole query faster; however, there are situations where the use of CTEs can increase performance. Moreover, sometimes it is not possible to implement the logic in any other way but by using CTEs. In the following sections, some of those situations are discussed in detail.

The order of execution of the CTEs is not defined. PostgreSQL aims to execute only the main query. If the main query contains references to the CTEs, then PostgreSQL will execute them first. If the SELECT CTE is not referenced by the main query, directly or indirectly, then it is not executed at all. Data-changing sub statements are always executed.

Reusing SQL code with CTE

When the execution of a subquery takes too much time, and the subquery is used in the statement more than once, it makes sense to put it in the WITH clause to reuse its results. This makes the query faster, because PostgreSQL executes the subqueries from the WITH clause only once, saves the results in the memory or on disk, depending on their size, and then reuses them.

For example, let's take the car portal database. Suppose it is required to find newer car models. That would require the calculation of the average age of each model of the cars, and then selecting the models with an age lower than the average age among all the models.

This can be done in the following way:

SELECT marke, model, avg_age
  FROM (
    SELECT car_model_id,
        avg(EXTRACT(YEAR FROM now())-manufacture_year)
          AS avg_age
      FROM car_portal_app.car
      GROUP BY car_model_id
    ) age_subq1
    INNER JOIN car_portal_app.car_model
      ON car_model.car_model_id = age_subq1.car_model_id
  WHERE avg_age < (SELECT avg(avg_age) FROM
        (
          SELECT avg(EXTRACT(YEAR FROM now())
              -manufacture_year) avg_age
            FROM car_portal_app.car
            GROUP BY car_model_id
        ) age_subq2
      );
     marke     |    model     |     avg_age      
---------------+--------------+------------------
 BMW           | 1er          |                1
 BMW           | X6           |              2.5
 Mercedes Benz | A klasse     |              1.5
…
(41 rows)

The function EXTRACT used in the query returns the integer value of the given part of the date expression. In the preceding example, the function is used to retrieve the year from the current date. The difference between the current year and manufacture_year is the age of a car.

There are two subqueries that are similar: the same table is queried and the same grouping and aggregation are performed. That makes it possible to use the same subquery twice when it is used as a CTE:

WITH age_subq AS
(
  SELECT car_model_id,
      avg(EXTRACT(YEAR FROM now())-manufacture_year)
        AS avg_age
    FROM car_portal_app.car
    GROUP BY car_model_id
)
SELECT marke, model, avg_age
  FROM age_subq
    INNER JOIN car_portal_app.car_model
      ON car_model.car_model_id = age_subq.car_model_id
  WHERE avg_age < (SELECT avg(avg_age) FROM age_subq);
     marke     |    model     |     avg_age      
---------------+--------------+------------------
 BMW           | 1er          |                1
 BMW           | X6           |              2.5
 Mercedes Benz | A klasse     |              1.5
...
(41 rows)

The result of both the queries is the same. But the first query took 1.9 milliseconds to execute and the second one took 1.0 milliseconds. Of course, in absolute values the difference is nothing. But relatively, the WITH query is almost twice as fast. And if the number of records in the tables was in millions, then the absolute difference would be substantial.

Another advantage of using a CTE in this case is that the code became smaller and easier. That is another use case for the WITH clause. Long and complicated subqueries can be formatted as CTEs in order to make the whole query smaller and more understandable, even if it does not affect the performance.

Sometimes, it is better not to use CTE. For example, one could decide to preselect some columns from the table thinking it would help the database to perform the query because of the reduced amount of information to process. In that case, the query would be the following:

WITH car_subquery AS
(
  SELECT number_of_owners, manufacture_year,
      number_of_doors
    FROM car_portal_app.car
)
SELECT number_of_owners, number_of_doors
  FROM car_subquery
  WHERE manufacture_year = 2008;

But that has the opposite effect. PostgreSQL does not push the WHERE clause from the primary query to the substatement. The database will retrieve all the records from the table, take three columns from them, and store this temporary dataset in the memory. Then, the temporary data will be queried using the predicate, manufacture_year = 2008. If there was an index on manufacture_year, it would not be used because it is the temporary data being queried and not the real table.

For that reason, the following query is executed five times faster than the preceding one even though it seems almost the same:

SELECT number_of_owners, number_of_doors
  FROM car_portal_app.car
  WHERE manufacture_year = 2008;

Recursive and hierarchical queries

It is possible to refer a substatement from itself. Statements like that are called recursive queries. Recursive queries must have a special structure that indicates to the database that the subquery is recursive. The structure of a recursive query is as follows:

WITH RECURSIVE <subquery_name> (<field list>) AS
(
<non-recursive term>
UNION [ALL|DISTINCT]
<recursive term>
)  
[,...]
<main query>

Both, non-recursive and recursive terms, are subqueries that must return the same number of fields of the same types. The names for the fields are specified in the declaration of the whole recursive query; therefore, it does not matter which names are assigned to the fields in the subqueries.

The non-recursive term is also called an anchor subquery while the recursive term is also known as an iterating subquery.

A non-recursive or anchor subquery is a starting point of the execution of a recursive query. It cannot refer to the name of the recursive query. It is executed only once. The results of the non-recursive term are passed to the recursive term, which can reference the whole recursive query by its name. If the recursive term returns rows, they are passed to the recursive term again. This is called iteration. Iteration is repeated as long as the result of the recursive term is not empty. The result of the whole query is all the rows returned by the non-recursive term and all the iterations of the recursive term. If the keywords UNION ALL are used, then all the rows are returned. If UNION DISTINCT or just UNION is used, then the duplicated rows are removed from the result set.

Note that the algorithm of a recursive subquery implies iterations but not recursion. However, in SQL standard, these kind of queries are called recursive. In other databases, the same logic can be implemented in a similar manner, but the syntax can be slightly different.

For example, the following recursive query can be used to calculate factorial values:

WITH RECURSIVE subq (n, factorial) AS
(
  SELECT 1, 1
  UNION ALL
  SELECT n+1, factorial*(n+1) from subq WHERE n <5
)
SELECT * FROM subq;
n | factorial
---+-----------
 1 |         1
 2 |         2
 3 |         6
 4 |        24
 5 |       120
(5 rows)

Here, SELECT 1, 1 is the anchor subquery. It returns one row (the fields n and factorial have values of 1 and 1), which is passed to the subsequent iterating subquery. The first iteration adds one to the value of the field n and multiplies the value of the factorial by (n+1), which gives the values 2 and 2; it then passes the result row to the next iteration. The second iteration returns the values 3 and 6, and so on. The last iteration returns a row where the value of the field n equals 5. That row is filtered out in the WHERE clause of the iterating subquery; that's why the following iteration returns nothing and the execution stops at that point. So, the whole recursive subquery returns a list of five numbers, from 1 to 5, and their factorial values.

Note that if no WHERE clause was specified, the execution would never stop, which would cause an error at the end.

The preceding example is quite easy to implement without recursive queries. PostgreSQL provides a way to generate a series of numeric values and use them in subqueries. But there is a task that cannot be solved without recursive queries, and that is: querying a hierarchy of objects.

There is no hierarchical data in the car portal database, so in order to illustrate the technique, we need to create some sample data. A typical hierarchy implies the existence of a parent-child relationship between objects, where an object can be a parent and a child at the same time. Suppose there is a family: Alan has two children, Bert and Bob. Bert also has two children: Carl and Carmen. Bob has one child: Cecil, who has two children, Dave and Den. The relationships are shown in the next image:

Recursive and hierarchical queries

An hierarchical relationship

In the database, the hierarchy can be stored in a simple table of two columns: parent and child.

CREATE TABLE family (parent text, child text);
INSERT INTO family VALUES (NULL, 'Alan'),
('Alan', 'Bert'), ('Alan', 'Bob'), ('Bert', 'Carl'), ('Bert', 'Carmen'), ('Bob', 'Cecil'), ('Cecil', 'Dave'), ('Cecil', 'Den');

The first inserted record with a NULL value for parent indicates that there is no information about Alan's parent. The table is not normalized, but it does not matter for the example.

Suppose it is required to build a full bloodline for all the children in the family. It is not possible to do so by just joining tables, because each join will process only one level of hierarchy, but in general, the number or levels is not given.

The following recursive query will solve the problem:

WITH RECURSIVE genealogy (bloodline, parent, level) AS
(
  SELECT child, child, 0
    FROM family WHERE parent IS NULL
  UNION ALL
  SELECT g.bloodline || ' -> ' || f.child, f.child,
      g.level + 1
    FROM family f, genealogy g
    WHERE f.parent = g.parent
)
SELECT bloodline, level FROM genealogy;
          bloodline           | level
------------------------------+-------
 Alan                         |     0
 Alan -> Bert                 |     1
 Alan -> Bob                  |     1
 Alan -> Bert -> Carl         |     2
 Alan -> Bert -> Carmen       |     2
 Alan -> Bob -> Cecil         |     2
 Alan -> Bob -> Cecil -> Dave |     3
 Alan -> Bob -> Cecil -> Den  |     3
(8 rows)

In the non-recursive term, the start of the hierarchy is selected. Here it is a child who has no parent. His name is an initial value for the bloodline. On the first iteration in the recursive term, his children are selected. Their names are added to the bloodline field with a separator ->. On the second iteration, the children of the children are selected, and so on. When no more children are found, the execution stops. The value in the field level is incremented on each iteration so that the number of iterations is visible in the results.

There is a potential problem with those hierarchical queries. If the data contained cycles, the recursive query would never stop if written in the same way as the preceding code. For example, let's add another record into the table family:

INSERT INTO family VALUES ('Bert', 'Alan');

Now there is a cycle in the data: Alan is a child of his own child. To run the query, it is necessary to somehow make the query stop. That can be done by checking if the child being processed is already included in the bloodline, as follows:

WITH RECURSIVE genealogy
  (bloodline, parent, level, processed) AS
(
  SELECT child, child, 0, ARRAY[child]
    FROM family WHERE parent IS NULL
  UNION ALL
  SELECT g.bloodline || ' -> ' || f.child,
      f.child, g.level + 1, processed || f.child
    FROM family f, genealogy g
    WHERE f.parent = g.parent
      AND NOT f.child = ANY(processed)
)
SELECT bloodline, level FROM genealogy;
          bloodline           | level
------------------------------+-------
 Alan                         |     0
 Alan -> Bert                 |     1
 Alan -> Bob                  |     1
 Alan -> Bert -> Carl         |     2
 Alan -> Bert -> Carmen       |     2
 Alan -> Bob -> Cecil         |     2
 Alan -> Bob -> Cecil -> Dave |     3
 Alan -> Bob -> Cecil -> Den  |     3
(8 rows)

The result is the same as in the previous example. The field processed is an array which contains the names of all the processed children. In fact, it has the same data as the field bloodline, but in a way that is more efficient to analyze. In each iteration, the name of the processed child is added to the array. Additionally, in the WHERE clause of the recursive term, the name of the child is checked so that it is not equal to any element of the array.

There are some limitations to the implementation of recursive queries. The use of aggregation is not allowed in the recursive term. Moreover, the name of the recursive subquery can be referenced only once in the recursive term.

Changing data in multiple tables at a time

Another very useful application of CTEs is performing several data-changing statements at once. This is done by including the INSERT, UPDATE, and DELETE statements in the CTEs. The results of any of those statements can be passed to the following CTEs or to the primary query by specifying the RETURNING clause. As well as for SELECT statements, the maximal number of changing data common table expressions is not defined.

For example, suppose one wants to add a new car to the car portal database and there is no corresponding car model in the car_model table. To do this, one needs to enter a new record in the car_model, take the ID of the new record, and use that ID to insert the data into the car table:

car_portal=# INSERT INTO car_portal_app.car_model (marke, model) VALUES ('Ford','Mustang') RETURNING car_model_id;
car_model_id
--------------
          100
(1 row)
INSERT 0 1
car_portal=# INSERT INTO car_portal_app.car (number_of_owners, registration_number, manufacture_year, number_of_doors, car_model_id, mileage)
  VALUES (1, 'GTR1231', 2014, 4, 100, 10423);
INSERT 0 1

Sometimes, it is not convenient to perform two statements while storing the intermediate ID number somewhere. The WITH queries provide a way to make the changes in both the tables at the same time:

car_portal=# WITH car_model_insert AS
(
  INSERT INTO car_portal_app.car_model (marke, model)     
    VALUES ('Ford','Mustang') RETURNING car_model_id
)
INSERT INTO car_portal_app.car
  (number_of_owners, registration_number,
    manufacture_year, number_of_doors, car_model_id,
    mileage)
  SELECT 1, 'GTR1231', 2014, 4, car_model_id, 10423
    FROM car_model_insert;
INSERT 0 1

As mentioned earlier, the CTEs that change the data are always executed. It does not matter if they are referenced in the primary query directly or indirectly. However, the order of their execution is not determined. One can influence that order by making them dependent on each other.

What if several CTEs change the same table or use the results produced by each other? There are some principles for their isolation and interaction:

  • For sub-statements:
    • All the sub-statements work with the data as it was at the moment of the start of the whole WITH query.
    • They don't see the results of each other's work. For example, it is not possible for the DELETE sub-statement to remove a row that was inserted by another INSERT sub-statement.
    • The only way for passing information of the processed records from a data changing CTE to another CTE is by using the RETURNING clause.
  • For triggers defined on the tables being changed:
    • For the BEFORE triggers: Statement-level triggers are executed just before the execution of each sub-statements. Row-level triggers are executed just before the changing of every record. This means that a row-level trigger for one sub-statements can be executed before a statement-level trigger for another sub-statement even if the same table is changed.
    • For the AFTER triggers: Both, statement-level and row-level triggers are executed after the whole WITH query. They are executed in groups per every sub-statement: first row-level then statement-level. That means that a statement-level trigger for one sub-statements can be executed before a row-level trigger for another sub-statement, even if the same table is changed.
    • The statements inside the code of the triggers do see the changes in data that were made by other sub-statements.
  • For the constraints defined on the tables being changed, assuming they are not set to DEFERRED:
    • PRIMARY KEY and UNIQUE constraints are validated for every record at the moment of insert or update of the record. They take into account the changes made by other sub-statements.
    • CHECK constraints are validated for every record at the moment of insert or update of the record. They do not take into account the changes made by other sub-statements.
    • FOREIGN KEY constraints are validated at the end of the execution of the whole WITH query.

A simple example of dependency and interaction between CTEs would be as follows:

car_portal=# CREATE TABLE t (f int UNIQUE);
CREATE TABLE
car_portal=# INSERT INTO t VALUES (1);
INSERT 0 1
car_portal=# WITH del_query AS (DELETE FROM t)
INSERT INTO t VALUES (1);
ERROR:  duplicate key value violates unique constraint "t_f_key"

The last query failed because PostgreSQL tried to execute the main query before the CTE. But if one creates a dependency that will make the CTE execute first, then the record will be deleted and the new record will be inserted. In that case, the constraint will not be violated:

car_portal=#  WITH del_query AS
(DELETE FROM t RETURNING f)
INSERT INTO t SELECT 1
WHERE (SELECT count(*) FROM del_query) IS NOT NULL;
INSERT 0 1

In the preceding code snippet, the WHERE condition in the main query does not have any practical meaning because the result of COUNT is never NULL. However, since the CTE is referenced in the query, it is executed before the execution of the main query.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset