Tuning PostgreSQL queries

PostgreSQL provides the means to figure out why a certain query is slow. Behind the scenes, PostgreSQL analyzes the tables, collects statistics from them, and builds histograms using autovacuum. Autovacuum, in general, is used to recover disk space, update table statistics, and perform other maintenance tasks, such as preventing transaction ID wraparound. Table statistics allow PostgreSQL to pick the execution plan with the least cost; the cost is calculated by taking into account the I/O and CPU costs.
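Whether autovacuum has recently refreshed a table's statistics can be checked through the pg_stat_user_tables statistics view, for example:

SELECT relname, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables;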

Also, PostgreSQL enables users to see the generated execution plan by providing the EXPLAIN command.

For beginners, it is extremely useful to write the same query in different ways and compare the results. For example, in some cases, the NOT IN construct can be converted to LEFT JOIN or NOT EXISTS. Also, the IN construct can be rewritten using INNER JOIN as well as EXISTS. Writing the query in several ways teaches the developer when to use or avoid a certain construct and what conditions favor a certain construct. In general, the NOT IN construct can sometimes cause performance issues because PostgreSQL cannot use indexes to evaluate the query.
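As an illustration, here is a minimal sketch of the same anti-join written in the three forms mentioned above, assuming two hypothetical tables a and b that both have an id column:

-- NOT IN: often cannot be planned as an anti-join, and NULLs in b.id
-- change the semantics (the query then returns no rows at all).
SELECT * FROM a WHERE id NOT IN (SELECT id FROM b);
-- NOT EXISTS: usually planned as an anti-join and index-friendly.
SELECT * FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id);
-- LEFT JOIN ... IS NULL: an equivalent anti-join formulation.
SELECT a.* FROM a LEFT JOIN b ON a.id = b.id WHERE b.id IS NULL;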

Another important point is to keep track of new SQL commands and features. The PostgreSQL development community is very active, and its contributions often target common issues. For example, the LATERAL JOIN construct, which was introduced in PostgreSQL 9.3, can be used to optimize certain GROUP BY and LIMIT scenarios.
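For instance, a top-N-per-group query, which would otherwise need window functions or a correlated subquery, can be expressed directly with LATERAL; a sketch, assuming hypothetical customer and purchase tables:

SELECT c.name, p.amount
FROM customer c
CROSS JOIN LATERAL (
    SELECT amount
    FROM purchase
    WHERE purchase.customer_id = c.id
    ORDER BY amount DESC
    LIMIT 3 -- the three largest purchases per customer
) p;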

The EXPLAIN command and execution plan

The first step in tuning PostgreSQL queries is to understand how to read the execution plans generated by the EXPLAIN command. The EXPLAIN command shows the execution plan of a statement, that is, how data from the tables is scanned; for example, a table might be scanned using an index or a sequential scan. It also shows how the tables are joined, the join method, and the estimated number of rows. The EXPLAIN command has several options; the ANALYZE option causes the statement to be executed and returns the actual times and row counts. Finally, the EXPLAIN command can give insights into buffer usage and caching. The synopsis for the EXPLAIN command is as follows:

EXPLAIN [ ( option [, ...] ) ] statement
EXPLAIN [ ANALYZE ] [ VERBOSE ] statement

where option can be one of:

    ANALYZE [ boolean ]
    VERBOSE [ boolean ]
    COSTS [ boolean ]
    BUFFERS [ boolean ]
    TIMING [ boolean ]
    FORMAT { TEXT | XML | JSON | YAML }

To understand EXPLAIN, let's have a look at the following example:

CREATE TABLE test_explain_1 (
  id INT PRIMARY KEY,
  name TEXT NOT NULL
);

INSERT INTO test_explain_1 SELECT n, md5(random()::text) FROM generate_series(1, 100000) AS foo(n);
-- To update table statistics
ANALYZE test_explain_1;
-- Get the execution plan
EXPLAIN SELECT * FROM test_explain_1;
-- Output
Seq Scan on test_explain_1  (cost=0.00..1834.00 rows=100000 width=37)

In the preceding example, a table was created and filled with randomly generated data. The execution plan is a single node, a sequential scan on the test_explain_1 table, as shown in the preceding code. The number of rows is estimated correctly because we analyzed the table after the insert.

Tip

Autovacuum is important for keeping your database's statistics up to date. Wrong statistics mean nonoptimal execution plans. After bulk operations, it is good to run the ANALYZE command to update the statistics. Also, one can control the data sampling performed by ANALYZE using ALTER TABLE ... ALTER COLUMN ... SET STATISTICS <integer>; this allows a better estimation of rows.
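For example, one could raise the sampling target for the name column from the default of 100 (the default_statistics_target setting) to 1,000:

ALTER TABLE test_explain_1 ALTER COLUMN name SET STATISTICS 1000;
ANALYZE test_explain_1;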

The cost is an estimation of the effort required to execute the query. In the preceding example, the first cost, 0.00, is the startup cost to retrieve the first row, while the second cost, 1834.00, is the total cost to retrieve all rows, which is calculated as follows:

cost = relpages * seq_page_cost + reltuples * cpu_tuple_cost

The number of relation pages and rows can be found in pg_class, and seq_page_cost and cpu_tuple_cost are planner-related configuration settings. So, the cost of 1834 is calculated as shown in the following example:

SELECT relpages*current_setting('seq_page_cost')::numeric + reltuples*current_setting('cpu_tuple_cost')::numeric as cost
FROM pg_class
WHERE relname='test_explain_1';
 cost
------
 1834
(1 row)

For the simple case of a sequential scan, it is almost straightforward to calculate the cost. However, when a query involves predicate evaluation, grouping, ordering, and joining, cost estimation becomes complicated.
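To give a flavor of this, adding even a simple filter charges cpu_operator_cost (0.0025 by default) per examined row on top of the sequential scan cost; a sketch on the same table, where the exact figures depend on the local configuration:

EXPLAIN SELECT * FROM test_explain_1 WHERE name = 'abc';
-- Seq Scan on test_explain_1  (cost=0.00..2084.00 rows=1 width=37)
--   Filter: (name = 'abc'::text)
-- 2084 = 1834 + 100000 * cpu_operator_cost (0.0025)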

Finally, the width 37 is the average width of the tuple in bytes. This information can be found in the pg_stats table.
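The per-column average widths can be retrieved from pg_stats; the tuple width is their sum:

SELECT attname, avg_width FROM pg_stats WHERE tablename = 'test_explain_1';
-- id: 4 bytes, name: 33 bytes (32 md5 characters plus a 1-byte header), 4 + 33 = 37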

To actually execute the query and measure the real run time, one can use EXPLAIN (ANALYZE), as follows:

EXPLAIN (ANALYZE) SELECT * FROM test_explain_1 WHERE id >= 10 and id < 20;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using test_explain_1_pkey on test_explain_1  (cost=0.29..8.47 rows=9 width=37) (actual time=0.008..0.011 rows=10 loops=1)
   Index Cond: ((id >= 10) AND (id < 20))
 Total runtime: 0.031 ms
(3 rows)

In the preceding query, the planner's estimate was very close to the real value: it estimated 9 rows instead of 10. Also, the generated execution plan now uses an index scan. The execution plan also shows other information, such as the number of loops and the actual time.

Note that the execution plan now spans several lines with different indentation. One should read the execution plan bottom-up and from the most to the least indented node. This can be seen more clearly in the following example, which performs a self-join on test_explain_1:

car_portal=# EXPLAIN SELECT * FROM test_explain_1 a JOIN test_explain_1 b ON (a.id = b.id) WHERE a.id < 100;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.58..694.89 rows=90 width=74)
   ->  Index Scan using test_explain_1_pkey on test_explain_1 a  (cost=0.29..9.87 rows=90 width=37)
         Index Cond: (id < 100)
   ->  Index Scan using test_explain_1_pkey on test_explain_1 b  (cost=0.29..7.60 rows=1 width=37)
         Index Cond: (id = a.id)
(5 rows)

The EXPLAIN (BUFFERS) option shows the effect of caching and whether the cache is configured properly. To see the full effect of caching, one needs to perform cold and hot testing. The following example shows three pages read from the shared buffers:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_explain_1 WHERE id >= 10 and id < 20;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using test_explain_1_pkey on test_explain_1  (cost=0.29..8.47 rows=9 width=37) (actual time=0.008..0.010 rows=10 loops=1)
   Index Cond: ((id >= 10) AND (id < 20))
   Buffers: shared hit=3
 Total runtime: 0.026 ms
(4 rows)

Detecting problems in query plans

The EXPLAIN command can show why a query is slow, especially if the two options, BUFFERS and ANALYZE, are used. There are some hints that enable us to decide whether the execution plan is good or not; these hints are as follows:

  • The estimated row number in comparison with the actual rows: This is important because this parameter defines the method of the query's execution. The estimated number of rows may be overestimated or underestimated, and a wrong estimate affects the choice of algorithms used to fetch data from the hard disk, sort it, join it, and so on. In general, overestimating the number of rows hurts performance, but not as much as underestimating it; for example, an underestimate can lead the planner to choose a nested loop join on very big tables, which makes the query very slow.
  • In-memory or on-disk sort operation: When a sort operation, such as the one behind ORDER BY or DISTINCT, fits into work_mem, it is performed in RAM; otherwise, it spills to disk (see the sketch after this list).
  • Buffer cache: This shows how much of the data is cached and what the hit ratio is.
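The sort method is visible in the EXPLAIN (ANALYZE) output; the following sketch uses the test table, and the reported method and sizes depend on the work_mem setting:

EXPLAIN (ANALYZE) SELECT * FROM test_explain_1 ORDER BY name;
-- Look for a line such as:
--   Sort Method: external merge  Disk: 4440kB
-- Raising work_mem for the session can move the same sort into RAM:
SET work_mem = '64MB';
-- Sort Method: quicksort  Memory: ...kB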

In order to show a wrong execution plan, let's confuse the planner by performing an operation on the id column, as follows:

EXPLAIN SELECT * FROM test_explain_1 WHERE upper(id::text)::int < 20;
                              QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on test_explain_1  (cost=0.00..3334.00 rows=33333 width=37)
   Filter: ((upper((id)::text))::integer < 20)
(2 rows)

In the previous case, PostgreSQL was not able to estimate the selectivity of upper(id::text)::int < 20 properly. Also, it cannot use an index scan because there is no index matching the column expression. Note that the estimated number of rows, 33333, is extremely high as compared to the real number, 19. If this query were a subquery of another query, the error would cascade because the subquery might be executed several times.
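A possible remedy, sketched here, is an expression index; besides enabling an index scan, it makes ANALYZE collect statistics on the indexed expression itself, which repairs the row estimate:

CREATE INDEX ON test_explain_1 ((upper(id::text)::int));
ANALYZE test_explain_1;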

Finally, knowledge about different algorithms, such as the nested loop join, hash join, index scan, bitmap index scan, and so on, can be useful in detecting the root cause of performance degradation.

Common mistakes in writing queries

There are some common mistakes and bad practices that a developer may fall into, which are as follows.

Unnecessary operations

Some developers often use DISTINCT unnecessarily or do not know the difference between UNION, UNION ALL, EXCEPT, EXCEPT ALL, and so on. This causes slow queries, especially if the expected number of rows is high. For example, the following two queries return the same result, because the id primary key already guarantees distinct rows, but the one with DISTINCT is much slower:

car_portal=# \timing
car_portal=# SELECT * FROM test_explain_1;
Time: 71.258 ms
car_portal=# SELECT DISTINCT * FROM test_explain_1;
Time: 201.143 ms

Another common mistake is to use DISTINCT with UNION, as in the following query:

SELECT DISTINCT field_list FROM table_1 UNION
SELECT DISTINCT field_list FROM table_2

The UNION operator eliminates all duplicates on its own, even if table_1 or table_2 contains duplicates, so the inner DISTINCT operations are redundant.
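Conversely, when duplicates are acceptable or cannot occur, UNION ALL avoids the duplicate-elimination step entirely and is therefore cheaper:

SELECT field_list FROM table_1 UNION ALL
SELECT field_list FROM table_2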

Another common mistake is to use ORDER BY in a view definition. If ORDER BY is also used when selecting data from the view, an unnecessary sort operation is introduced, as in the following example:

CREATE OR REPLACE VIEW test_explain_1_VIEW AS SELECT * FROM test_explain_1;
EXPLAIN SELECT * FROM test_explain_1_view order by id ASC;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Index Scan using test_explain_1_pkey on test_explain_1  (cost=0.29..3441.29 rows=100000 width=37)
(1 row)

car_portal=# CREATE OR REPLACE VIEW test_explain_1_VIEW AS SELECT * FROM test_explain_1 ORDER BY id DESC;
car_portal=# EXPLAIN SELECT * FROM test_explain_1_view order by id ASC;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Sort  (cost=15483.11..15733.11 rows=100000 width=37)
   Sort Key: test_explain_1.id
   ->  Index Scan Backward using test_explain_1_pkey on test_explain_1  (cost=0.29..3441.29 rows=100000 width=37)
(3 rows)

Note, in the preceding example, the difference in cost between the two SELECT queries on the differently defined views.

Misplaced indexes

Missing indexes on column expressions cause a full table scan; the most common use case is allowing the user to perform operations on case-insensitive data, such as search and login. For example, one could allow logging in with the login name in a case-insensitive way. This can be achieved by matching on the lowercase or uppercase form of the name. To test this, let's create another table, as follows, since md5 hashing only generates lowercase text:

CREATE OR REPLACE FUNCTION generate_random_text(int) RETURNS TEXT AS
$$
SELECT string_agg(substr('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', trunc(random() * 62)::integer + 1, 1), '')  FROM  generate_series(1, $1)
$$
LANGUAGE SQL;

CREATE TABLE login as SELECT n, generate_random_text(8) as login_name  FROM generate_series(1, 1000) as foo(n);
CREATE INDEX ON login(login_name);
VACUUM ANALYZE login;

The generate_random_text(n) function is used to generate random text of length n. Let's assume that we want to check whether an entry exists in the login table; we could do this as follows:

EXPLAIN SELECT * FROM login WHERE login_name = 'jxaG6gjJ';
Index Scan using login_login_name_idx on login  (cost=0.28..8.29 rows=1 width=13)
  Index Cond: (login_name = 'jxaG6gjJ'::text)  

As seen in the preceding example, an index scan is used as there is an index on login_name.

Using functions on constant arguments still allows the index to be used if the function is not volatile, because the optimizer evaluates the function at planning time, as follows:

EXPLAIN SELECT * FROM login WHERE login_name = lower('jxaG6gjJ');
Index Scan using login_login_name_idx on login  (cost=0.28..8.29 rows=1 width=13)
  Index Cond: (login_name = 'jxag6gjj'::text)

Using functions on columns, in contrast, causes a sequential scan, as shown in the following query:

EXPLAIN SELECT * FROM login WHERE lower(login_name) = lower('jxaG6gjJ');
Seq Scan on login  (cost=0.00..21.00 rows=5 width=13)
  Filter: (lower(login_name) = 'jxag6gjj'::text)

Note that the estimated number of rows here is five because the optimizer cannot evaluate the predicate's selectivity correctly. To solve this issue, simply add an index on the expression, as follows:

CREATE INDEX ON login(lower(login_name));
EXPLAIN SELECT * FROM login WHERE lower(login_name) = lower('jxaG6gjJ');
Index Scan using login_lower_idx on login  (cost=0.28..8.29 rows=1 width=13)
  Index Cond: (lower(login_name) = 'jxag6gjj'::text)

Also, text indexing is governed by the access pattern. In general, there are two ways to index text: the first approach is to use an index with an operator class that allows anchored text searches (such as name LIKE 'a%'), and the second approach is full-text search using tsquery and tsvector. In the test_explain_1 table, one could create an index with the text_pattern_ops operator class, as follows:

SELECT * FROM test_explain_1 WHERE name like 'a%';
Time: 19.565 ms
CREATE INDEX on test_explain_1 (name text_pattern_ops);
SELECT * FROM test_explain_1 WHERE name like 'a%';
Time: 7.860 ms
EXPLAIN SELECT * FROM test_explain_1 WHERE name like 'a%';
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test_explain_1  (cost=281.22..1204.00 rows=7071 width=37)
   Filter: (name ~~ 'a%'::text)
   ->  Bitmap Index Scan on test_explain_1_name_idx  (cost=0.00..279.45 rows=7103 width=0)
         Index Cond: ((name ~>=~ 'a'::text) AND (name ~<~ 'b'::text))
(4 rows)

Note the execution plan after creating the index; a bitmap index scan was used instead of a sequential scan, and performance increased.

Unnecessary table or index scans

Often, one can see queries that cause several table or index scans because subqueries are used in the SELECT list or the query is not well written. For example, let's assume that there is a query to get the counts of good and bad rankings for a certain advertisement, written as follows:

SELECT
(SELECT count(*) FROM car_portal_app.advertisement_rating WHERE rank = 1 AND advertisement_rating_id = 1) AS good,
(SELECT COUNT(*) FROM car_portal_app.advertisement_rating WHERE rank = 5 AND advertisement_rating_id = 1) AS bad;

This query causes the same index to be scanned twice, as shown in the following execution plan:

Result  (cost=16.36..16.38 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=8.17..8.18 rows=1 width=0)
           ->  Index Scan using advertisement_rating_pkey on advertisement_rating  (cost=0.15..8.17 rows=1 width=0)
                 Index Cond: (advertisement_rating_id = 1)
                 Filter: (rank = 1)
   InitPlan 2 (returns $1)
     ->  Aggregate  (cost=8.17..8.18 rows=1 width=0)
           ->  Index Scan using advertisement_rating_pkey on advertisement_rating advertisement_rating_1  (cost=0.15..8.17 rows=1 width=0)
                 Index Cond: (advertisement_rating_id = 1)
                 Filter: (rank = 5)

The preceding query could be written using count(*) with a FILTER clause or a CASE expression, as follows:

SELECT count(*) FILTER (WHERE rank=5) as bad, count(*) FILTER (WHERE rank=1) as good FROM car_portal_app.advertisement_rating WHERE advertisement_rating_id = 1;

Using correlated nested queries

Correlated nested queries can cause performance issues because the subquery is executed within a loop; for example, the following query is not optimal:

CREATE TABLE test_explain_2 AS SELECT n as id, md5(n::text) as name FROM generate_series(1, 1000) as foo(n);
SELECT 1000
Time: 14.161 ms
SELECT * FROM test_explain_1 WHERE EXISTS (SELECT 1 FROM test_explain_2 WHERE test_explain_2.id = test_explain_1.id);
Time: 78.533 ms

The preceding query could be written using the INNER JOIN or IN construct, as follows:

SELECT test_explain_1.* FROM test_explain_1 INNER JOIN test_explain_2 USING (id);
Time: 2.111 ms
SELECT * FROM test_explain_1 WHERE id IN (SELECT id FROM test_explain_2);
Time: 2.143 ms

Using CTE when not mandatory

Common table expressions (CTEs) are an amazing feature; they allow the developer to write very complex logic. Also, a CTE can be referenced in several places, which can be used to optimize performance. However, using a CTE may be problematic in the case of predicate push down, as PostgreSQL does not optimize beyond CTE boundaries; each CTE runs in isolation. To understand this limitation, let's have a look at the following two equivalent dummy examples and note the difference in their run times:

car_portal=# With test_explain AS (SELECT * FROM test_explain_1) SELECT * FROM test_explain WHERE id = 4;
 id |               name
----+----------------------------------
  4 | aa0cca507bdb343206f579cab2a46169
(1 row)

Time: 82.280 ms
SELECT * FROM (SELECT * FROM test_explain_1) as foo WHERE id = 4;
 id |               name
----+----------------------------------
  4 | aa0cca507bdb343206f579cab2a46169
(1 row)
Time: 0.643 ms

PL/pgSQL procedural language considerations

The plan caching performed by the PL/pgSQL language can increase performance; however, if the developer is not careful, it may lead to bad execution plans. For example, let's assume that we want to wrap the following query in a function:

SELECT * FROM test_explain_1 WHERE id <= <predicate>;

In this example, the plan should not be cached, as the best execution plan differs depending on the predicate value. For example, with the value 1 as the predicate, PostgreSQL performs an index scan. However, with the predicate 90,000, PostgreSQL will most probably use a sequential scan; a single cached plan would therefore be wrong for one of these cases. However, consider that the preceding select statement is as follows:

SELECT * FROM test_explain_1 WHERE id = <predicate>;

In this case, it is better to cache the plan, as the execution plan is the same for all predicate values due to the index on the id column.
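When caching is undesirable, as in the range-predicate case above, dynamic SQL forces the statement to be re-planned on every call. The following is a minimal sketch; the function name is hypothetical:

CREATE OR REPLACE FUNCTION get_rows_up_to(p_id int) RETURNS SETOF test_explain_1 AS
$$
BEGIN
  -- EXECUTE plans the statement at run time, so the plan can switch between
  -- an index scan and a sequential scan depending on p_id.
  RETURN QUERY EXECUTE 'SELECT * FROM test_explain_1 WHERE id <= $1' USING p_id;
END;
$$ LANGUAGE plpgsql;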

Also, exception handling in PL/pgSQL is quite expensive, so one should be careful while using this feature.
