Index-only scans

Indexes have already been described in Chapter 04, PostgreSQL Advanced Building Blocks. Simply speaking, indexes work like a glossary at the end of a book. When searching for a keyword in a book, to make it faster one can look it up in the glossary and then go to the page specified. The glossary is alphabetically organized; that's why searching in it is fast. However, when it is necessary just to find out if a keyword exists in the book, there is no need to go to the page. Just looking in the glossary is enough.

PostgreSQL can do the same. If all the information that is needed for a query is contained in the index, the database will not perform the scan on the table data and only use the index. This is called an index-only scan.

To demonstrate how it works, let's create an index for the table dwh.access_log_not_partitioned, as follows:

CREATE INDEX on dwh.access_log_not_partitioned (ts, status_code);

Now, suppose we want to find out when the first HTTP request that resulted in a status code 201 happened on August 1. The query will be as follows:

car_portal=> SELECT min(ts) FROM dwh.access_log_not_partitioned 
  WHERE ts BETWEEN '2017-08-01' AND '2017-08-02' AND status_code = '201';
 min 
------------------------
 2017-08-01 01:30:57+00
(1 row)
Time: 0.751 ms

The query took less than a millisecond. From the execution plan, it is clear that PostgreSQL did not scan the table and performed an index-only scan:

car_portal=> EXPLAIN SELECT min(ts) FROM dwh.access_log_not_partitioned 
  WHERE ts BETWEEN '2017-08-01' AND '2017-08-02' AND status_code = '201';
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result (cost=4.23..4.24 rows=1 width=8)
   InitPlan 1 (returns $0)
     -> Limit (cost=0.56..4.23 rows=1 width=8)
       -> Index Only Scan using access_log_not_partitioned_ts_status_code_idx on 
          access_log_not_partitioned (cost=0.56..135923.57 rows=37083 width=8)
            Index Cond: ((ts IS NOT NULL) AND (ts >= '2017-08-01 00:00:00+00'::timestamp with time zone) 
            AND (ts <= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (status_code = 201))
(5 rows)

To see the benefit of applying this feature, let's switch it off and check the timing again, as follows:

car_portal=> SET enable_indexonlyscan = off;
SET

car_portal=> SELECT min(ts) FROM dwh.access_log_not_partitioned 
  WHERE ts BETWEEN '2017-08-01' AND '2017-08-02' AND status_code = '201';
 min 
------------------------
 2017-08-01 01:30:57+00
(1 row)
Time: 1.225 ms

Now the query is almost two times slower. The execution plan is slightly different (though the total cost is the same):

car_portal=> EXPLAIN SELECT min(ts) FROM dwh.access_log_not_partitioned 
  WHERE ts BETWEEN '2017-08-01' AND '2017-08-02' AND status_code = '201';
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result (cost=4.23..4.24 rows=1 width=8)
   InitPlan 1 (returns $0)
     -> Limit (cost=0.56..4.23 rows=1 width=8)
       -> Index Scan using access_log_not_partitioned_ts_status_code_idx on 
          access_log_not_partitioned (cost=0.56..135923.57 rows=37083 width=8)
            Index Cond: ((ts IS NOT NULL) AND (ts >= '2017-08-01 00:00:00+00'::timestamp with time zone) 
            AND (ts <= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (status_code = 201))
(5 rows)

This feature can only work if all the fields that are referenced in a query are part of an index. In some cases, the optimizer may conclude that a sequential scan on the table will be cheaper, even if the index has all the data. This is usually true when a big number of records are supposed to be returned.

Index-only scans have certain limitations due to how PostgreSQL maintains transaction isolation. For details, see Chapter 10, Transactions and Concurrency Control. In short, PostgreSQL checks the visibility maps for the table to decide whether certain data should be visible to the current transaction. If it cannot find this information there, checking the table data is necessary.

Not all index types support index-only scans. B-tree indexes always support them, GiST and SP-GiST support then only for certain operators, and GIN indexes do not support this feature.

This topic does not belong specifically to OLAP solutions. However, in data warehouses, it is common to create multiple indexes on different combinations of fields, sometimes overlapping each other. Understanding how index-only scans work can help with designing a better structure for a database.

More on index-only scans can be found in the PostgreSQL documentation at https://www.postgresql.org/docs/current/static/indexes-index-only-scans.html.

Table of Contents for Index-only scans

Create new playlist

Sign In

Sign Up

Table of Contents for
Index-only scans