Indexes

An index is a physical database object that is defined on a table column or a list of columns. In PostgreSQL, there are many types of indexes and several ways to use them. Indexes can be used, in general, to do the following:

  • Optimize performance: an index allows efficient retrieval of a small number of rows from the table. What counts as a small number is determined by the total number of rows in the table and by the execution planner settings.
  • Validate constraints: an index can be used to enforce a constraint instead of checking it row by row. For example, the UNIQUE constraint creates a unique index on the column behind the scenes.
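As a small illustration of the second point, creating a UNIQUE constraint implicitly creates a unique index; the table and column names here are hypothetical:

    -- A hypothetical table: both the primary key and the UNIQUE
    -- constraint are enforced by unique indexes created behind
    -- the scenes (t_pkey and t_code_key, respectively).
    CREATE TABLE t (
        t_id int PRIMARY KEY,
        code text UNIQUE
    );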

Let us take a look at the account_history table in the car web portal example. The unique constraint, UNIQUE (account_id, search_key, search_date), has two purposes: the first is to ensure that a search key is inserted into the table only once per day, even if the user searches for the same key several times; the second is to retrieve the data quickly. Let us assume that we would like to show the last 10 searches for a user. The query for performing this is as follows:

SELECT search_key FROM account_history WHERE account_id = <account> GROUP BY search_key ORDER BY max(search_date) limit 10;

The preceding query returns at most 10 records containing the distinct search_key values, ordered by the latest search_date. If account_history contains millions of rows, reading all of them would take a long time. That is not the case here, because the unique index allows PostgreSQL to read only the rows belonging to this particular account. To understand how PostgreSQL uses indexes, let us populate the table with fictional data.

  • Fictional data 1: Two users and four search keys for each—in total, eight records will be inserted into the table:
    WITH test_account AS( INSERT INTO account VALUES (1000, 'test_first_name', 'test_last_name','[email protected]', 'password'),
    (2000, 'test_first_name2', 'test_last_name2','[email protected]', 'password2') RETURNING account_id
    ),car AS ( SELECT i as car_model FROM (VALUES('brand=BMW'), ('brand=WV')) AS foo(i)
    ),manufacturing_date AS ( SELECT  'year='|| i as date FROM generate_series (2015, 2014, -1) as foo(i))
    INSERT INTO account_history (account_id, search_key,
    search_date) SELECT account_id, car.car_model||'&'||manufacturing_date.date, current_date
    FROM test_account, car, manufacturing_date;
    

    To understand how the data is generated: we simply created a cross join, as discussed in Chapter 1, Relational Databases. Since we have two brands and two manufacturing dates, four combinations are generated. To see the generated data, let us execute the preceding query for the test account whose account_id is 1000:

    car_portal=# SELECT search_key FROM account_history WHERE account_id = 1000 GROUP BY (search_key) ORDER BY max(search_date) limit 10;
         search_key
    ---------------------
     brand=BMW&year=2014
     brand=WV&year=2014
     brand=BMW&year=2015
     brand=WV&year=2015
    (4 rows)
    

    For small data sets, PostgreSQL prefers to perform a sequential scan, simply due to the execution cost: on a small data set, a sequential scan is faster than an index scan. To get the execution plan, one can execute the following command:

    car_portal=# EXPLAIN SELECT search_key FROM account_history WHERE account_id = 1000 GROUP BY search_key ORDER BY max(search_date) limit 10;
                                         QUERY PLAN
    -------------------------------------------------------------------------------------
     Limit  (cost=38.64..38.67 rows=10 width=23)
       ->  Sort  (cost=38.64..39.43 rows=313 width=23)
             Sort Key: (max(search_date))
             ->  HashAggregate  (cost=28.75..31.88 rows=313 width=23)
                   ->  Seq Scan on account_history  (cost=0.00..25.63 rows=625 width=23)
                         Filter: (account_id = 1000)
    (6 rows)
    
  • Fictional data 2: Use the same code as before, but keep increasing the number of test accounts, manufacturing dates, and car models to generate a bigger data set, as follows. Keep regenerating the execution plan until you get an index scan:
    ...
    (5000, 'test_first_name5', 'test_last_name5','[email protected]', 'password5')
    ...
    SELECT i as car_model FROM (VALUES('brand=BMW'), ('brand=WV'), ('brand=Opel'), ('brand=Fiat'), ('brand=Siat')) AS foo(i)
    ...
    SELECT  'year='|| i as date FROM generate_series (2015, 1980, -1) as foo(i)
    

    With a big data set, PostgreSQL performs an index scan instead of a sequential scan to retrieve the data. The following shows the execution plan with an index scan for the preceding query:

    car_portal=# EXPLAIN SELECT search_key FROM account_history WHERE account_id = 1000 GROUP BY search_key ORDER BY max(search_date) limit 10;
                                                                   QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=26.07..26.09 rows=10 width=24)
       ->  Sort  (cost=26.07..26.19 rows=50 width=24)
             Sort Key: (max(search_date))
             ->  HashAggregate  (cost=24.49..24.99 rows=50 width=24)
                   ->  Bitmap Heap Scan on account_history  (cost=10.18..23.26 rows=246 width=24)
                         Recheck Cond: (account_id = 1000)
                         ->  Bitmap Index Scan on account_history_account_id_search_key_search_date_key  (cost=0.00..10.12 rows=246 width=0)
                               Index Cond: (account_id = 1000)
    (8 rows)
    

    The PostgreSQL planner decides whether to use an index based on the execution plan cost. For the same query with different parameters, the planner might pick a different plan based on the data histogram.

Index types

PostgreSQL supports several index types; each type is suited to certain scenarios or data types.

  • B-tree index: This is the default index in PostgreSQL when the index type is not specified with the CREATE INDEX command. The "B" stands for balanced, which means that the data on both sides of the tree is roughly equal. B-tree can be used for equality, ranges, and null predicates. The B-tree index supports all PostgreSQL data types.
  • Hash indexes: Hash indexes are not well supported in PostgreSQL. They are not transaction-safe and are not replicated to the slave nodes in streaming replication. They are useful for equality predicates, but since a B-tree index supports this use case as well, hash indexes are not recommended.
  • Generalized inverted index (GIN): The GIN index is useful when several values need to map to one row. It can be used with complex data structures such as arrays and full-text searches.
  • Generalized Search Tree (GiST): The GiST indexes allow building of general balanced tree structures. They are useful in indexing geometric data types, as well as full-text search.
  • Block range index (BRIN): This index type is introduced in PostgreSQL 9.5. A BRIN index is useful for very large tables where the index size needs to be kept small. It is slower than a B-tree index, but requires much less space.
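The index type is chosen with the USING clause of the CREATE INDEX command; when the clause is omitted, a B-tree index is created. The following sketch uses hypothetical table and column names:

    -- B-tree is the default when USING is omitted:
    CREATE INDEX ON some_table (some_column);
    -- An explicit index type, for example a GIN index on an array column:
    CREATE INDEX ON some_table USING gin (tags);
    -- A GiST index on a geometric column:
    CREATE INDEX ON some_table USING gist (location);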

Partial indexes

A partial index indexes only a subset of the table data that meets a certain predicate; the WHERE clause is used with the index. The idea behind a partial index is to decrease the index size, thus making it more maintainable and faster to access. Let us assume that we would like to introduce a data life cycle for some tables. For the advertisement table in the car web portal, suppose that we would like to keep the advertisement for a certain period of time after deletion to enable the seller to revive their advertisement easily. The deletion operation will be emulated by adding a column called advertisement_deletion_date and a job scheduler task to clean up the advertisements that have been deleted. For the SQL part, this could be done as follows:

car_portal=# ALTER TABLE advertisement ADD column advertisement_deletion_date TIMESTAMP WITH TIME ZONE;
ALTER TABLE

In order to retrieve dead advertisements quickly, one can create a partial index with a predicate on advertisement_deletion_date, as follows:

CREATE INDEX ON advertisement(advertisement_id) WHERE advertisement_deletion_date IS NOT NULL;
CREATE INDEX

The preceding index helps in retrieving dead advertisements quickly without indexing the whole advertisement table.
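For example, a query such as the following, which filters on the same predicate as the index, can be answered using the partial index:

    SELECT advertisement_id FROM advertisement
    WHERE advertisement_deletion_date IS NOT NULL;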

Indexes on expressions

As stated earlier, an index can be created on a table column or on multiple columns. It can also be created on expressions and function results. In the account table, let us assume that we would like to search for the account with the first name foo. Also, let us assume that names are stored in the database in different letter cases, such as Foo, foo, and fOo, and that there is an index on first_name. To get the accounts, one can use the lower() and upper() functions, or the case-insensitive regular expression operator ~*. The queries can be written as follows:

car_portal=# INSERT INTO account values (DEFAULT, 'fOo', 'bar', '[email protected]', md5('foo'));
INSERT 0 1
car_portal=# SELECT * FROM account WHERE first_name ~* '^foo$';
 account_id | first_name | last_name |    email    |             password
------------+------------+-----------+-------------+----------------------------------
          1 | fOo        | bar       | [email protected] | acbd18db4cc2f85cedef654fccc4a4d8
(1 row)
car_portal=# SELECT * FROM account WHERE lower(first_name) = 'foo';
 account_id | first_name | last_name |    email    |             password
------------+------------+-----------+-------------+----------------------------------
          1 | fOo        | bar       | [email protected] | acbd18db4cc2f85cedef654fccc4a4d8
(1 row)

Neither of the two preceding approaches will use an index scan, even though an index scan would be preferable here, because only one row is retrieved. When the lower(first_name) function is used, its result is not known until the function is executed, so the plain index on first_name cannot be used. Similarly, the regular expression might match one or more stored values. To solve this, an index on the lower() function can be created as follows, and the lower() function should then be used in queries instead of the regular expression:

CREATE index ON account(lower(first_name));
SELECT * FROM account WHERE lower(first_name) = 'foo';

Another usage of the expression indexes is to find rows by casting a data type to another data type. For example, the birth of a child can be stored as a timestamp; however, we often search for a birth date and not the birth time.
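As a sketch of this technique, assuming a hypothetical child table where the birth is stored as a timestamp, an index on the cast expression allows searching by date:

    -- Hypothetical table: birth is stored as a timestamp without time zone.
    CREATE TABLE child (child_id int PRIMARY KEY, birth timestamp);
    -- Index the date part of the timestamp:
    CREATE INDEX ON child (CAST(birth AS date));
    -- The index can then serve queries by birth date:
    SELECT child_id FROM child WHERE CAST(birth AS date) = '2014-05-01';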

Unique indexes

A unique index guarantees the uniqueness of a value across the whole table. In the account table, the email column has a unique constraint. This is implemented by a unique index, as shown by the \d meta-command:

 car_portal=# \d account
                           Table "car_portal_app.account"
   Column   |  Type   |                          Modifiers
------------+---------+--------------------------------------------------------------
 account_id | integer | not null default nextval('account_account_id_seq'::regclass)
 first_name | text    | not null
 last_name  | text    | not null
 email      | text    | not null
 password   | text    | not null
Indexes:
    "account_pkey" PRIMARY KEY, btree (account_id)
    "account_email_key" UNIQUE CONSTRAINT, btree (email)
    "account_lower_idx" btree (lower(first_name))
Check constraints:
...
Referenced by:
...

Note that the Indexes section of the \d output shows three indexes; account_email_key is a B-tree index on the email column.

Tip

In the PostgreSQL naming conventions, the suffixes for unique constraints and normal indexes are key and idx, respectively.

One can also use the unique and partial indexes together. For example, let us assume that we have a table called employee, where each employee must have a supervisor except for the company head. We can model this as a self-referencing table, as follows:

CREATE TABLE employee (employee_id INT PRIMARY KEY, supervisor_id INT);
ALTER TABLE employee ADD CONSTRAINT supervisor_id_fkey FOREIGN KEY (supervisor_id) REFERENCES employee(employee_id);

To guarantee that at most one row has a null supervisor_id (that is, that there is only one company head), we can add the following unique index:

CREATE UNIQUE INDEX ON employee ((1)) WHERE supervisor_id IS NULL;

The unique index on the constant expression (1) allows only one row with a null supervisor_id. With the first insert of a row with a null value, the constant 1 is added to the index. A second attempt to insert a row with a null value causes an error, because the value 1 is already indexed:

car_portal=# INSERT INTO employee VALUES (1, NULL);
INSERT 0 1
car_portal=# INSERT INTO employee VALUES (2, 1);
INSERT 0 1
car_portal=# INSERT INTO employee VALUES (3, NULL);
ERROR:  duplicate key value violates unique constraint "employee_expr_idx"
DETAIL:  Key ((1))=(1) already exists.

Multicolumn indexes

A multicolumn index can be used for a certain pattern of query conditions. Suppose a query has a pattern similar to the following:

SELECT * FROM table WHERE column1 = constant1 AND column2 = constant2 AND ... columnn = constantn;

In this case, an index can be created on column1, column2, ..., columnn, where n is less than or equal to 32. Currently, only the B-tree, GIN, and GiST index types support multicolumn indexes. When creating a multicolumn index, the column order is important. A multicolumn index is most efficient when there are equality constraints on its leading columns, plus, optionally, an inequality constraint on the first column that has no equality constraint.

Since the multicolumn index size is often big, the planner might prefer to perform a sequential scan rather than reading the index.
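To illustrate the leftmost-column rule with the unique index on account_history (account_id, search_key, search_date), consider the following queries; whether the planner actually uses the index also depends on its cost estimates:

    -- Can use the index: equality on the leading column.
    SELECT * FROM account_history WHERE account_id = 1000;
    -- Can use the index: equalities on the two leading columns,
    -- plus an inequality on the next column.
    SELECT * FROM account_history
    WHERE account_id = 1000
      AND search_key = 'brand=BMW&year=2014'
      AND search_date > '2015-01-01';
    -- Cannot use the index efficiently: the leading column is missing.
    SELECT * FROM account_history WHERE search_key = 'brand=BMW&year=2014';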

Best practices on indexes

It is often useful to index columns that are used in predicates, as well as foreign key columns. This enables PostgreSQL to use an index scan instead of a sequential scan. The benefits of indexes are not limited to SELECT statements; DELETE and UPDATE statements can also benefit from them. There are cases where an index is not used, often due to a small table size. In big tables, one should plan the storage capacity carefully, since the index size might be very big. Note also that indexes have a negative impact on INSERT statements, since amending the index comes with a cost.

There are several catalogue tables and functions that help in maintaining the indexes. The first view is pg_stat_all_indexes, which gives the statistics about the index usage. In the following example, we can see that the index account_history_account_id_search_key_search_date_key is never used, which is fine in this case, since the database has just been created:

car_portal=# SELECT * FROM pg_stat_all_indexes WHERE schemaname = 'car_portal_app' limit 1;
  relid  | indexrelid |   schemaname   |     relname     |                     indexrelname                      | idx_scan | idx_tup_read | idx_tup_fetch
---------+------------+----------------+-----------------+-------------------------------------------------------+----------+--------------+---------------
 8053962 |    8053971 | car_portal_app | account_history | account_history_account_id_search_key_search_date_key |        0 |            0 |             0
(1 row)

The pg_indexes_size function can be used with the pg_size_pretty function to get the index size in a human-readable form, as follows:

car_portal=# SELECT pg_size_pretty(pg_indexes_size ('car_portal_app.account_pkey'));
 pg_size_pretty
----------------
 0 bytes
(1 row)

When creating an index, make sure that the index does not exist, otherwise one could end up with duplicate indexes. PostgreSQL does not raise a warning when creating duplicate indexes, as shown in the following example:

car_portal=# CREATE index on car_portal_app.account(first_name);
CREATE INDEX
car_portal=# CREATE index on car_portal_app.account(first_name);
CREATE INDEX
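To see which indexes already exist on a table before creating a new one, the pg_indexes system view can be queried, as in the following sketch:

    SELECT indexname, indexdef FROM pg_indexes
    WHERE schemaname = 'car_portal_app' AND tablename = 'account';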

In some cases, an index might become bloated; PostgreSQL provides the REINDEX command to rebuild it. Note that REINDEX is a blocking command. To work around this, one can concurrently create another index identical to the original one, thereby avoiding the locks. Creating an index concurrently is often preferred on a live system, but it requires more work than a normal index build. Moreover, creating an index concurrently has some caveats: in some cases, the index creation might fail, leaving an invalid index behind. An invalid index can be dropped or rebuilt using the DROP INDEX and REINDEX statements, respectively.

Let us assume that we have a table called a, with an index on this table called a_index. Let us also assume that the index is bloated, and we need to rebuild it. The following two pieces of code can be considered equivalent, with one difference: the locking behavior:

--- create a test table
car_portal=# CREATE TABLE a (id int);
CREATE TABLE
car_portal=# CREATE INDEX a_index ON a(id);
CREATE INDEX
--- Piece of code1: cause locks
car_portal=# REINDEX INDEX a_index;
REINDEX
--- Piece of code 2: No locks
car_portal=# CREATE INDEX concurrently a_index_1 ON a(id);
CREATE INDEX
car_portal=# DROP index a_index;
DROP INDEX
car_portal=# ALTER INDEX a_index_1 RENAME TO a_index;
ALTER INDEX

Indexing in PostgreSQL is a huge topic. Indexes can also be used to answer certain queries without scanning the table at all, and to speed up sorting operations. Additionally, they can be used for anchored and non-anchored text searches via operator classes and operator families. Finally, PostgreSQL supports full-text search using the GIN and GiST index types together with the text search types tsvector and tsquery.
