Chapter 6: How Indexes Work in CockroachDB

In Chapter 5, Fault Tolerance and Auto-Rebalancing, we learned about fault-tolerance and auto-recovery strategies in CockroachDB. In this chapter, we will learn what indexes are and how they improve query times.

Although indexes improve read performance, a poorly chosen index can slow down queries, including writes, and takes up additional storage space. So, it is important to identify the query patterns and create appropriate indexes.

The following topics will be covered in this chapter:

  • Introduction to indexes
  • Different types of indexes
  • Best practices while using indexes

Technical requirements

We will try creating various types of indexes in this chapter, which requires you to have CockroachDB installed. If you haven't done so yet, please refer to the Technical requirements section in Chapter 2, How Does CockroachDB Work Internally?.

Introduction to indexes

An index, or database index, helps return query results quickly by avoiding full table scans. An index is created on a specific table and can include one or more keys, where a key refers to a column in the table. The trade-off is that extra space is used to keep a separate, sorted copy of the indexed columns.

Let's take a simple example and see how an index works.

Consider a population table with the following columns and some sample values:

Figure 6.1 – Population table

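The figure itself is not reproduced here. If you want to follow along in a shell, a table like the following is enough; the column names are taken from the queries in this section, and the sample rows are made up for illustration:

```sql
-- Hypothetical recreation of the population table from Figure 6.1.
-- Row ids 1-5 match the index mapping shown later in this section.
CREATE TABLE population (
    id INT PRIMARY KEY,
    country STRING,
    continent STRING,
    population_in_millions INT
);

INSERT INTO population VALUES
    (1, 'India', 'Asia', 1380),
    (2, 'USA', 'North America', 331),
    (3, 'Nigeria', 'Africa', 206),
    (4, 'Japan', 'Asia', 126),
    (5, 'Germany', 'Europe', 83);
```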

Now, let's say you just want to retrieve the list of populations for specific continents, for example:

SELECT population_in_millions, country FROM population WHERE continent = 'Asia';

Here, in order to find rows 1 and 4, which are countries in Asia, you would have to iterate through each of the rows in the table, which is called a full table scan.

Now, if you want to avoid a full table scan, you can create an index on the continent as follows:

CREATE INDEX ON population (continent);

Internally, CockroachDB keeps track of all the continents and keeps a mapping from a given continent to all its relevant rows, as shown here:

Africa        -> (3)

Asia          -> (1, 4)

Europe        -> (5)

North America -> (2)

Now, if you run the same query, SELECT population_in_millions, country FROM population WHERE continent = 'Asia', once again, CockroachDB will identify that the filtering condition uses the column continent and that an index is already available for it. Based on the value Asia, it will directly retrieve rows (1, 4) from the continent index, fetch the relevant column values, and return them, avoiding a full table scan. Although this example only has five rows, the same concept applies when a table contains millions of rows, where avoiding a full table scan can significantly improve query performance. At the same time, writes tend to get a little slower, since with each write the index must be updated as well.

When you create an index on a column or a set of columns, CockroachDB internally makes a copy of the values of that column or columns and sorts them. So, whenever you execute a query that involves filters on indexed columns, a subset of rows is selected from the index first, rather than scanning the entire table. This improves the overall query performance.

Next, we are going to discuss various types of indexes that are available in CockroachDB.

Different types of indexes

Based on the query pattern and columns in the table, you should decide what kind of index is going to help with the performance.

The following are the types of indexes available in CockroachDB:

  • Primary indexes
  • Secondary indexes
  • Hash-sharded indexes
  • Duplicate indexes
  • Inverted indexes
  • Partial indexes
  • Spatial indexes

We will also look at how indexes help with table joins.

In the next set of subsections, we are going to discuss each type of index and when to use them, starting with the primary index.

Primary indexes

A primary key uniquely identifies a given row in a table, which means duplicate values and NULLs are not allowed. The index created for a primary key is called a primary index.

Whenever you create a table in CockroachDB, it's recommended to define an explicit primary key so that CockroachDB automatically creates an index for it, which can then be used to filter rows efficiently. If you don't specify a primary key during table creation, CockroachDB creates a default primary key on a hidden column called rowid, which has a unique value for each row, but queries will not benefit from it the way they would from a well-chosen explicit primary key.

Let's understand how indexes work with an example, where we are going to create a database and a table with a primary key:

  1. Create a database called test:

    root@localhost:26257/defaultdb> CREATE DATABASE IF NOT EXISTS test;

    CREATE DATABASE

    Time: 279ms total (execution 279ms / network 0ms)

    Switch to the database 'test'.

    root@localhost:26257/defaultdb> use test;

    SET

    Time: 116ms total (execution 115ms / network 0ms)

  2. Create a table called accounts with id being the primary key:

    root@localhost:26257/test> CREATE TABLE accounts (

             id UUID PRIMARY KEY,

             name string,

             balance INT8

    );

    CREATE TABLE

    Time: 195ms total (execution 195ms / network 0ms)

  3. Now, we will look at the indexes created for the accounts table using the SHOW INDEXES command:

    SHOW INDEXES FROM accounts;

    table_name | index_name | non_unique | seq_in_index | column_name | direction | storing | implicit
  -------------+------------+------------+--------------+-------------+-----------+---------+-----------
    accounts   | primary    |   false    |            1 | id          | ASC       |  false  |  false
  (1 row)

    Time: 7ms total (execution 7ms / network 0ms)


  4. In order to understand how this primary key index helps with query performance, we can use EXPLAIN to look at the statement plans.
  5. If you are retrieving all the accounts without any filters, a full table scan is unavoidable, since all the rows must be returned:

    root@localhost:26257/test> explain select * from accounts;

        tree    |         field         |     description

    ------------+-----------------------+-----------------

                | distribution          | full

                | vectorized            | false

        scan    |                       |

                | estimated row count   | 1

                | table                 | accounts@primary

                | spans                 | FULL SCAN

    (6 rows)

  6. Now, if you want to retrieve just one row based on the ID, you can avoid a full table scan, since CockroachDB has already indexed the id column.
  7. As you can see in the following explain output, within the spans, we no longer do a full table scan:

    root@localhost:26257/test> explain select * FROM accounts where id = '123e4567-e89b-12d3-a456-426655440000';

        tree  |         field         |    description    

    ----------+-----------------------+-------------------

              | distribution          | local

              | vectorized            | false

        scan  |                       |

              | estimated row count   | 1

              | table                 | accounts@primary

              | spans                 | [/'123e4567-e89b-12d3-a456-426655440000' - /'123e4567-e89b-12d3-a456-426655440000']

    (6 rows)

If your queries frequently filter on the same set of columns together, you should also consider creating a composite primary key that includes those columns.
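As a sketch, if queries on an orders table almost always filter on both customer_id and order_date, a composite primary key could look like this (the orders table and its columns are illustrative, not part of the sample schema used in this chapter):

```sql
-- Hypothetical composite primary key: rows are sorted by
-- (customer_id, order_date), so filters on both columns, or on the
-- leading customer_id column alone, can use the primary index.
CREATE TABLE orders (
    customer_id UUID,
    order_date DATE,
    amount DECIMAL(10,2),
    PRIMARY KEY (customer_id, order_date)
);
```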

In the next section, we will learn about secondary indexes.

Secondary indexes

A secondary index is an index that you create on non-primary columns. If your query involves retrieving a column that's not a primary key and you want to improve the query performance, you can create secondary indexes. Any index that you create on a non-primary key is called a secondary index, and duplicate values are allowed for secondary indexes. For the test.accounts table, if the query contains a non-primary column such as name, then we would still need a full table scan. Let's try this with an example, where we will just use a non-primary column in the query:

root@localhost:26257/test> explain select name FROM accounts where name = 'crdb' ;

     tree     |        field        |   description
 -------------+---------------------+------------------
              | distribution        | full
              | vectorized          | false
    filter    |                     |
     │        | filter              | name = 'crdb'
     └── scan |                     |
              | estimated row count | 1
              | table               | accounts@primary
              | spans               | FULL SCAN
 (8 rows)

Since we are now filtering on a non-primary column, CockroachDB must inspect each row and apply a filtering condition, and the index on the primary key id doesn't help here. So, let's create one more index on the column name:

root@localhost:26257/test> CREATE INDEX on accounts ( name );

CREATE INDEX

Time: 2.053s total (execution 0.256s / network 1.797s)

Whenever you create a secondary index, CockroachDB automatically appends the primary key to it as an implicit column, making it a composite index. The new index on the name column is a secondary index:

root@localhost:26257/test> show indexes from accounts;

   table_name |    index_name     | non_unique | seq_in_index | column_name | direction | storing | implicit
 -------------+-------------------+------------+--------------+-------------+-----------+---------+-----------
   accounts   | primary           |   false    |            1 | id          | ASC       |  false  |  false
   accounts   | accounts_name_idx |   true     |            1 | name        | ASC       |  false  |  false
   accounts   | accounts_name_idx |   true     |            2 | id          | ASC       |  false  |  true
 (3 rows)

Now if you run the previous query, you should see that the full table scan is avoided because of the new index that we have created:

root@localhost:26257/test> explain select name from accounts where name = 'crdb' ;

    tree  |        field        |       description
 ---------+---------------------+----------------------------
          | distribution        | local
          | vectorized          | false
    scan  |                     |
          | estimated row count | 1
          | table               | accounts@accounts_name_idx
          | spans               | [/'crdb' - /'crdb']
 (6 rows)

Time: 1ms total (execution 1ms / network 0ms)
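Note that the query above only selects name, which the index itself contains. If a query also needs columns that are not part of the index, CockroachDB has to join back to the primary index. One way to avoid that, sketched below under the assumption that such queries are common, is to store the extra columns in the secondary index with the STORING clause (the index name here is chosen explicitly for illustration):

```sql
-- Keep balance in the index without making it part of the sort key,
-- so the query below can be answered from the index alone.
CREATE INDEX accounts_name_storing_idx ON accounts (name) STORING (balance);

SELECT name, balance FROM accounts WHERE name = 'crdb';
```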

Hash-sharded indexes

Hash-sharded indexes can improve query performance when you must create an index on a column whose values are sequential. A hash-sharded index evenly spreads traffic that would otherwise hit one sequential range across multiple ranges, avoiding hotspots on any given range. Since this is an experimental feature at the time of writing, the implementation and overall performance might change over time. Let's begin:

  1. Within the client session, you have to first enable this feature as shown in the following code block:

    root@localhost:26257/test> set experimental_enable_hash_sharded_indexes = ON;

    SET

    Time: 1ms total (execution 0ms / network 0ms)

  2. Let's create a table called customers with integer and string data types. Here, the id column is supposed to be a sequence:

    root@localhost:26257/test> create TABLE customers ( id int PRIMARY KEY, name string);

    CREATE TABLE

    Time: 160ms total (execution 160ms / network 0ms)

  3. Now, let's create the hash-sharded index for this primary key, as shown in the following code:

    root@localhost:26257/test> ALTER TABLE customers ALTER PRIMARY KEY USING COLUMNS (id) USING HASH WITH BUCKET_COUNT = 10;

    NOTICE: primary key changes are finalized asynchronously; further schema changes on this table may be restricted until the job completes

    ALTER TABLE

    Time: 4.551s total (execution 0.297s / network 4.253s)

When you create a hash-sharded index, CockroachDB adds a computed shard column, splits the index into n_buckets shards based on a hash of the key, and stores each index shard in the underlying key-value store with the computed shard column's value as its prefix.

  4. Let's look at how the indexes on the customers table look now:

    root@localhost:26257/test> show indexes from customers;

    table_name |    index_name    | non_unique | seq_in_index |        column_name        | direction | storing | implicit
  -------------+------------------+------------+--------------+---------------------------+-----------+---------+-----------
    customers  | primary          |   false    |            1 | crdb_internal_id_shard_10 | ASC       |  false  |  false
    customers  | primary          |   false    |            2 | id                        | ASC       |  false  |  false
    customers  | customers_id_key |   false    |            1 | id                        | ASC       |  false  |  false
    customers  | customers_id_key |   false    |            2 | crdb_internal_id_shard_10 | ASC       |  false  |  true
  (4 rows)

    Time: 34ms total (execution 26ms / network 8ms)

You can create a hash-sharded secondary index as well.
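As a sketch, assuming an events table with a monotonically increasing created_at timestamp column (both are hypothetical, not part of the sample schema), a hash-sharded secondary index could be created as follows, with the experimental setting from step 1 still enabled:

```sql
-- Hypothetical example: shard a secondary index on a monotonically
-- increasing timestamp column to avoid a write hotspot on the
-- range holding the latest values.
CREATE TABLE events (
    id UUID PRIMARY KEY,
    created_at TIMESTAMP
);

CREATE INDEX events_created_at_idx ON events (created_at)
    USING HASH WITH BUCKET_COUNT = 8;
```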

Duplicate indexes

Duplicate indexes improve the read performance. Please refer to Chapter 4, Geo-Partitioning, where we discussed duplicate indexes and how they work internally.

Next, we will learn about inverted indexes.

Inverted indexes

Inverted indexes store a mapping from values within JSONB, array, and spatial data to the rows that hold those values. For example, if you have a column storing a JSON document that contains a key called country, you can add a WHERE clause to your query asking for all the rows where country is USA or where country is Canada.

Inverted indexes filter on components of tokenizable data. The JSONB data type is built on two structures that can be tokenized:

  • Objects – Collections of key and value pairs where each key-value pair is a token
  • Arrays – Lists of values where every value in the array is a token

Let's look at the following JSON document:

    "student": [

          {

             "id":"01",

             "firstname": "Steve",

             "lastname": "Jobs"

          },

          {

             "id":"02",

             "firstname": "Steve",

             "lastname": "Wozniak"

          }

    ]    

Now, the inverted index for the preceding JSON will have an entry for each component, which maps to the original document as follows:

"student" :    "id" : "01"

"student" :    "firstname" : "Steve"

"student" :    "lastname" : "Jobs"

"student" :    "id" : "02"

"student" :    "lastname": "Wozniak"

Now you can search the JSON document based on student ID, student first name, student last name, and so on.
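Concretely, an inverted index on a JSONB column can be created and queried roughly as follows. This is a sketch: the students table and index name are hypothetical, and the query uses the JSONB containment operator @>, which inverted indexes can accelerate:

```sql
-- Hypothetical table with a JSONB document per row.
CREATE TABLE students (
    id UUID PRIMARY KEY,
    details JSONB
);

-- Inverted index over every key-value token in the JSONB document.
CREATE INVERTED INDEX students_details_idx ON students (details);

-- The @> containment operator can be served by the inverted index,
-- avoiding a full table scan.
SELECT id FROM students WHERE details @> '{"firstname": "Steve"}';
```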

Partial indexes

A partial index is typically created based on a Boolean expression. CockroachDB internally indexes the columns and rows that evaluate to true for a given expression.

Let's understand partial indexes with an example:

  1. First, we will create the table books with a few columns:

    root@localhost:26257/test> create table books ( id int, title string, author string, price float );

    CREATE TABLE

    Time: 270ms total (execution 269ms / network 1ms)

  2. Let's create the partial index based on the price of the book. Here, we are creating an index for all the books that are priced more than $50:

    root@localhost:26257/test> CREATE INDEX ON books (id, title, author) WHERE price > 50.00;

    CREATE INDEX

    Time: 2.596s total (execution 0.314s / network 2.282s)

  3. Now, whenever you use a filtering condition that matches with the one in the partial index, a partial index will be used to retrieve a subset of rows:

    root@localhost:26257/test> explain select id, title, author from books where price > 50.0;

       tree |        field        |                   description
    --------+---------------------+--------------------------------------------------
            | distribution        | full
            | vectorized          | false
       scan |                     |
            | estimated row count | 1
            | table               | books@books_id_title_author_idx (partial index)
            | spans               | FULL SCAN
    (6 rows)

    Time: 2ms total (execution 1ms / network 1ms)

Partial indexes improve the query performance in the following ways:

  • They contain fewer rows than full indexes. During read queries, if the filtering condition matches the partial index predicate, only the rows in the partial index are scanned. Since a partial index contains fewer rows than a regular index, fewer rows are scanned, which performs better.
  • Write queries on tables with a partial index only perform an index write when the rows inserted satisfy the partial index predicate, unlike regular indexes, which are updated during every write.

In the next section, we will learn about spatial indexes.

Spatial indexes

Spatial indexes were introduced in v20.2, are used to store information about spatial objects, and work with the two-dimensional data types GEOMETRY and GEOGRAPHY. A spatial object holds information about a geographical location in the form of an object, where an object can be a point, a line, a polygon, or an area.

A spatial index is a special type of inverted index. A spatial index maps from a cell in a quadtree to one or more shapes whose coverings include that cell. Each cell can be part of multiple shapes, where a given cell represents a location.

Spatial indexes are useful in the following situations:

  • We are filtering based on spatial predicate functions, for example, ST_Covers, ST_Contains, ST_Equals, ST_Overlaps, and so on.
  • Joins that involve columns that store spatial objects.

CockroachDB uses the S2 geometry library (https://s2geometry.io/) to divide the space being indexed and stores the information in a quadtree data structure.

A quadtree is a tree data structure in which each internal node has exactly four children. Each cell in a quadtree has information about four child cells in the next level. In the following diagram, you can see an example of how an image can be represented using a quadtree data structure:

Figure 6.2 - Image representation using a Quadtree data structure


Following are some examples of creating spatial indexes.

First, let's create a table with GEOGRAPHY and GEOMETRY columns:

root@localhost:26258/test> CREATE TABLE geo_table (

     id UUID PRIMARY KEY,

     geog GEOGRAPHY(GEOMETRY,4326) NULL,

     geom GEOMETRY(GEOMETRY,3857) NULL

);

CREATE TABLE

Time: 151ms total (execution 149ms / network 2ms)

Following is an example of creating a spatial index on a GEOMETRY object with default settings:

root@localhost:26258/test> CREATE INDEX geom_idx_1 ON geo_table USING GIST(geom);

CREATE INDEX

Time: 1.647s total (execution 0.137s / network 1.511s)

Following is an example of creating a spatial index on a GEOGRAPHY object with default settings:

root@localhost:26258/test> CREATE INDEX geog_idx_1 ON geo_table USING GIST(geog);

CREATE INDEX

Time: 1.709s total (execution 0.144s / network 1.564s)
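As a sketch, a query like the following could then be served by geog_idx_1, since ST_Covers is one of the index-accelerated predicate functions; the coordinates are arbitrary values chosen for illustration:

```sql
-- Find rows whose geography covers a given point.
SELECT id FROM geo_table
WHERE ST_Covers(geog, ST_GeogFromText('POINT(-73.98 40.75)'));
```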

Fine-tuning spatial indexes is beyond the scope of this book and will be covered in subsequent editions.

Next, we will learn how to improve the performance of queries that involve table joins.

Table joins and indexes

Indexes are useful even when you are joining tables. You can inspect the fields that are used in filtering conditions and create appropriate indexes to avoid a full scan of other indexes.

For example, let's look at two tables: customers and purchase_orders.

The customers table stores the information about the customers as follows:

root@localhost:26258/test> CREATE TABLE customers (

         id UUID PRIMARY KEY,

         name STRING,

         email STRING,

         phone STRING

);

CREATE TABLE

Time: 209ms total (execution 209ms / network 0ms)

The purchase_orders table stores information about the purchase orders made by the customers. Here, the customer_id column references the id column of the customers table:

root@localhost:26258/test> CREATE TABLE purchase_orders (

          id UUID PRIMARY KEY,

          customer_id UUID NOT NULL REFERENCES customers ( id ),

          n_of_items INT,

          total_price DECIMAL(10,2)

);

CREATE TABLE

Time: 843ms total (execution 264ms / network 579ms)

Now, let's say you want to know the name of all the customers who have purchased more than five items. Following is the explain plan for this:

root@localhost:26258/test> explain select n_of_items, name from purchase_orders INNER JOIN customers ON purchase_orders.customer_id = customers.id and n_of_items > 5;

         tree        |        field        |        description
 --------------------+---------------------+---------------------------
                     | distribution        | full
                     | vectorized          | false
   hash join         |                     |
    │                | equality            | (customer_id) = (id)
    │                | right cols are key  |
    ├── filter       |                     |
    │    │           | filter              | n_of_items > 5
    │    └── scan    |                     |
    │                | estimated row count | 1
    │                | table               | purchase_orders@primary
    │                | spans               | FULL SCAN
    └── scan         |                     |
                     | estimated row count | 1
                     | table               | customers@primary
                     | spans               | FULL SCAN
As you can see, we are making use of primary indexes on both the customers and purchase_orders tables. But we also have a filtering condition for which we are using the column n_of_items. We can further improve the query performance by adding one more index for the n_of_items column:

CREATE INDEX ON purchase_orders (n_of_items) STORING (customer_id);

Now, let's once again look at the explain plan for the previous query:

root@localhost:26258/test> explain select n_of_items, name from purchase_orders INNER JOIN customers ON purchase_orders.customer_id = customers.id and n_of_items > 5;

       tree     |        field        |                  description
 ---------------+---------------------+------------------------------------------------
                | distribution        | full
                | vectorized          | false
   hash join    |                     |
    │           | equality            | (customer_id) = (id)
    │           | right cols are key  |
    ├── scan    |                     |
    │           | estimated row count | 1
    │           | table               | purchase_orders@purchase_orders_n_of_items_idx
    │           | spans               | [/6 - ]
    └── scan    |                     |
                | estimated row count | 1
                | table               | customers@primary
                | spans               | FULL SCAN
As you can see, now we no longer do a full scan of the purchase_orders@primary index. So, based on the columns used in filtering conditions during table joins, you can create appropriate indexes.

Next, we will go over some of the best practices to consider related to indexes that can improve the query performance.

Best practices while using indexes

Whenever you are using indexes, you can follow certain guidelines to make sure you get the best performance for your queries. Following are some of the key points to remember:

  • Avoid creating an index on a sequential column. Because consecutive values land in the same range, most requests hit that one range, creating a hotspot that can slow down queries. It is best to use UUIDs or randomly generated keys instead. If you must create an index on a column that is sequential in nature, use a hash-sharded index, as discussed in the Hash-sharded indexes section.
  • If you are using multiple columns in your WHERE clause or in your ORDER BY clause, you should consider creating an index for all these columns.
  • When an index includes multiple columns, order the columns so that the ones filtered with more restrictive operators, such as = and IN, come before the ones filtered with LIKE, >, !=, and so on.
  • You should drop the indexes that are not getting used. This will improve the write performance, as fewer indexes will have to be updated. Right now, there is no easy way to know the unused indexes. This requires manually going through the logical plans and identifying the indexes that are not getting used.

You can use DROP INDEX to drop a specific index.

For example, let's drop an index created previously, in the Partial indexes section:

root@localhost:26258/test> DROP INDEX books@books_id_title_author_idx;

NOTICE: the data for dropped indexes is reclaimed asynchronously

HINT: The reclamation delay can be customized in the zone configuration for the table.

DROP INDEX

Time: 1.659s total (execution 0.178s / network 1.482s)

After this, if you execute SHOW INDEXES, you should no longer see the books@books_id_title_author_idx index:

root@localhost:26258/test> SHOW INDEXES FROM books;

    table_name | index_name | non_unique | seq_in_index | column_name | direction | storing | implicit
  -------------+------------+------------+--------------+-------------+-----------+---------+-----------
    books      | primary    |   false    |            1 | rowid       | ASC       |  false  |  false
  (1 row)

Time: 17ms total (execution 17ms / network 0ms)

We have discussed various types of indexes beyond primary and secondary. Make sure you understand these specialized indexes and use them appropriately. If a given query uses an index but is still slow, investigate whether that index really suits the query or whether some other type of index would improve the performance.

You can also select a specific index in the query if you think that's going to improve the performance.
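In CockroachDB this is done with the table@index hint syntax. For example, to force the secondary index created earlier on the accounts table:

```sql
-- Force the optimizer to use accounts_name_idx rather than
-- letting it choose an index on its own.
SELECT name FROM accounts@accounts_name_idx WHERE name = 'crdb';
```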

Summary

In this chapter, we learned about indexes, several special types of indexes, how they work internally, and the best practices for maximum query performance. It is important to understand the columns in your tables and query patterns and pick relevant indexes for maximum query performance.

In the next chapter, we will learn about high availability and how to deploy CockroachDB in order to achieve zero downtime and to make it highly available.
