In Chapter 5, Fault Tolerance and Auto-Rebalancing, we learned about fault-tolerance and auto-recovery strategies in CockroachDB. In this chapter, we will learn everything about indexes, what they are, and how they improve query times.
Although indexes help improve the read performance, a wrong index can slow down the queries, including the writes, and take up more storage space. So, it is important to identify the query pattern and create appropriate indexes.
The following topics will be covered in this chapter:
We will try creating various types of indexes in this chapter which would require you to have CockroachDB installed. If you still haven't done so, please refer to the Technical requirements section in Chapter 2, How Does CockroachDB Work Internally?.
An index or a database index helps with returning the query results quickly, by avoiding full table scans. An index can be created for a specific table and can include one or more keys. Keys refer to the columns in the table. However, there will be extra space used to keep a separate sorted copy of indexed columns.
Let's take a simple example and see how an index works.
Consider a population table with the following columns and some sample values:
Now, let's say you just want to retrieve the list of populations for specific continents, for example:
SELECT population_in_millions, country FROM population WHERE continent = "Asia";
Here, in order to find rows 1 and 4, which are countries in Asia, you would have to iterate through each of the rows in the table, which is called a full table scan.
Now, if you want to avoid a full table scan, you can create an index on the continent as follows:
CREATE INDEX ON population (continent);
Internally, CockroachDB keeps track of all the continents and keeps a mapping from a given continent to all its relevant rows, as shown here:
Africa -> (3)
Asia -> ( 1, 4 )
Europe -> ( 5 )
North America -> (2 )
Now, if you run the same query, SELECT population_in_millions, country FROM population WHERE continent = "Asia", once again, CockroachDB will identify that the filtering condition has the column continent and there is an index already available for that. So, in this case, based on the value Asia, it will directly retrieve rows (1 , 4) from the continent index and get relevant column values and return them. In this case, it has avoided a full table scan. Although this example only has five rows, the same concept is applicable even when a table contains millions of rows. So, in such cases, avoiding a full table scan can significantly improve the query performance. At the same time, writes tend to get a little slower as, with each write, even the index must be updated.
When you create an index on a column or a set of columns, CockroachDB internally makes a copy of the values of that column or columns and sorts them. So, whenever you execute a query that involves filters on indexed columns, a subset of rows is selected from the index first, rather than scanning the entire table. This improves the overall query performance.
Next, we are going to discuss various types of indexes that are available in CockroachDB.
Based on the query pattern and columns in the table, you should decide what kind of index is going to help with the performance.
The following are the types of indexes available in CockroachDB:
In the next set of subsections, we are going to discuss each type of index and when to use them, starting with the primary index.
A primary key uniquely identifies a given row in a table. This means that the primary key is unique for a given row and duplicate values or NULLs are not allowed. An index created for a primary key is called a primary index.
Whenever you create a table in CockroachDB, it's recommended to have an explicit primary key, so that CockroachDB automatically creates an index for it, which can be used to filter the rows for better performance. Even if you don't create a primary key during table creation, CockroachDB by default creates a primary key called rowid, which will have a unique value for each row, but its performance will not be as good as that of the primary key.
Let's understand how indexes work with an example, where we are going to create a database and a table with a primary key:
root@localhost:26257/defaultdb> CREATE DATABASE IF NOT EXISTS test;
CREATE DATABASE
Time: 279ms total (execution 279ms / network 0ms)
Switch to the database 'test'.
root@localhost:26257/defaultdb> use test;
SET
Time: 116ms total (execution 115ms / network 0ms)
root@localhost:26257/test> CREATE TABLE accounts (
id UUID PRIMARY KEY,
name string,
balance INT8
);
CREATE TABLE
Time: 195ms total (execution 195ms / network 0ms)
SHOW INDEXES FROM accounts;
table_name | index_name | non_unique | seq_in_index | column_name | direction | storing | implicit
-------------+------------+------------+--------------+-------------+-----------+---------+-----------
accounts | primary | false | 1 | id | ASC | false | false
(1 row)
Time: 7ms total (execution 7ms / network 0ms)
In the next section, we will learn about hash-sharded indexes that are used in relation to sequences.
root@localhost:26257/test> explain select * from accounts;
tree | field | description
------------+-----------------------+-----------------
| distribution | full
| vectorized | false
scan | |
| estimated row count | 1
| table | accounts@primary
| spans | FULL SCAN
(6 rows)
root@localhost:26257/test> explain select * FROM accounts where id = '123e4567-e89b-12d3-a456-426655440000';
tree | field | description
----------+-----------------------+-------------------
| distribution | local
| vectorized | false
scan | |
| estimated row count | 1
| table | accounts@primary
| spans | [/'123e4567-e89b-12d3-a456-426655440000' - /'123e4567-e89b-12d3-a456-426655440000']
(6 rows)
If multiple columns are used in queries, you should also consider creating a composite primary key that includes all the columns that are often used together.
In the next section, we will learn about secondary indexes.
A secondary index is an index that you create on non-primary columns. If your query involves retrieving a column that's not a primary key and you want to improve the query performance, you can create secondary indexes. Any index that you create on a non-primary key is called a secondary index, and duplicate values are allowed for secondary indexes. For the test.accounts table, if the query contains a non-primary column such as name, then we would still need a full table scan. Let's try this with an example, where we will just use a non-primary column in the query:
root@localhost:26257/test> explain select name FROM accounts where name = 'crdb' ;
tree | field | description
--------------+-----------------------+-------------------
| distribution | full
| vectorized | false
filter | |
│ | filter | name = 'crdb'
└── scan | |
| estimated row count | 1
| table | accounts@primary
| spans | FULL SCAN
(8 rows)
Since we are now filtering on a non-primary column, CockroachDB must inspect each row and apply a filtering condition, and the index on the primary key id doesn't help here. So, let's create one more index on the column name:
root@localhost:26257/test> CREATE INDEX on accounts ( name );
CREATE INDEX
Time: 2.053s total (execution 0.256s / network 1.797s)
Whenever you create a secondary index, CockroachDB automatically creates a composite index including the primary key. Also, the index on the column name is called a secondary index:
root@localhost:26257/test> show indexes from accounts;
table_name | index_name | non_unique | seq_in_index | column_name | direction | storing | implicit
-------------+-------------------+------------+--------------+-------------+-----------+---------+-----------
accounts | primary | false | 1 | id | ASC | false | false
accounts | accounts_name_idx | true | 1 | name | ASC | false | false
accounts | accounts_name_idx | true | 2 | id | ASC | false | true
(3 rows)
Now if you run the previous query, you should see that the full table scan is avoided because of the new index that we have created:
root@localhost:26257/test> explain select name from accounts where name = 'crdb' ;
tree | field | description
----------+--------------------+--------------------
| distribution | local
| vectorized | false
Scan | |
| estimated row count| 1
| table | accounts@accounts_name_idx
| spans | [/'crdb' - /'crdb']
(6 rows)
Time: 1ms total (execution 1ms / network 0ms)
Hash-sharded indexes can improve query performance when you must create an index on a column that's a sequence. Hash-sharded indexes evenly spread the traffic to a sequential range across multiple ranges to avoid hotspots for any given range. Since this is a new experimental feature, the implementation and overall performance might change over time. Let's begin:
root@localhost:26257/test> set experimental_enable_hash_sharded_indexes = ON;
SET
Time: 1ms total (execution 0ms / network 0ms)
root@localhost:26257/test> create TABLE customers ( id int PRIMARY KEY, name string);
CREATE TABLE
Time: 160ms total (execution 160ms / network 0ms)
root@localhost:26257/test> ALTER TABLE customers ALTER PRIMARY KEY USING COLUMNS (id) USING HASH WITH BUCKET_COUNT = 10;
NOTICE: primary key changes are finalized asynchronously; further schema changes on this table may be restricted until the job completes
ALTER TABLE
Time: 4.551s total (execution 0.297s / network 4.253s)
When you create a hash-sharded index, CockroachDB creates n_buckets computed columns, shards the primary index ID into n_buckets number of shards, and then stores each index shard in the underlying key-value store with one of the computed column's hashes as its prefix.
root@localhost:26257/test> show indexes from customers;
table_name | index_name | non_unique | seq_in_index | column_name | direction | storing | implicit
-------------+------------------+------------+--------------+-----------------------------+-----------+---------+-----------
customers | primary | false | 1 | crdb_internal_id_shard_5000 | ASC | false | false
customers | primary | false | 2 | id | ASC | false | false
customers | customers_id_key | false | 1 | id | ASC | false | false
customers | customers_id_key | false | 2 | crdb_internal_id_shard_5000 | ASC | false | true
(4 rows)
Time: 34ms total (execution 26ms / network 8ms)
You can create a hash-sharded secondary index as well.
Duplicate indexes improve the read performance. Please refer to Chapter 4, Geo-Partitioning, where we discussed duplicate indexes and how they work internally.
Next, we will learn about inverted indexes.
Inverted indexes store the mapping of values within JSONB, arrays, and spatial data to the row that holds that value. For example, if you have a column where you are storing a JSON document, and let's say that JSON document contains a key called country, then you can add a WHERE clause in your query, where you can say get me all the rows that have country:USA and country:Canada.
Inverted indexes filter on components of tokenizable data. The JSONB data type is built on two structures that can be tokenized:
Let's look at the following JSON document:
"student": [
{
"id":"01",
"firstname": "Steve",
"lastname": "Jobs"
},
{
"id":"02",
"firstname": "Steve",
"lastname": "Wozniak"
}
]
Now, the inverted index for the preceding JSON will have an entry for each component, which maps to the original document as follows:
"student" : "id" : "01"
"student" : "firstname" : "Steve"
"student" : "lastname" : "Jobs"
"student" : "id" : "02"
"student" : "lastname": "Wozniak"
Now you can search the JSON document based on student ID, student first name, student last name, and so on.
A partial index is typically created based on a Boolean expression. CockroachDB internally indexes the columns and rows that evaluate to true for a given expression.
Let's understand partial indexes with an example:
root@localhost:26257/test> create table books ( id int, title string, author string, price float );
CREATE TABLE
Time: 270ms total (execution 269ms / network 1ms)
root@localhost:26257/test> CREATE INDEX ON books (id, title, author) WHERE price > 50.00;
CREATE INDEX
Time: 2.596s total (execution 0.314s / network 2.282s)
root@localhost:26257/test> explain select id, name, author from books where price > 50.0;
tree | field | description
-------+---------------------+-------------------------------------------------
| distribution | full
| vectorized | false
scan | |
| estimated row count | 1
| table | books@books_id_name_author_idx (partial index)
| spans | FULL SCAN
(6 rows)
Time: 2ms total (execution 1ms / network 1ms)
Partial indexes improve the query performance in the following ways:
In the next section, we will learn about spatial indexes.
Spatial indexes were introduced in the v20.2.16 version, are used to store information about spatial objects, and mostly work with two-dimensional data types such as GEOMETRY and GEOGRAPHY. A spatial object holds information about a geographical location in the form of an object. Here, an object can be a point, a line, a polygon, or an area.
A spatial index is a special type of inverted index. A spatial index maps from a cell in a quadtree to one or more shapes whose coverings include that cell. Each cell can be part of multiple shapes, where a given cell represents a location.
Spatial indexes are useful in the following situations:
CockroachDB uses the S2 geometry library (https://s2geometry.io/) to divide the space being indexed and stores the information in a quadtree data structure.
A quadtree is a tree data structure in which each internal node has exactly four children. Each cell in a quadtree has information about four child cells in the next level. In the following diagram, you can see an example of how an image can be represented using a quadtree data structure:
Following are some examples of creating spatial indexes.
First, let's create a table with GEOGRAPHY and GEOMETRY columns:
root@localhost:26258/test> CREATE TABLE geo_table (
id UUID PRIMARY KEY,
geog GEOGRAPHY(GEOMETRY,4326) NULL,
geom GEOMETRY(GEOMETRY,3857) NULL
);
CREATE TABLE
Time: 151ms total (execution 149ms / network 2ms)
Following is an example of creating a spatial index on a GEOMETRY object with default settings:
root@localhost:26258/test> CREATE INDEX geom_idx_1 ON geo_table USING GIST(geom);
CREATE INDEX
Time: 1.647s total (execution 0.137s / network 1.511s)
Following is an example of creating a spatial index on a GEOGRAPHY object with default settings:
root@localhost:26258/test> CREATE INDEX geog_idx_1 ON geo_table USING GIST(geog);
CREATE INDEX
Time: 1.709s total (execution 0.144s / network 1.564s)
Fine-tuning spatial indexes is beyond the scope of this book and will be covered in subsequent editions.
Next, we will learn how to improve the performance of queries that involve table joins.
Indexes are useful even when you are joining tables. You can inspect the fields that are used in filtering conditions and create appropriate indexes to avoid a full scan of other indexes.
For example, let's look at two tables: customers and purchase_orders.
The customers table stores the information about the customers as follows:
root@localhost:26258/test> CREATE TABLE customers (
id UUID PRIMARY KEY,
name STRING,
email STRING,
phone STRING
);
CREATE TABLE
Time: 209ms total (execution 209ms / network 0ms)
The purchase_orders table stores information about the purchase orders made by the customers. Here, the customer_id column references the id column of the customers table:
root@localhost:26258/test> CREATE TABLE purchase_orders (
id UUID PRIMARY KEY,
customer_id UUID NOT NULL REFERENCES customers ( id ),
n_of_items INT,
total_price DECIMAL(10,2)
);
CREATE TABLE
Time: 843ms total (execution 264ms / network 579ms)
Now, let's say you want to know the name of all the customers who have purchased more than five items. Following is the explain plan for this:
root@localhost:26258/test> explain select n_of_items, name from purchase_orders INNER JOIN customers ON purchase_orders.customer_id = customers.id and n_of_items > 5;
tree | field | description
-----------------+---------------------+--------------------------
| distribution | full
| vectorized | false
hash join | |
│ | equality | (customer_id) = (id)
│ | right cols are key |
├── filter | |
│ │ | filter | n_of_items > 5
│ └── scan | |
│ | estimated row count | 1
│ | table | purchase_orders@primary
│ | spans | FULL SCAN
└── scan | |
| estimated row count | 1
| table | customers@primary
| spans | FULL SCAN
As you can see, we are making use of primary indexes on both the customers and purchase_orders tables. But we also have a filtering condition for which we are using the column n_of_items. We can further improve the query performance by adding one more index for the n_of_items column:
CREATE INDEX ON purchase_orders (n_of_items) STORING (customer_id);
Now, let's once again look at the explain plan for the previous query:
root@localhost:26258/test> explain select n_of_items, name from purchase_orders INNER JOIN customers ON purchase_orders.customer_id = customers.id and n_of_items > 5;
tree | field | description
------------+---------------------+-------------------------------------------------
| distribution | full
| vectorized | false
hash join | |
│ | equality | (customer_id) = (id)
│ | right cols are key |
├── scan | |
│ | estimated row count | 1
│ | table | purchase_orders@purchase_orders_n_of_items_idx
│ | spans | [/6 - ]
└── scan | |
| estimated row count | 1
| table | customers@primary
| spans | FULL SCAN
As you can see, now we no longer do a full scan of the purchase_orders@primary index. So, based on the columns used in filtering conditions during table joins, you can create appropriate indexes.
Next, we will go over some of the best practices to consider related to indexes that can improve the query performance.
Whenever you are using indexes, you can follow certain guidelines to make sure you get the best performance for your queries. Following are some of the key points to remember:
You can use DROP INDEX to drop a specific index.
For example, let's drop an index created previously, in the Partial indexes section:
root@localhost:26258/test> DROP INDEX books@books_id_name_author_idx;
NOTICE: the data for dropped indexes is reclaimed asynchronously
HINT: The reclamation delay can be customized in the zone configuration for the table.
DROP INDEX
Time: 1.659s total (execution 0.178s / network 1.482s)
After this, if you execute SHOW INDEXES, you should not see the books@books_id_name_author_idx index:
root@localhost:26258/test> SHOW INDEXES FROM books;
table_name | index_name | non_unique | seq_in_index | column_name | direction | storing | implicit
-------------+------------+------------+--------------+-------------+-----------+---------+-----------
books | primary | false | 1 | rowid | ASC | false | false
(1 row)
Time: 17ms total (execution 17ms / network 0ms)
We have discussed various types of indexes other than primary and secondary. Make sure you understand these specialized indexes and use them appropriately. If a given query is using an index but is still slow, perhaps you should investigate further to see if that index makes sense for the query or if some other type of index would better improve the performance.
You can also select a specific index in the query if you think that's going to improve the performance.
In this chapter, we learned about indexes, several special types of indexes, how they work internally, and the best practices for maximum query performance. It is important to understand the columns in your tables and query patterns and pick relevant indexes for maximum query performance.
In the next chapter, we will learn about high availability and how to deploy CockroachDB in order to achieve zero downtime and to make it highly available.