In Chapter 9, An Overview Of Security Aspects, we learned about various mechanisms for securing SQL workloads on CockroachDB. The ever-increasing presence of software running on the cloud has led to increased attack surfaces. So, it is important to carefully examine your security posture and to take measures to fill any security gaps.
In this chapter, we will go over how to troubleshoot issues that you come across while using CockroachDB. Since this is a very broad topic, we will cover a few examples from each category. Debug logs can help in narrowing down possible trouble areas and eventually finding the root cause. In the first section, we will learn about collecting debug logs. Next, we will go through some of the common causes of connection issues. In the later sections, we will cover various topics that can cause a query to execute slowly or fail completely. We will also go over ideal resource allocation, general guidelines to follow during an upgrade, and advanced debugging options.
The following topics will be covered in this chapter:
There are some commands discussed in this chapter that require you to have CockroachDB installed. If you still haven't done so, please refer to the Technical requirements section in Chapter 2, How Does CockroachDB Work Internally?
We will first start with a discussion on collecting debug logs.
Whenever we debug issues, it's very helpful to have aggregated logs. There are tools such as Datadog that provide a single pane of glass for monitoring a CockroachDB cluster. Node logs and the admin UI provide helpful information for troubleshooting issues. Please refer to Chapter 8, Exploring the Admin User Interface, to familiarize yourself with the user interface. Logs provide detailed information about all the activities happening in a cluster. CockroachDB provides various types of log files based on their intent. CockroachDB also supports log levels and log channels. It is also possible to emit certain log messages to an external destination for further processing.
In this section, we will first discuss various log files and what information each file contains, then move on to understanding the log levels that determine how much information you want to log based on their severity. Later, we will learn about the concept of log channels, sending logs to an external resource for further processing, and collecting all the logs.
CockroachDB provides various node-level log files, each specializing in a particular area, such as the health of nodes, storage engine activity, or security-related events. Following is the list of the various log files that are available in CockroachDB:
Next, we will discuss log levels.
Log levels indicate the severity of a log message, and based on the severity, you have to decide whether a message should be handled. In general, the ERROR and FATAL log levels should be handled in your application logic, which should take appropriate action based on the error type. Following are the four log levels in CockroachDB:
Next, we will learn about a concept called log channels.
Log channels are distinguished based on the type of operation rather than the severity. A log channel is useful when you have multiple teams that have to look at different sets of logs based on their responsibilities. For example, logs related to configuration changes made by an admin user in production will be more useful to an infosec or security team than to an application developer team. Log channels are very useful when integrated with a sink, which is an external resource capable of receiving data, such as Slack. For example, all OPS and HEALTH messages can go to a Site Reliability Engineering (SRE) channel, whereas USER_ADMIN, PRIVILEGES, and SENSITIVE_ACCESS messages can go to the infosec channel. Following is a list of the available log channels in CockroachDB:
Next, we will discuss sending logs to an external sink.
A sink is an external resource that is capable of receiving data. This external resource sits outside of CockroachDB and is deployed and maintained separately. It is possible to route messages of one or more log channels to an external log sink, which can be used for alerting, providing an aggregated view of the logs across all the nodes in the cluster, and further analyzing the log data, which can be useful in automating responses to unexpected or fatal events. Sinks include log files, Fluentd compatible servers, HTTP servers, and standard error streams. With appropriate configuration, it is possible to redact sensitive data while sending it to the sink. Sinks support the following parameters:
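Routing of this kind is expressed in CockroachDB's YAML logging configuration, which can be supplied at node startup. The following is a sketch only; the Fluentd-compatible collector address is an assumption and will differ per deployment:

```yaml
sinks:
  fluent-servers:
    ops:                        # forward operational messages to an SRE pipeline
      channels: [OPS, HEALTH]
      address: 127.0.0.1:5170   # illustrative Fluentd-compatible collector
      redact: true              # strip sensitive data before it leaves the node
  file-groups:
    security:                   # keep security-relevant channels in their own file
      channels: [USER_ADMIN, PRIVILEGES, SENSITIVE_ACCESS]
```

Grouping the security-relevant channels into a dedicated file group keeps those messages out of the general node logs, which simplifies access control for the infosec team.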
Additionally, it is advisable to go through technical advisories that report major issues in CockroachDB related to security and correctness. These can be accessed at https://www.cockroachlabs.com/docs/advisories/index.html. Also, Cockroach Labs keeps updating newly seen issues and resolution details at https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html.
cockroach debug zip connects to the cluster and collects the debug information from all the nodes in the cluster. So, when you start troubleshooting, it would be helpful to run this to collect all the relevant logs. Following is the syntax of the debug command:
cockroach debug zip <path_to_store_the_zip_file> [flags]
The following is an example of the debug command:
$ cockroach debug zip ./cockroach-db/logs/debug.zip --insecure --host=localhost:26258
Following are some of the important options that you have with the cockroach debug zip command:
For complete options, please check cockroach debug zip --help.
Information such as CPU usage, metrics, schema metadata, node health, stack traces, events, and jobs is collected from all the nodes.
Next, we will learn about connection issues and how to ensure we don't run into one.
Connection refused is one of the most common connection issues, seen when a PostgreSQL-compatible SQL client is unable to connect to a CockroachDB node. Many things can go wrong here, such as expired certificates or firewall settings. Following are some general guidelines on what to look for in terms of connection-related configuration:
In case of connection issues, we have to make sure that the client and the nodes of a cluster can talk to each other securely.
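A quick way to exercise these checks from the client host is with the cockroach CLI itself. The following is a sketch, assuming a secure cluster with certificates under ./certs and an illustrative hostname:

```shell
# List the certificates the client will present, including their expiry dates
cockroach cert list --certs-dir=certs

# Confirm the SQL port is reachable and the TLS handshake succeeds
cockroach sql --certs-dir=certs --host=db.example.com:26257 -e 'SELECT 1;'
```

If the second command fails, the error message usually distinguishes between a network-level problem (connection refused or timed out) and a certificate problem (handshake or verification failure), which narrows down where to look next.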
In the next section, we will learn how to identify slow queries.
It is important to know which queries are not performing as expected so that we can investigate further to find the actual cause of the slowness. Some queries can become slow as the data grows. So, it is important to benchmark some of the critical queries over time to make sure they are getting executed within the expected time. There are three major ways to identify and debug slow queries:
> SET CLUSTER SETTING sql.log.slow_query.latency_threshold = '500ms';
SET CLUSTER SETTING
Time: 120ms total (execution 120ms / network 0ms)
A slow query can be caused by various issues that are discussed in the rest of the sections in this chapter. So, once you identify a slow query, you have to identify which part of the query is slow and investigate further until you find all possible reasons.
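A common next step after identifying a slow statement is to inspect its plan with EXPLAIN ANALYZE. The following is a sketch, where the orders table and the filter are illustrative:

```sql
-- Execute the statement and report the plan along with per-stage statistics,
-- such as rows scanned and time spent, to show which part of the query is slow
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
```

A full table scan in the output, for instance, often points to a missing index on the filtered column.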
Next, we will go over some of the recommended hardware and software configurations for CockroachDB nodes.
We have to make sure that individual nodes and the entire cluster have an optimal hardware and software configuration in order to perform well. Following is a set of recommended configurations for CockroachDB:
VM capacities differ between cloud providers. So, based on the preceding recommendations, you can choose the appropriate VM for your workloads.
It's important to monitor the CPU and storage metrics for all the nodes and to have some automation in place in order to expand the cluster by adding more nodes. Although there is no hard limit, you might be better off capping the number of nodes in a cluster at around 50. Once you hit this threshold, you can add a second, separate cluster and introduce an external sharding mechanism to decide which cluster to redirect to for a given query.
In the next section, we will go through configuration-related issues.
Things can go wrong because of misconfigurations, such as using unavailable ports or configuring an incorrect storage directory at startup. CockroachDB provides several cluster-level settings. In most cases, leaving the default values works better. If you are modifying any default value, please ensure it's tested at scale in a pre-production environment. Following are some of the commonly seen configuration-related issues:
There are many other issues that you can face related to configuration. Based on the error message, you have to further investigate and decide the appropriate solution.
In the next section, we will go over general guidelines for upgrading a CockroachDB cluster.
When upgrading your cluster, the following are some general guidelines:
SET CLUSTER SETTING cluster.preserve_downgrade_option = '<current_cluster_version>';
However, you have to make sure you manually re-enable auto-finalization after ensuring the cluster is stable and there is no data corruption. The following command can be used for re-enabling auto-finalization:
RESET CLUSTER SETTING cluster.preserve_downgrade_option;
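Putting the guidelines together, a rolling upgrade of a single node might look like the following sketch; the service name, binary path, and version shown are assumptions and will differ per deployment:

```shell
# Drain the node so it stops accepting new SQL connections and sheds its leases
cockroach node drain --host=localhost:26257 --certs-dir=certs

# Stop the old process, swap in the new binary, and restart
systemctl stop cockroach
cp ./cockroach-v22.1.0 /usr/local/bin/cockroach
systemctl start cockroach

# Wait for the node to rejoin and report as live before upgrading the next one
cockroach node status --certs-dir=certs
```

Upgrading one node at a time and verifying liveness between nodes keeps the cluster available throughout the upgrade.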
In the next section, we will discuss network latency.
In the admin user interface, there is a page that shows network latencies within the cluster. It displays the round-trip latencies between each pair of nodes in the cluster, as shown in the following screenshot.
Network latency plays an important role when you are deciding on your replication strategy and topology patterns. Please refer to Chapter 4, Geo-Partitioning, to go over all the topology patterns available in CockroachDB. Based on the criticality of the data, you also have to decide whether you can tolerate zonal or regional failures.
Also, even before you decide on the region and availability zone for your CockroachDB nodes, it would be good to go through the inter-region network latencies of different cloud providers. If you are planning to have your CockroachDB cluster span multiple cloud providers, you may want to choose the same region for your second and third cloud providers in order to ensure the lowest network latency, as they will be geographically co-located. Companies such as Aviatrix also publish inter-cloud latencies for various regions.
In the last section, we will go over some of the advanced debugging options.
In the admin user interface, you have a page for advanced debugging, which, as the name indicates, shows advanced information about a cluster that can be useful in troubleshooting issues. The Advanced Debugging page has the following sections:
The following screenshot shows how the Advanced Debugging page looks on the user interface:
Under Reports you can get a report for the following items:
The Configuration section shows the following items:
The Even More Advanced Debugging section is meant for developers of CockroachDB who understand its internal implementation. It contains very detailed information and, in general, can be skipped while debugging issues unless you understand the internals of CockroachDB.
In this chapter, we went through several resources that are available for debugging and troubleshooting issues. Familiarizing yourself with the debug logs and the admin user interface is an important step if you want to quickly diagnose issues and find their root cause. Also, being aware of cluster configurations, deployment strategies, network latencies, and security settings comes in handy while troubleshooting. Tools for log aggregation, distributed tracing, span tracking, alerting, and advanced metrics can further aid you.
In the next chapter, we will learn about performance benchmarking and migration from other databases to CockroachDB.