Chapter 10: Troubleshooting Issues

In Chapter 9, An Overview Of Security Aspects, we learned about various mechanisms for securing SQL workloads on CockroachDB. The ever-increasing presence of software running on the cloud has led to increased attack surfaces. So, it is important to carefully examine your security posture and to take measures to fill any security gaps.

In this chapter, we will go over how to troubleshoot issues that you come across while using CockroachDB. Since this is a very broad topic, we will cover a few examples from each category. Debug logs can help in narrowing down possible trouble areas and eventually finding the root cause. In the first section, we will learn about collecting debug logs. Next, we will go through some of the common causes of connection issues. In the later sections, we will cover various topics that can cause a query to execute slowly or fail completely. We will also go over ideal resource allocation, general guidelines to follow during an upgrade, and advanced debugging options.

The following topics will be covered in this chapter:

  • Collecting debug logs
  • Connection issues
  • Tracking slow queries
  • Capacity planning
  • Configuration issues
  • Network latency
  • Guidelines to avoid issues during an upgrade
  • Advanced debugging options

Technical requirements

There are some commands discussed in this chapter that require you to have CockroachDB installed. If you still haven't done so, please refer to the Technical requirements section in Chapter 2, How Does CockroachDB Work Internally?

We will first start with a discussion on collecting debug logs.

Collecting debug logs

Whenever we debug issues, it's very helpful to have aggregated logs. There are tools such as Datadog that provide a single pane of glass for monitoring a CockroachDB cluster. Node logs and the admin UI provide helpful information for troubleshooting issues. Please refer to Chapter 8, Exploring the Admin User Interface, to get yourself familiar with the user interface. Logs provide detailed information about all the activities happening in a cluster. CockroachDB provides various types of log files based on their intended purpose. CockroachDB also supports log levels and log channels. It is also possible to emit certain log messages to an external destination for further processing.

In this section, we will first discuss various log files and what information each file contains, then move on to understanding the log levels that determine how much information you want to log based on their severity. Later, we will learn about the concept of log channels, sending logs to an external resource for further processing, and collecting all the logs.

Log files

CockroachDB provides various node-level log files, each specializing in a particular area, such as node health, storage engine events, or security-related activity. Following is the list of log files that are available in CockroachDB:

  • cockroach.log: A general log that contains information about all the major events occurring within the cluster.
  • cockroach-health.log: The health of the cluster.
  • cockroach-security.log: SQL security log.
  • cockroach-sql-audit.log: SQL access audit log.
  • cockroach-sql-auth.log: SQL authentication log.
  • cockroach-sql-exec.log: SQL execution log.
  • cockroach-sql-slow.log: SQL slow query log.
  • cockroach-sql-schema.log: SQL schema change log.
  • cockroach-pebble.log: Pebble key-value store log. Pebble is the default storage engine in CockroachDB.
  • cockroach-telemetry.log: Telemetry log, which contains event information about feature usage.
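
By default, these log files are written under the logs directory of the node's store directory. The following is a minimal sketch for inspecting them; the exact path is an assumption and depends on the --store and --log-dir flags used when the node was started:

# List the node's log files and follow the general log (path is an assumption)
$ ls ./cockroach-data/logs/
$ tail -f ./cockroach-data/logs/cockroach.log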

Next, we will discuss log levels.

Log levels

Log levels indicate the severity of a log message; based on the severity, you can decide whether a message needs to be handled. In general, the log levels ERROR and FATAL should be handled in your application logic, which should take appropriate action based on the error type. Following are the four log levels in CockroachDB:

  • INFO: This indicates general informational log messages. It might be prudent to turn these off in a production environment as they can occupy a lot of space and are not that relevant when debugging issues.
  • WARNING: This indicates that a normal operation might have failed but will recover automatically. Based on each case and the impact, we have to decide whether to handle this or not.
  • ERROR: This indicates that a normal operation cannot be performed. We should pay attention to these errors and make sure they are handled properly.
  • FATAL: Fatal errors need immediate attention and action as they will be of the highest severity.
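
The minimum severity that actually gets emitted is controlled per log sink. The following is a minimal sketch, assuming a CockroachDB version that accepts an inline YAML logging configuration via the --log flag; it only sends messages of severity WARNING and above to the standard error stream:

# A sketch: route only WARNING and above to stderr (inline YAML via --log is
# assumed to be supported by your CockroachDB version)
$ cockroach start-single-node --insecure \
    --log='sinks: {stderr: {filter: WARNING}}'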

Next, we will learn about a concept called log channels.

Log channels

Log channels are distinguished based on the type of operation rather than the severity. A log channel is useful when you have multiple teams that have to look at different sets of logs based on the team's responsibilities. For example, logs related to configuration changes done by an admin user in production will be more useful to an infosec or security team than an application developer team. Log channels are very useful when integrated with a sink such as Slack. A sink is an external resource that is capable of receiving data. For example, all OPS and HEALTH messages can go to a Site Reliability Engineering (SRE) channel, whereas USER_ADMIN, PRIVILEGES, and SENSITIVE_ACCESS can go to the infosec channel. Following is a list of available log channels in CockroachDB:

  • DEV: This channel is used during development and everything gets logged.
  • OPS: This channel logs cluster-level operational events, configuration changes, and jobs.
  • HEALTH: This channel logs resource usage, connection errors, range availability, and leasing events.
  • STORAGE: This is used to log Pebble storage engine related events.
  • SESSIONS: This covers sessions, connections, and authentication events.
  • SQL_SCHEMA: This is used to track schema and metadata changes involving databases, schemas, tables, sequences, views, and types.
  • USER_ADMIN: This is used to track changes in users and roles.
  • PRIVILEGES: This is used to track changes in grants and object ownership.
  • SENSITIVE_ACCESS: This is used to log data access audit events, SQL statements run by admin users, and operations that write to system tables.
  • SQL_EXEC: This channel logs SQL executions and errors during execution.
  • SQL_PERF: This channel records events that affect performance and slow query logs.
  • SQL_INTERNAL_PERF: This logs internal details of CockroachDB that will be useful during troubleshooting.
  • TELEMETRY: This is used to log telemetry events.

Next, we will discuss sending logs to an external sink.

Emitting logs to an external sink

A sink is an external resource that is capable of receiving data. This external resource sits outside of CockroachDB and is deployed and maintained separately. It is possible to route messages from one or more log channels to an external log sink. This can be used for alerting, for providing an aggregated view of the logs across all the nodes in the cluster, and for further analysis of the log data, which can be useful in automating responses to unexpected or fatal events. Sinks include log files, Fluentd-compatible servers, HTTP servers, and standard error streams. With appropriate configuration, it is possible to redact sensitive data while sending it to the sink. Sinks support the following parameters:

  • filter: The minimum severity level.
  • format: The log format.
  • redact: If true, redacts sensitive data such as personally identifiable information (PII) from log messages.
  • redactable: Retains redaction markers around sensitive fields.
  • exit-on-error: If true, stops the CockroachDB node if it's unable to send log messages to the sink.
  • auditable: If true, enables exit-on-error and disables buffered writes, so each log entry is flushed and synchronized as it is written, and the CockroachDB node stops if it's unable to send log messages to the sink.
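
Channels and sink parameters are tied together in a logging configuration that is passed to the node at startup. The following is a minimal sketch, assuming a CockroachDB version that supports a YAML logging configuration and a Fluentd-compatible collector listening at the given address; the sink name, channel selection, and address are assumptions for illustration:

# logging.yaml: route security-relevant channels to a Fluentd-compatible sink
sinks:
  fluent-servers:
    infosec:
      channels: [USER_ADMIN, PRIVILEGES, SENSITIVE_ACCESS]
      address: 127.0.0.1:5170
      filter: INFO
      redact: true
      exit-on-error: false

The file can then be supplied when starting the node, for example with cockroach start ... --log-config-file=logging.yaml.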

Additionally, it is advisable to go through technical advisories that report major issues in CockroachDB related to security and correctness. These can be accessed at https://www.cockroachlabs.com/docs/advisories/index.html. Also, Cockroach Labs keeps updating newly seen issues and resolution details at https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#.

Gathering Cockroach debug logs

cockroach debug zip connects to the cluster and collects the debug information from all the nodes in the cluster. So, when you start troubleshooting, it would be helpful to run this to collect all the relevant logs. Following is the syntax of the debug command:

cockroach debug zip <path_to_store_the_zip_file> { flags }

The following is an example of the debug command:

$ cockroach debug zip ./cockroach-db/logs/debug.zip --insecure --host=localhost:26258

Following are some of the important options that you have with the cockroach debug zip command:

  • --redact-logs: Redacts sensitive information from the logs.
  • --timeout: Times out the command with an error message if it doesn't complete within the stipulated time.  
  • --files-from and --files-until: You can filter the logs based on time. For example, --files-from='2021-01-01 12:00' and --files-until='2021-06-30 12:00'.
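
For example, the following command (the host, output path, and time window are placeholders) collects a redacted debug zip limited to a specific period:

$ cockroach debug zip ./debug.zip --insecure --host=localhost:26257 \
    --redact-logs --files-from='2021-01-01 12:00' --files-until='2021-06-30 12:00'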

For complete options, please check cockroach debug zip --help.

Information such as CPU usage, metrics, schema metadata, node health, stack traces, events, and jobs is collected from all the nodes.

Next, we will learn about connection issues and how to ensure we don't run into one.

Connection issues

Connection refused is one of the most common connection issues, seen when a PostgreSQL-compatible SQL client is unable to connect to a CockroachDB node. Many things can go wrong here, such as expired certificates or firewall settings. Following are some general guidelines on what to look for in terms of connection-related configurations:

  • Check that the CockroachDB node is running and listening on the correct port.
  • Check the connection details, especially the hostname and port number.
  • If the node is running in secure mode, you have to make sure appropriate certificates and keys are generated and are being passed correctly during the connection.
  • Make sure the client certificate is present and not expired.
  • If the node is already running and the host and port are correct, you can try restarting the node and see if that helps with the connection issue.
  • Check for firewall rules that prohibit specific inbound and outbound traffic.
  • Make sure the node is reachable using ping and the port can be accessed using debugging tools such as netcat or telnet.
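
As a minimal sketch of the last two checks (the host name, port, and certificate directory are assumptions for illustration), you can verify reachability and then attempt a trivial SQL connection:

# Check that the node is reachable and the SQL port is open
$ ping -c 3 cockroach-node-1
$ nc -zv cockroach-node-1 26257

# Attempt a minimal connection; in secure mode, --certs-dir must point to valid certificates
$ cockroach sql --certs-dir=certs --host=cockroach-node-1:26257 -e 'SELECT 1;'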

In case of connection issues, we have to make sure that the client and the nodes of a cluster can talk to each other securely.

In the next section, we will learn how to track slow queries.

Tracking slow queries

It is important to know which queries are not performing as expected so that we can further investigate to find the actual cause for the slowness. Some queries can become slow when the data grows. So, it is important to benchmark some of the critical queries over time to make sure they are getting executed within the expected time. There are three major ways to identify and debug slow queries:

  • Use EXPLAIN and see whether the query plan involves full scans; if so, see what kind of indexes you can create to avoid them (a short example follows this list). This holds true for table joins as well.
  • Turn on the slow query log for a specific latency threshold. In the following example, all queries whose latency is greater than or equal to 500 milliseconds will be logged in the slow query log:

    > SET CLUSTER SETTING sql.log.slow_query.latency_threshold = '500ms';

    SET CLUSTER SETTING

    Time: 120ms total (execution 120ms / network 0ms)

  • Integrate CockroachDB logging with tools such as OpenTelemetry and Jaeger that provide advanced tracing and spanning capabilities. This helps us with understanding all the network hops that happen during a query and also how much time individual units of work are taking. Whichever function or network hop takes more time could be a potential candidate to investigate further.
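
As a quick illustration of the EXPLAIN approach mentioned in the first item, the following hypothetical example (the orders table and its customer_id column are assumptions) checks whether the query plan contains a full table scan and then adds a secondary index to avoid it:

> EXPLAIN SELECT * FROM orders WHERE customer_id = 1001;

> CREATE INDEX ON orders (customer_id);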

A slow query can be caused by various issues that are discussed in the rest of the sections in this chapter. So, once you identify a slow query, you have to identify which part of the query is slow and investigate further till you find all possible reasons.  

Next, we are going to go over capacity planning and the recommended hardware and software configurations for CockroachDB nodes.

Capacity planning

We have to make sure individual nodes and the entire cluster has optimal hardware and software configuration in order to perform better. Following is a set of recommended configurations for CockroachDB:

  • Operating system: Linux distributions such as Ubuntu, Red Hat Enterprise Linux, and CentOS are the preferred operating systems. If you deploy CockroachDB in containers, a container-optimized OS, which, as the name suggests, is an operating system optimized for running Docker containers, is also a suitable choice.
  • Node configuration: Following are node configurations for optimal performance:
    • CPU: At least four vCPUs per node. For more throughput per node, you can increase the number of vCPUs.   
    • Memory: 4 GiB RAM per vCPU.
    • IOPS (input/output operations per second): 500 per vCPU.
    • Disk I/O in MiB/second: 30 MiB/second per vCPU.
  • Storage
    • 150 GiB per vCPU.
    • 2.5 TiB max per node.
    • Dedicated volumes used only for CockroachDB are preferred.

VM capacities differ between cloud providers. So, based on the preceding recommendations, you can choose the appropriate VM for your workloads.

It's important to monitor the CPU and storage metrics for all the nodes and to have some automation in place in order to expand the cluster by adding more nodes. Although there is no hard limit, you might be better off keeping the maximum number of nodes in a cluster to 50. Once you hit this threshold, you can add a second separate cluster and introduce an external sharding mechanism to decide which cluster to redirect to for a given query.

In the next section, we will go through configuration-related issues.

Configuration issues

Things can go wrong because of misconfigurations such as using unavailable ports and configuring the incorrect storage directory at the start. CockroachDB provides several cluster-level settings. In most cases, leaving the default values might work better. If you are modifying any default value, please ensure it's tested at scale in a pre-production environment. Following are some of the commonly seen configuration-related issues:

  • Storage directory already exists: This happens when you try to start a node with a storage directory that has already been used for some other CockroachDB process. In this case, you can either choose a different directory or delete the contents of the existing directory.
  • Port is already in use: This happens when there is some other process that is already using a given port with which you are trying to start a CockroachDB process. You can either kill the process that is using that port, provided that process is no longer required, or you can also pick a different port that is available.
  • Clock synchronization error: Whenever the clock of a node goes out of sync with the clocks of at least half of the other nodes in the cluster by more than 500 milliseconds (the default threshold), the node shuts itself down. In order to avoid this, you can use an external Network Time Protocol (NTP) service.
  • Open file descriptor limit: Since CockroachDB opens a large number of file descriptors, it expects a node to allow at least a minimum number of open file descriptors, which is 1,956. The recommendation is to set the limit to unlimited, as shown in the following example.
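
The following is a minimal sketch of raising the limit; the exact value and the way you persist it across reboots (for example, via /etc/security/limits.conf or systemd unit settings) are OS-specific assumptions:

# Check the current open file descriptor limit
$ ulimit -n

# Raise it for the current session before starting CockroachDB
$ ulimit -n 65536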

There are many other issues that you can face related to configuration. Based on the error message, you have to further investigate and decide the appropriate solution.

In the next section, we will go over general guidelines for upgrading a CockroachDB cluster.

Guidelines to avoid issues during an upgrade

When upgrading your cluster, the following are some general guidelines:

  • Before you upgrade to a new version, it would be good to wait at least 2–3 months after its release to make sure it's stable enough, doesn't contain too many bugs, and has not been withdrawn for security or functional issues.
  • Once you decide the version you want to upgrade to, go through the release notes and make sure you can upgrade to that version from your current version. Sometimes you might have to upgrade in multiple steps to reach the desired version.
  • For certain versions, it's not possible to downgrade back to the previous version if things go wrong once the upgrade is auto-finalized. In that case, your only option is to discard the current cluster and create a new cluster from the backups, so make sure all the data is backed up before you start the upgrade. As an alternative, you can disable auto-finalization with the following command:

    SET CLUSTER SETTING cluster.preserve_downgrade_option = '<current_cluster_version>';

However, you have to make sure you manually re-enable auto-finalization after ensuring the cluster is stable and there is no data corruption. The following command can be used for re-enabling auto-finalization:

RESET CLUSTER SETTING cluster.preserve_downgrade_option;

  • You can perform a rolling upgrade by upgrading one node at a time and letting the upgraded node rejoin the original cluster. Before you upgrade the next node, make sure the cluster is healthy.
  • Make sure you are not decommissioning any nodes during the upgrade. This can result in multiple failures.
  • It is also advisable not to do any schema changes.
  • After the upgrade is complete, you can execute SHOW CLUSTER SETTING version to ensure it shows the newly upgraded version (see the example after this list).
  • If anything goes wrong during the upgrade, your best option is to run the 'cockroach debug zip' command in order to gather all the information about the cluster and go through the failures.
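
As a quick check, assuming the upgrade finalized successfully, the output of the following statement should show the version you upgraded to:

> SHOW CLUSTER SETTING version;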

In the next section, we will discuss network latency.

Network latency

In the admin user interface, you have a page that shows network latencies within the cluster. It has information about the round-trip latencies between each pair of nodes in the cluster as shown in the following screenshot.

Figure 10.1 – Network latency page in the admin user interface

Network latency plays an important role when you are deciding on your replication strategy and topology patterns. Please refer to Chapter 4, Geo-Partitioning, to go over all the topology patterns available in CockroachDB. Based on the criticality of the data, you also have to decide whether you can tolerate zonal or regional failures.

Also, even before you decide the region and availability zone for CockroachDB nodes, it would be good to go through inter-region network latencies for different cloud providers. If you are planning to have your CockroachDB cluster span multiple cloud providers, you may want to choose geographically close regions for your second and third cloud providers in order to ensure the least network latency, since the nodes will then be geographically co-located. Companies such as Aviatrix also publish inter-cloud latencies for various regions.

In the last section, we will go over some of the advanced debugging options.

Advanced debugging options

In the admin user interface, you have a page for advanced debugging, which, as the name indicates, shows advanced information about a cluster that can be useful in troubleshooting issues. The Advanced Debugging page has the following sections:

  • Reports
  • Configuration
  • Even more advanced debugging

The following screenshot shows how the Advanced Debugging page looks on the user interface:

Figure 10.2 – Advanced Debugging page in the admin user interface

Under Reports you can get a report for the following items:

  • Custom Time Series Chart: You can create a custom chart of the time series data.
  • Problem Ranges: You can view ranges in your cluster that are unavailable, under-replicated, slow, or have other problems.
  • Data Distribution and Zone Configs: You can view the distribution of table data across nodes and verify the zone configuration.
  • Statement Diagnostics History: You can view the history of statement diagnostics requests.

The Configuration section shows the following items:

  • Cluster Settings: You can check all the cluster-level configurations here.
  • Localities: This shows node locality, including address and location.

The Even More Advanced Debugging section is meant for the developers of CockroachDB who understand its internal implementation. It contains a lot of very detailed information and, in general, can be skipped while debugging issues unless you understand the internals of CockroachDB.

Summary

In this chapter, we went through several resources that are available for debugging and troubleshooting issues. Getting yourself familiarized with the debug logs and the admin user interface is an important step if you want to quickly diagnose and root cause issues. Also, being aware of cluster configurations, deployment strategies, network latencies, and security settings comes in handy while troubleshooting issues. Using tools for log aggregation, distributed tracing, tracking spans, alerting, and advanced metrics can further aid you.

In the next chapter, we will learn about performance benchmarking and migration from other databases to CockroachDB.
