We have covered a lot of errors and problems that you need to troubleshoot in a typical OpenStack installation. In this final chapter, we want to cover some of the chronic issues that might be early signs of trouble. This chapter is more about prevention and aims to help you avoid emergency troubleshooting as much as possible.
We will be looking at the following topics in this chapter:
As we have seen throughout this book, many OpenStack services make heavy use of databases. Production deployments typically use MySQL or Postgres as a backend database server. As you have learned, a failing or misconfigured database will quickly lead to trouble in your OpenStack cluster. Database problems can also present more subtle concerns that may grow into huge problems if neglected.
The database server becomes a single point of failure if it is not deployed in a highly available configuration. OpenStack does not require a high-availability installation of your database, and as a result, many installations skip this step. However, production deployments of OpenStack should ensure that their database can survive the failure of a single database server.
For installations that use the MySQL database engine, there are several options that can be used to cluster your installation. One popular method is to leverage Galera Cluster (http://galeracluster.com/). Galera Cluster for MySQL leverages synchronous replication and provides a multi-master cluster, which offers high availability for your OpenStack databases.
Installations that use the Postgres database engine have several options for high availability, load balancing, and replication. These include block device replication with DRBD, log shipping, trigger-based master-standby replication, statement-based replication, and asynchronous multi-master replication. For details, refer to the Postgres high-availability guide (http://www.postgresql.org/docs/current/static/high-availability.html).
Database performance is one of those metrics that can degrade over time. If administrators do not pay attention, small problems in this area can eventually become large ones. A wise administrator will monitor database performance continuously, watching for slow queries, high database load, and other indications of trouble.
There are several options to monitor your MySQL server, some of which are commercial and many of which are open source. Administrators should evaluate the options available and select a solution that fits their current set of tools and operating environment. There are several performance metrics you will want to monitor. Some of them are discussed in the following sections.
The MySQL SHOW STATUS statement can be executed from the mysql command prompt. Its output is server status information, with over 300 variables reported. To narrow down this information, you can add a LIKE clause that matches on the variable name to display only the variables you are interested in. Here is an abbreviated list of the output returned by SHOW STATUS:
mysql> SHOW STATUS;
+------------------------------------------+-------------+
| Variable_name                            | Value       |
+------------------------------------------+-------------+
| Aborted_clients                          | 29          |
| Aborted_connects                         | 27          |
| Binlog_cache_disk_use                    | 0           |
| Binlog_cache_use                         | 0           |
| Binlog_stmt_cache_disk_use               | 0           |
| Binlog_stmt_cache_use                    | 0           |
| Bytes_received                           | 614         |
| Bytes_sent                               | 33178       |
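To make the LIKE filtering concrete, a statement such as SHOW STATUS LIKE 'Aborted%'; returns only the variables whose names begin with Aborted. The following is a minimal Python sketch that applies a similar filter to output you have already captured from the mysql client. The function names and the sample text are illustrative assumptions, and the prefix match only approximates what LIKE does on the server.

```python
def parse_status_table(text):
    """Parse tabular mysql client output into {variable_name: value}."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        # Data rows look like "| Name | Value |"; skip borders and the prompt.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0] == "Variable_name":
            continue
        status[cells[0]] = cells[1]
    return status


def filter_by_prefix(status, prefix):
    """Roughly mimic SHOW STATUS LIKE 'prefix%' on an already-parsed result."""
    return {name: value for name, value in status.items()
            if name.startswith(prefix)}


# Sample text copied from the abbreviated SHOW STATUS output above.
SAMPLE = """\
mysql> SHOW STATUS;
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Aborted_clients  | 29    |
| Aborted_connects | 27    |
| Bytes_received   | 614   |
| Bytes_sent       | 33178 |
"""

status = parse_status_table(SAMPLE)
print(filter_by_prefix(status, "Aborted"))
```

In practice it is usually simpler to let the server do the filtering with the LIKE clause; parsing captured output is mainly useful when you feed the values into a monitoring system.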
Mytop is a command-line utility inspired by the Linux top command. Mytop retrieves data from the MySQL SHOW PROCESSLIST and SHOW STATUS commands. Data from these commands is refreshed, processed, and displayed in the output of the Mytop command. The Mytop output includes a header, which contains summary data, followed by a thread section.
Here is an example of the header output from the Mytop command:
MySQL on localhost (5.5.46)    load 1.01 0.85 0.79 4/538 23573 up 5+02:19:24 [14:35:24]
 Queries: 3.9M     qps:    9 Slow:     0.0         Se/In/Up/De(%):    49/00/08/00
 Sorts:      0 qps now:   10 Slow qps: 0.0  Threads:   30 (   1/   4) 40/00/12/00
 Cache Hits: 822.0  Hits/s:  0.0 Hits now:   0.0  Ratio:  0.0%  Ratio now:  0.0%
 Key Efficiency: 97.3%  Bps in/out:  1.7k/ 3.1k   Now in/out:  1.0k/ 3.9k
As demonstrated in the preceding output, the header section for the Mytop command includes information such as the server version, system load, uptime, query totals and rates, and the percentage of Select, Insert, Update, and Delete queries.

The Mytop thread section will list as many threads as can be displayed. The threads are ordered by the Time column, which displays each thread's idle time:
Id      User       Host/IP          DB         Time    Cmd    State  Query
--      ----       -------          --         ----    ---    -----  ----------
3461    neutron    174.143.201.98   neutron    5680    Sleep
3477    glance     174.143.201.98   glance     1480    Sleep
3491    nova       174.143.201.98   nova       880     Sleep
3512    nova       174.143.201.98   nova       281     Sleep
3487    keystone   174.143.201.98   keystone   280     Sleep
3489    glance     174.143.201.98   glance     280     Sleep
3511    keystone   174.143.201.98   keystone   280     Sleep
3513    neutron    174.143.201.98   neutron    280     Sleep
3505    keystone   174.143.201.98   keystone   279     Sleep
3514    keystone   174.143.201.98   keystone   141     Sleep
...
The Mytop thread section displays the ID of each thread, followed by the user and host. It then displays the database, the idle time, the command, and the state and query, if any. Mytop allows you to keep an eye on the performance of your MySQL database server.
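If you capture the thread section as text, a short script can pick out connections that have been idle for a long time. The following is a minimal Python sketch, not part of Mytop itself. It assumes whitespace-separated rows in the layout shown above, that every row has a database name, and that the Time column is in seconds, as SHOW PROCESSLIST reports it. The function names and the 1000-second threshold are hypothetical.

```python
from collections import namedtuple

Thread = namedtuple("Thread", "id user host db time cmd")


def parse_threads(text):
    """Parse rows of the thread section into Thread tuples."""
    threads = []
    for line in text.splitlines():
        fields = line.split()
        # Thread rows start with a numeric Id; this skips the header,
        # the dashed separator, and the trailing "..." line.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        if not fields[4].isdigit():
            continue  # Row does not match the assumed layout.
        threads.append(Thread(int(fields[0]), fields[1], fields[2],
                              fields[3], int(fields[4]), fields[5]))
    return threads


def long_idle(threads, max_idle_seconds):
    """Return sleeping threads that have been idle longer than the limit."""
    return [t for t in threads
            if t.cmd == "Sleep" and t.time > max_idle_seconds]


# First rows of the thread section shown above.
SAMPLE = """\
Id User Host/IP DB Time Cmd State Query
-- ---- ------- -- ---- --- ----- ----------
3461 neutron 174.143.201.98 neutron 5680 Sleep
3477 glance 174.143.201.98 glance 1480 Sleep
3491 nova 174.143.201.98 nova 880 Sleep
3512 nova 174.143.201.98 nova 281 Sleep
"""

threads = parse_threads(SAMPLE)
for thread in long_idle(threads, max_idle_seconds=1000):
    print(thread.id, thread.user, thread.time)
```

Long-idle connections are not necessarily a problem, since OpenStack services keep pooled connections open, but a steadily growing count is worth investigating.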
Percona Toolkit is a very useful set of command-line tools that are used to perform MySQL operations and system tasks. The toolkit can be downloaded from Percona at https://www.percona.com/downloads/percona-toolkit/. The output from these tools can be fed into your monitoring system, allowing you to effectively monitor your MySQL installation.
Like MySQL, Postgres has a series of tools that can be leveraged to monitor database performance. In addition to standard Linux troubleshooting tools, such as top and ps, Postgres also offers its own collection of statistics.
The statistics collector in Postgres allows you to collect data related to a server's activity. The statistics it collects are varied and may be helpful for troubleshooting or system monitoring. In order to leverage the statistics collector, you must turn on the functionality in the postgresql.conf file. The settings are commented out by default in the RUNTIME STATISTICS section of the configuration file. Uncomment the lines in the Query/Index Statistics Collector subsection:
#------------------------------------------------------------------------------
# RUNTIME STATISTICS
#------------------------------------------------------------------------------

# - Query/Index Statistics Collector -

track_activities = on
track_counts = on
track_io_timing = off
track_functions = none                  # none, pl, all
track_activity_query_size = 1024        # (change requires restart)
update_process_title = on
stats_temp_directory = 'pg_stat_tmp'
Once the statistics collector is configured, restart the database server or run pg_ctl reload for the configuration to take effect. Note that a setting marked as requiring a restart, such as track_activity_query_size, only changes after a full restart. A series of views whose names begin with the prefix pg_stat can then be queried for relevant statistics from the Postgres database server.
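As one illustration of what these views can tell you, pg_stat_database reports blks_hit and blks_read for each database, from which you can derive a buffer cache hit ratio. The following Python sketch shows only the arithmetic. The query text, the function names, the 0.90 threshold, and the sample rows are assumptions made for illustration, and fetching the rows with a database driver is left out.

```python
# Query you might run with psql or a database driver (not executed here).
HIT_RATIO_QUERY = """
SELECT datname, blks_read, blks_hit
FROM pg_stat_database;
"""


def cache_hit_ratio(blks_hit, blks_read):
    """Fraction of block requests served from the buffer cache."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # No activity recorded yet.
    return blks_hit / total


def low_hit_databases(rows, threshold=0.90):
    """Return (datname, ratio) for databases below the threshold."""
    flagged = []
    for datname, blks_read, blks_hit in rows:
        ratio = cache_hit_ratio(blks_hit, blks_read)
        if ratio is not None and ratio < threshold:
            flagged.append((datname, round(ratio, 3)))
    return flagged


# Hypothetical rows in (datname, blks_read, blks_hit) order.
rows = [("nova", 500, 9500), ("glance", 400, 600), ("keystone", 0, 0)]
print(low_hit_databases(rows))
```

A ratio that falls over time may indicate that the working set no longer fits in memory, which is the kind of slow degradation this chapter encourages you to watch for.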
A diligent operator will be sure to take a backup of the database for each OpenStack project. Since most OpenStack services make heavy use of the database to persist things such as state and metadata, corruption or loss of data could render your OpenStack cloud unusable. Current database backups can help rescue you from this fate. MySQL users can use the mysqldump utility to take a backup of all OpenStack databases:
mysqldump --opt --all-databases > all_openstack_dbs.sql
Similarly, Postgres users can take a backup of all OpenStack databases with a command similar to the one shown here:
pg_dumpall > all_openstack_dbs.sql
Your cadence for backups will depend on your environment and tolerance for data corruption or loss. You should store these backups in a safe place and occasionally perform test restores from them to ensure that they work as expected.
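If you schedule backups, it helps to timestamp each dump and to keep only the most recent few. The following Python sketch builds on the mysqldump command shown earlier. The directory, the file naming scheme, the retention count, and the function names are all hypothetical, and the sketch only constructs the command and the prune list rather than running anything. It uses the mysqldump --result-file option in place of shell redirection and assumes that credentials come from your MySQL client configuration.

```python
import datetime
import os

PREFIX = "all_openstack_dbs-"


def backup_path(directory, now):
    """Build a timestamped dump path, for example all_openstack_dbs-20160115-023000.sql."""
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return os.path.join(directory, f"{PREFIX}{stamp}.sql")


def mysqldump_command(path):
    """Argument list equivalent to: mysqldump --opt --all-databases > path."""
    return ["mysqldump", "--opt", "--all-databases", f"--result-file={path}"]


def backups_to_prune(filenames, keep):
    """Return the dumps to delete, keeping the `keep` most recent ones."""
    dumps = sorted(name for name in filenames
                   if name.startswith(PREFIX) and name.endswith(".sql"))
    # The timestamp format sorts chronologically, so the oldest come first.
    return dumps[:-keep] if keep > 0 else dumps


# Hypothetical backup directory and a fixed time for a repeatable example.
path = backup_path("/var/backups/openstack",
                   datetime.datetime(2016, 1, 15, 2, 30, 0))
print(mysqldump_command(path))
```

You could pass the resulting list to subprocess.run with check=True from a scheduled job, and remove the files returned by backups_to_prune once the new dump has completed successfully.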
Monitoring is often your early warning system that something is going wrong in your cluster. Your monitoring system can also be a rich source of information when the time comes to troubleshoot issues with the cluster. There are multiple options available to monitor OpenStack. Many of your current application monitoring platforms will handle OpenStack just as well as any other Linux system. Regardless of the tool you select to do your monitoring, there are several parts of OpenStack you should focus on.
OpenStack is typically deployed on a series of Linux servers. Monitoring the resources on those servers is essential. A set-it-and-forget-it attitude is a recipe for disaster. Things you may want to monitor on your host servers include the following:
OpenStack operators have the option of setting usage quotas for each tenant/project. As an administrator, it is helpful to monitor each project's usage against these quotas. Once users reach a quota, they may not be able to deploy additional resources. Users may misinterpret this as an error in the system and report it to you as such. By keeping an eye on the quotas, you can proactively warn users as they approach their thresholds, or you can decide to increase the quotas as appropriate. Some of the services have client commands that can be used to retrieve quota statistics. As an example, take a look at the nova absolute-limits command here:
nova absolute-limits
+--------------------+------+-------+
| Name               | Used | Max   |
+--------------------+------+-------+
| Cores              | 1    | 20    |
| FloatingIps        | 0    | 10    |
| ImageMeta          | -    | 128   |
| Instances          | 1    | 10    |
| Keypairs           | -    | 100   |
| Personality        | -    | 5     |
| Personality Size   | -    | 10240 |
| RAM                | 512  | 51200 |
| SecurityGroupRules | -    | 20    |
| SecurityGroups     | 1    | 10    |
| Server Meta        | -    | 128   |
| ServerGroupMembers | -    | 10    |
| ServerGroups       | 0    | 10    |
+--------------------+------+-------+
The absolute-limits command in Nova is useful because it displays the project's current usage alongside the quota maximum, making it easy to spot when a project/tenant is close to its limit.
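To turn this output into an early warning, you could parse the table and flag any resource whose usage is at or above a chosen fraction of its maximum. The following is a minimal Python sketch that works on captured command output. The function names, the 80 percent threshold, and the sample values are assumptions for illustration, and rows without a numeric Used or Max value are skipped.

```python
def parse_limits(text):
    """Parse nova absolute-limits output into {name: (used, maximum)}."""
    limits = {}
    for line in text.splitlines():
        line = line.strip()
        # Data rows look like "| Name | Used | Max |"; skip table borders.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3 or cells[0] == "Name":
            continue
        name, used, maximum = cells
        # A "-" in the Used column means no usage figure is reported.
        if used.isdigit() and maximum.isdigit():
            limits[name] = (int(used), int(maximum))
    return limits


def near_quota(limits, threshold=0.8):
    """Return the names of resources at or above the usage threshold."""
    return sorted(name for name, (used, maximum) in limits.items()
                  if maximum > 0 and used / maximum >= threshold)


# Hypothetical output in the same layout as the table above.
SAMPLE = """\
+-----------+------+-------+
| Name      | Used | Max   |
+-----------+------+-------+
| Cores     | 18   | 20    |
| Instances | 1    | 10    |
| Keypairs  | -    | 100   |
| RAM       | 512  | 51200 |
+-----------+------+-------+
"""

print(near_quota(parse_limits(SAMPLE)))
```

Running a check like this per project on a schedule lets you contact users before they hit a quota, rather than after they report a failed deployment.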