Chapter 13. Logging and Monitoring

Because an OpenStack cloud is composed of many different services, it produces a large number of log files. This chapter helps you locate and work with them, and describes other ways to track the status of your deployment.

Where Are the Logs?

On Ubuntu, most services use the convention of writing their log files to subdirectories of the /var/log directory.

Cloud Controller

Service                  Log Location
nova-*                   /var/log/nova
glance-*                 /var/log/glance
cinder-*                 /var/log/cinder
keystone                 /var/log/keystone
horizon                  /var/log/apache2/
misc (Swift, dnsmasq)    /var/log/syslog

Compute Nodes

libvirt: /var/log/libvirt/libvirtd.log

Console (boot up messages) for VM instances: /var/lib/nova/instances/instance-<instance id>/console.log

Block Storage Nodes

cinder: /var/log/cinder/cinder-volume.log

How to Read the Logs

OpenStack services use the standard logging levels, in order of increasing severity: DEBUG, INFO, AUDIT, WARNING, ERROR, CRITICAL, and TRACE. A message appears in the logs only if its level is at least as severe as the configured log level, so DEBUG lets every log statement through. For example, with the level set to INFO, informational messages and everything more severe are logged, while TRACE messages are written only when the software produces a stack trace.

To disable DEBUG-level logging, edit /etc/nova/nova.conf:

debug=false

Keystone is handled a little differently. To modify the logging level, edit the /etc/keystone/logging.conf file and look at the logger_root and handler_file sections.

Logging for Horizon is configured in /etc/openstack_dashboard/local_settings.py. As Horizon is a Django web application, it follows the Django Logging (https://docs.djangoproject.com/en/dev/topics/logging/) framework conventions.

The first step in finding the source of an error is typically to search for a CRITICAL, TRACE, or ERROR message in the log starting at the bottom of the log file.

An example of a CRITICAL log message, with the corresponding TRACE (Python traceback) immediately following:

2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder Traceback (most recent call last):
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/bin/cinder-volume", line 48, in <module>
2013-02-25 21:05:51 17409 TRACE cinder service.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 422, in wait
2013-02-25 21:05:51 17409 TRACE cinder _launcher.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 127, in wait
2013-02-25 21:05:51 17409 TRACE cinder service.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 166, in wait
2013-02-25 21:05:51 17409 TRACE cinder return self._exit_event.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2013-02-25 21:05:51 17409 TRACE cinder return hubs.get_hub().switch()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 177, in switch
2013-02-25 21:05:51 17409 TRACE cinder return self.greenlet.switch()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 192, in main
2013-02-25 21:05:51 17409 TRACE cinder result = function(*args, **kwargs)
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 88, in run_server
2013-02-25 21:05:51 17409 TRACE cinder server.start()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 159, in start
2013-02-25 21:05:51 17409 TRACE cinder self.manager.init_host()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 95, in init_host
2013-02-25 21:05:51 17409 TRACE cinder self.driver.check_for_setup_error()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/volume/driver.py", line 116, in check_for_setup_error
2013-02-25 21:05:51 17409 TRACE cinder raise exception.VolumeBackendAPIException(data=exception_message)
2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder

In this example, cinder-volume failed to start and provided a stack trace because its volume back end was unable to set up the storage volume, probably because the LVM volume group that the configuration expects does not exist.

An example error log:

2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 23 seconds.

In this error, a nova service has failed to connect to the RabbitMQ server, because it got a connection refused error.
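The bottom-up search for CRITICAL, TRACE, or ERROR messages described earlier in this section can be sketched in Python. This is a minimal illustration, not part of OpenStack: the function name last_severe_entries and the regular expression are assumptions based only on the line layout of the examples above (date, time, process ID, level, logger name, message), and your log format may differ.

```python
import re

# Assumed layout, taken from the example log lines in this section:
# date, time, process ID, level, logger name, optional message.
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) ?(?P<message>.*)$"
)

SEVERE_LEVELS = {"ERROR", "CRITICAL", "TRACE"}


def last_severe_entries(lines, limit=5):
    """Return up to `limit` severe entries as dicts, newest first."""
    found = []
    # Walk the log from the bottom, as recommended in the text.
    for line in reversed(lines):
        match = LINE_RE.match(line.rstrip("\n"))
        if match and match.group("level") in SEVERE_LEVELS:
            found.append(match.groupdict())
            if len(found) == limit:
                break
    return found
```

To use it, read a log file into a list of lines (for example with readlines() on /var/log/cinder/cinder-volume.log) and pass the list in. Reading the whole file into memory keeps the sketch short but may be unsuitable for very large logs.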

Tracing Instance Requests

When an instance fails to behave properly, you will often have to trace activity associated with that instance across the log files of various nova-* services, and across both the cloud controller and compute nodes.

The typical way is to trace the UUID associated with an instance across the service logs.

Consider the following example:

ubuntu@initial:~$ nova list
+--------------------------------------+--------+--------+---------------------------+
| ID                                   | Name   | Status | Networks                  |
+--------------------------------------+--------+--------+---------------------------+
| faf7ded8-4a46-413b-b113-f19590746ffe | cirros | ACTIVE | novanetwork=192.168.100.3 |
+--------------------------------------+--------+--------+---------------------------+

Here the ID associated with the instance is faf7ded8-4a46-413b-b113-f19590746ffe. If you search for this string on the cloud controller in the /var/log/nova/nova-*.log files, it appears in nova-api.log and nova-scheduler.log. If you search for it on the compute nodes in /var/log/nova/nova-*.log, it appears in nova-network.log and nova-compute.log. If no ERROR or CRITICAL messages appear, the most recent log entry that mentions the ID may provide a hint about what has gone wrong.
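Searching several log files for one instance ID can be scripted. The following Python sketch is only an illustration: the function name find_instance_in_logs is hypothetical, and the default directory assumes the /var/log/nova location listed earlier in this chapter.

```python
import glob
import os


def find_instance_in_logs(instance_id, log_dir="/var/log/nova"):
    """Map each log file name to the lines that mention instance_id."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(log_dir, "*.log"))):
        # errors="replace" avoids failing on stray non-UTF-8 bytes.
        with open(path, errors="replace") as handle:
            matching = [line.rstrip("\n") for line in handle
                        if instance_id in line]
        if matching:
            hits[os.path.basename(path)] = matching
    return hits
```

Run it on the cloud controller and on each compute node, then look at the last line returned for each file; that is the most recent entry mentioning the instance.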

Adding Custom Logging Statements

If there is not enough information in the existing logs, you may need to add your own custom logging statements to the nova-* services.

The source files are located in /usr/lib/python2.7/dist-packages/nova.

To add logging statements, the following lines should be near the top of the file. In most files, they are already there:

from nova.openstack.common import log as logging
LOG = logging.getLogger(__name__)

To add a DEBUG logging statement, you would do:

LOG.debug("This is a custom debugging statement")

You may notice that all of the existing logging messages are preceded by an underscore and surrounded by parentheses, for example:

LOG.debug(_("Logging statement appears here"))

This is used to support translation of logging messages into different languages using the gettext (http://docs.python.org/2/library/gettext.html) internationalization library. You don’t need to do this for your own custom log messages. However, if you want to contribute code that includes logging statements back to the OpenStack project, you must wrap your log messages in the underscore and parentheses.

RabbitMQ Web Management Interface or rabbitmqctl

Aside from connection failures, RabbitMQ log files are generally not useful for debugging OpenStack related issues. Instead, we recommend you use the RabbitMQ web management interface. Enable it on your cloud controller:

# /usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management
# service rabbitmq-server restart

The RabbitMQ web management interface is accessible on your cloud controller at http://localhost:55672.

Ubuntu 12.04 installs RabbitMQ version 2.7.1, which uses port 55672. RabbitMQ versions 3.0 and above use port 15672 instead. You can check which version of RabbitMQ you have running on your local Ubuntu machine by doing:

$ dpkg -s rabbitmq-server | grep "Version:"
Version: 2.7.1-0ubuntu4

An alternative to enabling the RabbitMQ web management interface is to use the rabbitmqctl commands. For example, rabbitmqctl list_queues | grep cinder displays any messages left in the cinder queues. If messages remain, it is a possible sign that the cinder services did not connect properly to RabbitMQ and might have to be restarted.

Items to monitor for RabbitMQ include the number of items in each of the queues and the processing time statistics for the server.
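Queue depth can be extracted from rabbitmqctl output with a few lines of Python. The sketch below assumes that each queue line of rabbitmqctl list_queues consists of the queue name and the message count separated by a tab, and that any other lines (such as a banner) can be ignored. The function name queues_with_backlog and the threshold parameter are hypothetical.

```python
def queues_with_backlog(list_queues_output, threshold=0):
    """Return {queue_name: message_count} for queues above threshold."""
    backlog = {}
    for line in list_queues_output.splitlines():
        fields = line.split("\t")
        # Skip lines that are not "name<TAB>count" pairs, such as banners.
        if len(fields) != 2 or not fields[1].strip().isdigit():
            continue
        count = int(fields[1])
        if count > threshold:
            backlog[fields[0]] = count
    return backlog
```

Feed it the captured output of rabbitmqctl list_queues. A non-empty result for the cinder queues corresponds to the "messages left in the queue" condition described above.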

Centrally Managing Logs

Because your cloud is most likely composed of many servers, you must check logs on each of those servers to properly piece an event together. A better solution is to send the logs of all servers to a central location so they can all be accessed from the same area.

Ubuntu uses rsyslog as the default logging service. Because it can natively send logs to a remote location, you don’t have to install anything extra to enable this feature; just modify the configuration file. When doing so, consider running your logging over a management network or using an encrypted VPN to avoid interception.

rsyslog Client Configuration

To begin, configure all OpenStack components to log to syslog in addition to their standard log file location. Also configure each component to log to a different syslog facility. This makes it easier to split the logs into individual components on the central server.

nova.conf:

use_syslog=True
syslog_log_facility=LOG_LOCAL0

glance-api.conf and glance-registry.conf:

use_syslog=True
syslog_log_facility=LOG_LOCAL1

cinder.conf:

use_syslog=True
syslog_log_facility=LOG_LOCAL2

keystone.conf:

use_syslog=True
syslog_log_facility=LOG_LOCAL3

Swift

By default, Swift logs to syslog.

Next, create /etc/rsyslog.d/client.conf with the following line:

*.* @192.168.1.10

This instructs rsyslog to send all logs to the IP listed. In this example, the IP points to the Cloud Controller.

rsyslog Server Configuration

Designate a server as the central logging server. The best practice is to choose a server that is solely dedicated to this purpose. Create a file called /etc/rsyslog.d/server.conf with the following contents:

# Enable UDP
$ModLoad imudp
# Listen on 192.168.1.10 only
$UDPServerAddress 192.168.1.10
# Port 514
$UDPServerRun 514

# Create logging templates for nova
$template NovaFile,"/var/log/rsyslog/%HOSTNAME%/nova.log"
$template NovaAll,"/var/log/rsyslog/nova.log"

# Log everything else to syslog.log
$template DynFile,"/var/log/rsyslog/%HOSTNAME%/syslog.log"
*.* ?DynFile

# Log various openstack components to their own individual file
local0.* ?NovaFile
local0.* ?NovaAll
& ~

The above example configuration handles the nova service only. It first configures rsyslog to act as a server that runs on port 514. Next, it creates a series of logging templates. Logging templates control where received logs are stored. Using the example above, a nova log from c01.example.com goes to the following locations:

  • /var/log/rsyslog/c01.example.com/nova.log

  • /var/log/rsyslog/nova.log

This is useful as logs from c02.example.com go to:

  • /var/log/rsyslog/c02.example.com/nova.log

  • /var/log/rsyslog/nova.log

So you have an individual log file for each compute node as well as an aggregated log that contains nova logs from all nodes.
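One way to check that the central server files messages as expected is to send a test message on the local0 facility from a client. The following Python sketch uses the standard library SysLogHandler over UDP. The function name make_syslog_logger, the logger name, and the "nova-test" message prefix are assumptions made for illustration only.

```python
import logging
import logging.handlers


def make_syslog_logger(name, host, port=514,
                       facility=logging.handlers.SysLogHandler.LOG_LOCAL0):
    """Return a logger that sends records to host:port over UDP syslog."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port),
                                             facility=facility)
    # The prefix is arbitrary; it only makes the test line easy to find.
    handler.setFormatter(logging.Formatter("nova-test: %(message)s"))
    logger.addHandler(handler)
    return logger
```

For example, calling make_syslog_logger("rsyslog-check", "192.168.1.10").info("test message") on a compute node should, if the configuration above is in place, add a line to the aggregated nova.log and to the per-host nova.log on the central server.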

StackTach

StackTach is a tool created by Rackspace to collect and report the notifications sent by nova. Notifications are essentially the same as logs, but can be much more detailed. A good overview of notifications can be found at System Usage Data (https://wiki.openstack.org/wiki/SystemUsageData).

To enable nova to send notifications, add the following to nova.conf:

notification_topics=monitor 
notification_driver=nova.openstack.common.notifier.rabbit_notifier

Once nova is sending notifications, install and configure StackTach. Since StackTach is relatively new and constantly changing, installation instructions would quickly become outdated. Please refer to the StackTach GitHub repo (https://github.com/rackerlabs/stacktach) for instructions as well as a demo video.

Monitoring

There are two types of monitoring: watching for problems and watching usage trends. The former ensures that all services are up and running, creating a functional cloud. The latter involves monitoring resource usage over time in order to make informed decisions about potential bottlenecks and upgrades.

Process Monitoring

A basic type of alert monitoring is to simply check and see if a required process is running. For example, ensure that the nova-api service is running on the Cloud Controller:

[ root@cloud ~ ] # ps aux | grep nova-api
nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api --config-file=/etc/nova/nova.conf nova
nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12792 0.0 0.0 96052 22856 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12793 0.0 0.3 290688 115516 ? S Feb11 1:23 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12794 0.0 0.2 248636 77068 ? S Feb11 0:04 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api

You can create automated alerts for critical processes by using Nagios and NRPE. For example, to ensure that the nova-compute process is running on compute nodes, create an alert on your Nagios server that looks like this:

define service { 
    host_name c01.example.com 
    check_command check_nrpe_1arg!check_nova-compute 
    use generic-service 
    notification_period 24x7 
    contact_groups sysadmins 
    service_description nova-compute 
}

Then on the actual compute node, create the following NRPE configuration:

command[check_nova-compute]=/usr/lib/nagios/plugins/check_procs -c 1: -a nova-compute

Nagios checks that at least one nova-compute service is running at all times.
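The logic of such a process check is simple enough to sketch in Python. This is not how check_procs is implemented; it is a stand-in that counts matching commands in the text output of ps aux and returns the conventional Nagios plugin exit codes (0 for OK, 2 for CRITICAL). The function name check_process is hypothetical.

```python
OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_process(ps_output, name, minimum=1):
    """Return (exit_code, message) based on how many commands match name."""
    count = 0
    for line in ps_output.splitlines():
        # ps aux prints ten columns followed by the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10]
        # Ignore the grep process that shows up when ps is piped to grep.
        if name in command and not command.startswith("grep "):
            count += 1
    if count >= minimum:
        return OK, "PROCS OK: %d %s" % (count, name)
    return CRITICAL, "PROCS CRITICAL: %d %s" % (count, name)
```

Applied to the ps aux listing shown earlier in this section, the sketch counts the five nova-api entries and skips the grep line.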

Resource Alerting

Resource alerting provides notifications when one or more resources are critically low. While the monitoring thresholds should be tuned to your specific OpenStack environment, monitoring resource usage is not specific to OpenStack at all – any generic type of alert will work fine.

Some of the resources that you want to monitor include:

  • Disk Usage

  • Server Load

  • Memory Usage

  • Network IO

  • Available vCPUs

For example, to monitor disk capacity on a compute node with Nagios, add the following to your Nagios configuration:

define service { 
    host_name c01.example.com 
    check_command check_nrpe!check_all_disks!20% 10% 
    use generic-service 
    contact_groups sysadmins 
    service_description Disk 
}

On the compute node, add the following to your NRPE configuration:

command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e

Nagios alerts you with a WARNING when any disk on the compute node is 80% full and with a CRITICAL when it is 90% full.
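The thresholds passed to check_disk are expressed as free space, which is why 20% and 10% correspond to 80% and 90% full. A minimal Python sketch of the same classification follows. It is an illustration only, not the plugin's implementation; the function names are hypothetical, and the behavior exactly at a threshold is an assumption.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def classify_free_space(total_bytes, free_bytes, warn_pct=20.0, crit_pct=10.0):
    """Return (status, percent_free) using free-space thresholds."""
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct < crit_pct:
        return CRITICAL, free_pct
    if free_pct < warn_pct:
        return WARNING, free_pct
    return OK, free_pct


def check_path(path, warn_pct=20.0, crit_pct=10.0):
    """Classify the file system that holds `path`."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.total, usage.free, warn_pct, crit_pct)
```

For example, check_path("/var/lib/nova/instances") would classify the file system that holds instance disks on a compute node.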

OpenStack-specific Resources

Resources such as memory, disk, and CPU are generic resources that all servers (even non-OpenStack servers) have and are important to the overall health of the server. When dealing with OpenStack specifically, these resources are important for a second reason: ensuring enough are available in order to launch instances. There are a few ways you can see OpenStack resource usage.

The first is through the nova command:

# nova usage-list

This command displays a list of how many instances a tenant has running and some light usage statistics about the combined instances. It is useful for a quick overview of your cloud but does not go into much detail.

Next, the nova database contains three tables that store usage information.

The nova.quotas and nova.quota_usages tables store quota information. If a tenant’s quota differs from the default quota settings, it is stored in the nova.quotas table. For example:

mysql> select project_id, resource, hard_limit from quotas; 
+----------------------------------+-----------------------------+------------+
| project_id                       | resource                    | hard_limit |
+----------------------------------+-----------------------------+------------+
| 628df59f091142399e0689a2696f5baa | metadata_items              | 128        |
| 628df59f091142399e0689a2696f5baa | injected_file_content_bytes | 10240      |
| 628df59f091142399e0689a2696f5baa | injected_files              | 5          |
| 628df59f091142399e0689a2696f5baa | gigabytes                   | 1000       |
| 628df59f091142399e0689a2696f5baa | ram                         | 51200      |
| 628df59f091142399e0689a2696f5baa | floating_ips                | 10         |
| 628df59f091142399e0689a2696f5baa | instances                   | 10         |
| 628df59f091142399e0689a2696f5baa | volumes                     | 10         |
| 628df59f091142399e0689a2696f5baa | cores                       | 20         |
+----------------------------------+-----------------------------+------------+ 

The nova.quota_usages table keeps track of how many resources the tenant currently has in use:

mysql> select project_id, resource, in_use from quota_usages where project_id like '628%';
+----------------------------------+--------------+--------+ 
| project_id                       | resource     | in_use | 
+----------------------------------+--------------+--------+ 
| 628df59f091142399e0689a2696f5baa | instances    | 1      |
| 628df59f091142399e0689a2696f5baa | ram          | 512    | 
| 628df59f091142399e0689a2696f5baa | cores        | 1      | 
| 628df59f091142399e0689a2696f5baa | floating_ips | 1      | 
| 628df59f091142399e0689a2696f5baa | volumes      | 2      | 
| 628df59f091142399e0689a2696f5baa | gigabytes    | 12     | 
| 628df59f091142399e0689a2696f5baa | images       | 1      | 
+----------------------------------+--------------+--------+

By combining the resources used with the tenant’s quota, you can figure out a usage percentage. For example, if this tenant is using 1 Floating IP out of 10, then they are using 10% of their Floating IP quota. You can take this procedure and turn it into a formatted report:


+-----------------------------------+------------+------------+---------------+
| some_tenant                                                                 |
+-----------------------------------+------------+------------+---------------+
| Resource                          | Used       | Limit      |               |
+-----------------------------------+------------+------------+---------------+
| cores                             | 1          | 20         |           5 % |
| floating_ips                      | 1          | 10         |          10 % |
| gigabytes                         | 12         | 1000       |           1 % |
| images                            | 1          | 4          |          25 % |
| injected_file_content_bytes       | 0          | 10240      |           0 % |
| injected_file_path_bytes          | 0          | 255        |           0 % |
| injected_files                    | 0          | 5          |           0 % |
| instances                         | 1          | 10         |          10 % |
| key_pairs                         | 0          | 100        |           0 % |
| metadata_items                    | 0          | 128        |           0 % |
| ram                               | 512        | 51200      |           1 % |
| reservation_expire                | 0          | 86400      |           0 % |
| security_group_rules              | 0          | 20         |           0 % |
| security_groups                   | 0          | 10         |           0 % |
| volumes                           | 2          | 10         |          20 % |
+-----------------------------------+------------+------------+---------------+

The above was generated using a custom script which can be found on GitHub (https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report).

This script is specific to a certain OpenStack installation and must be modified to fit your environment. However, the logic should easily be transferable.
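The percentage calculation behind such a report is straightforward. The following Python sketch is not taken from the script mentioned above; the function name quota_usage_percent is hypothetical, and it assumes you have already loaded the limits and in-use values (for example, from the quotas and quota_usages tables) into dictionaries keyed by resource name.

```python
def quota_usage_percent(limits, in_use):
    """Return {resource: percent_used} for every resource with a limit."""
    report = {}
    for resource, limit in limits.items():
        used = in_use.get(resource, 0)
        # A non-positive limit cannot yield a meaningful percentage.
        report[resource] = round(100.0 * used / limit) if limit > 0 else None
    return report
```

With the values from the tables above, a tenant using 1 of 10 floating IPs and 12 of 1000 gigabytes gets 10 and 1 for those resources. Resources with no usage row are treated as unused.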

Intelligent Alerting

Intelligent alerting can be thought of as a form of continuous integration for operations. For example, you can easily check whether Glance is up and running by ensuring that the glance-api and glance-registry processes are running or by seeing whether glance-api is responding on port 9292.

But how can you tell if images are being successfully uploaded to the Image Service? Maybe the disk that Image Service is storing the images on is full or the S3 back-end is down. You could naturally check this by doing a quick image upload:

#!/bin/bash
#
# assumes that reasonable credentials have been stored at
# /root/openrc

. /root/openrc
wget https://launchpad.net/cirros/trunk/0.3.0/+download/cirros-0.3.0-x86_64-disk.img
glance image-create --name='cirros image' --is-public=true --container-format=bare --disk-format=qcow2 < cirros-0.3.0-x86_64-disk.img

By rolling this script into an alert for your monitoring system (such as Nagios), you have an automated way of ensuring that image uploads to the Image Service are working.

You must remove the image after each test. Even better, test whether you can successfully delete an image from the Image Service.

Intelligent alerting takes considerably more time to plan and implement than the other alerts described in this chapter. A good outline for implementing intelligent alerting is:

  • Review common actions in your cloud

  • Create ways to automatically test these actions

  • Roll these tests into an alerting system

Some other examples for Intelligent Alerting include:

  • Can instances be launched and destroyed?

  • Can users be created?

  • Can objects be stored and deleted?

  • Can volumes be created and destroyed?
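Rolling a functional test into an alerting system usually means wrapping it so that its outcome maps onto the monitoring system's status codes. The following Python sketch shows one way to do this under stated assumptions: the function name run_functional_check is hypothetical, the test is any command that exits with status 0 on success, and the return codes follow the Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

OK, CRITICAL, UNKNOWN = 0, 2, 3  # conventional Nagios plugin exit codes


def run_functional_check(command, timeout=60):
    """Run command (a list of arguments) and map the outcome to a status."""
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, timeout=timeout)
    except subprocess.TimeoutExpired:
        # A test that hangs is treated as a failure of the tested action.
        return CRITICAL, "CRITICAL: check timed out after %d seconds" % timeout
    except OSError as exc:
        # The test itself could not be started, so the result is unknown.
        return UNKNOWN, "UNKNOWN: could not run check: %s" % exc
    if result.returncode == 0:
        return OK, "OK: check succeeded"
    return CRITICAL, "CRITICAL: check exited with status %d" % result.returncode
```

A wrapper script could call run_functional_check with the path to a test such as the image upload script shown earlier, print the message, and exit with the returned code so that Nagios can interpret the result. The timeout guards against a test that hangs, for example when a back end is unreachable.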
