Chapter 11. Maintenance, Failures, and Debugging
Downtime, whether planned or unscheduled, is a certainty when running a cloud. This chapter aims to provide useful information for dealing with these occurrences, both proactively and reactively.
Cloud Controller and Storage Proxy Failures and Maintenance
The cloud controller and storage proxy are very similar to each other when it comes to expected and unexpected downtime. Typically, only one of each server type runs in the cloud, which makes their absence very noticeable.
For the cloud controller, the good news is that if your cloud is using the FlatDHCP multi-host HA network mode, existing instances and volumes continue to operate while the cloud controller is offline. For the storage proxy, however, no storage traffic is possible until it is back up and running.
Planned Maintenance
One way to plan for cloud controller or storage proxy maintenance is to simply do it off-hours, such as at 1 or 2 a.m. This strategy affects fewer users. If your cloud controller or storage proxy is too important to be unavailable at any point in time, you must look into high-availability options.
Rebooting a Cloud Controller or Storage Proxy
To reboot either server, simply issue the reboot command. The operating system cleanly shuts down services and then automatically reboots. If you want to be very thorough, run your backup jobs just before you reboot.
After a Cloud Controller or Storage Proxy Reboots
After a cloud controller reboots, ensure that all required services were successfully started:
# ps aux | grep nova-
# grep AMQP /var/log/nova/nova-*.log
# ps aux | grep glance-
# ps aux | grep keystone
# ps aux | grep cinder
Also check that all services are functioning:
# source openrc
# glance index
# nova list
# keystone tenant-list
For the storage proxy, ensure that the Object Storage service has resumed:
# ps aux | grep swift
Also check that it is functioning:
# swift stat
Total Cloud Controller Failure
Unfortunately, this is a rough situation. The cloud controller is an integral part of your cloud. If you have only one controller, many services are unavailable when it goes down.
To avoid this situation, create a highly available cloud controller cluster. This is outside the scope of this document, but you can read more in the draft OpenStack High Availability Guide (http://docs.openstack.org/trunk/openstack-ha/content/ch-intro.html).
The next best way is to use a configuration management tool such as Puppet to automatically build a cloud controller. This should not take more than 15 minutes if you have a spare server available. After the controller rebuilds, restore any backups taken (see the Backup and Recovery chapter).
Also, in practice, the nova-compute services on the compute nodes sometimes do not reconnect cleanly to RabbitMQ hosted on the controller when it comes back up after a long reboot; in that case, a restart of the nova services on the compute nodes is required, as shown below.
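As a minimal sketch, assuming the Ubuntu upstart jobs used elsewhere in this chapter, you would restart the service on each affected compute node and then confirm that it reconnected:
# restart nova-compute
# grep AMQP /var/log/nova/nova-compute.log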
Compute Node Failures and Maintenance
Sometimes a compute node either crashes unexpectedly or requires a reboot for maintenance reasons.
Planned Maintenance
If you need to reboot a compute node due to planned maintenance (such as a software or hardware upgrade), first ensure that all hosted instances have been moved off of the node. If your cloud is utilizing shared storage, use the nova live-migration command. First, get a list of instances that need to be moved:
# nova list --host c01.example.com --all-tenants
Next, migrate them one by one:
# nova live-migration <uuid> c02.example.com
If you are not using shared storage, you can use the --block-migrate option:
# nova live-migration --block-migrate <uuid> c02.example.com
After you have migrated all instances, ensure that the nova-compute service has stopped:
# stop nova-compute
If you use a configuration management system, such as Puppet, that ensures the nova-compute service is always running, you can temporarily move the init files:
# mkdir /root/tmp
# mv /etc/init/nova-compute.conf /root/tmp
# mv /etc/init.d/nova-compute /root/tmp
Next, shut your compute node down, perform your maintenance, and turn the node back on. You can re-enable the nova-compute service by undoing the previous commands:
# mv /root/tmp/nova-compute.conf /etc/init
# mv /root/tmp/nova-compute /etc/init.d/
Then start the nova-compute service:
# start nova-compute
You can now optionally migrate the instances back to their original compute node.
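For example, assuming c01.example.com was the original host and shared storage is in use, the move back might look like this:
# nova live-migration <uuid> c01.example.com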
After a Compute Node Reboots
When you reboot a compute node, first verify that it booted successfully. This includes ensuring that the nova-compute service is running:
# ps aux | grep nova-compute
# status nova-compute
Also ensure that it has successfully connected to the AMQP server:
# grep AMQP /var/log/nova/nova-compute.log
2013-02-26 09:51:31 12427 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 199.116.232.36:5672
After the compute node is successfully running, you must deal with the instances that are hosted on that compute node, because none of them are running. Depending on your SLA with your users or customers, you might have to start each instance and ensure that it starts correctly.
Instances
You can create a list of instances that are hosted on the compute node by running the following command:
# nova list --host c01.example.com --all-tenants
After you have the list, you can use the nova command to start each instance:
# nova reboot <uuid>
Any time an instance shuts down unexpectedly, it might have problems on boot. For example, the instance might require an fsck on the root partition. If this happens, the user can use the dashboard VNC console to fix this.
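If console access through the dashboard is not convenient, a console URL can also be requested with the nova client; this sketch assumes the novnc console type is enabled in your deployment:
# nova get-vnc-console <uuid> novnc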
If an instance does not boot, meaning virsh list never shows the instance as even attempting to boot, do the following on the compute node:
# tail -f /var/log/nova/nova-compute.log
Try executing the nova reboot command again. You should see an error message about why the instance was not able to boot.
In most cases, the error is due to something in libvirt’s XML file (/etc/libvirt/qemu/instance-xxxxxxxx.xml) that no longer exists. You can enforce re-creation of the XML file as well as rebooting the instance by running:
# nova reboot --hard <uuid>
Inspecting and Recovering Data from Failed Instances
In some scenarios, instances are running but are inaccessible through SSH and do not respond to any command. The VNC console could be displaying a boot failure or kernel panic error messages. This could be an indication of file system corruption on the VM itself. If you need to recover files or inspect the content of the instance, qemu-nbd can be used to mount the disk.
If you access or view the user’s content and data, get their approval first!
To access the instance’s disk (/var/lib/nova/instances/instance-xxxxxx/disk), the following steps must be followed:
1. Suspend the instance using the virsh command.
2. Connect the qemu-nbd device to the disk.
3. Mount the qemu-nbd device.
4. Unmount the device after inspecting.
5. Disconnect the qemu-nbd device.
6. Resume the instance.
If you do not follow steps 4 through 6, OpenStack Compute cannot manage the instance any longer. It fails to respond to any command issued by OpenStack Compute, and it is marked as shut down.
Once you mount the disk file, you should be able to access it and treat it as a collection of normal directories with files and a directory structure. However, we do not recommend that you edit or touch any files, because this could change the access control lists (ACLs) and make the instance unbootable if it is not already.
- Suspend the instance using the virsh command, taking note of the internal ID:
root@compute-node:~# virsh list
 Id Name                 State
----------------------------------
  1 instance-00000981    running
  2 instance-000009f5    running
 30 instance-0000274a    running

root@compute-node:~# virsh suspend 30
Domain 30 suspended
- Connect the qemu-nbd device to the disk:
root@compute-node:/var/lib/nova/instances/instance-0000274a# ls -lh
total 33M
-rw-rw---- 1 libvirt-qemu kvm  6.3K Oct 15 11:31 console.log
-rw-r--r-- 1 libvirt-qemu kvm   33M Oct 15 22:06 disk
-rw-r--r-- 1 libvirt-qemu kvm  384K Oct 15 22:06 disk.local
-rw-rw-r-- 1 nova         nova 1.7K Oct 15 11:30 libvirt.xml
root@compute-node:/var/lib/nova/instances/instance-0000274a# qemu-nbd -c /dev/nbd0 `pwd`/disk
- Mount the qemu-nbd device.
The qemu-nbd device tries to export the instance disk’s different partitions as separate devices. For example, if vda is the disk and vda1 is the root partition, qemu-nbd exports the device as /dev/nbd0 and /dev/nbd0p1, respectively:
# mount the root partition of the device
root@compute-node:/var/lib/nova/instances/instance-0000274a# mount /dev/nbd0p1 /mnt/
# List the contents of /mnt, and the VM's folders are displayed
# You can inspect the folders and access the /var/log/ files
To examine the secondary or ephemeral disk, use an alternate mount point if you want both primary and secondary drives mounted at the same time.
# umount /mnt
# qemu-nbd -c /dev/nbd1 `pwd`/disk.local
# mount /dev/nbd1 /mnt/
root@compute-node:/var/lib/nova/instances/instance-0000274a# ls -lh /mnt/
total 76K
lrwxrwxrwx.  1 root root    7 Oct 15 00:44 bin -> usr/bin
dr-xr-xr-x.  4 root root 4.0K Oct 15 01:07 boot
drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 dev
drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc
drwxr-xr-x.  3 root root 4.0K Oct 15 01:07 home
lrwxrwxrwx.  1 root root    7 Oct 15 00:44 lib -> usr/lib
lrwxrwxrwx.  1 root root    9 Oct 15 00:44 lib64 -> usr/lib64
drwx------.  2 root root  16K Oct 15 00:42 lost+found
drwxr-xr-x.  2 root root 4.0K Feb  3  2012 media
drwxr-xr-x.  2 root root 4.0K Feb  3  2012 mnt
drwxr-xr-x.  2 root root 4.0K Feb  3  2012 opt
drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 proc
dr-xr-x---.  3 root root 4.0K Oct 15 21:56 root
drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run
lrwxrwxrwx.  1 root root    8 Oct 15 00:44 sbin -> usr/sbin
drwxr-xr-x.  2 root root 4.0K Feb  3  2012 srv
drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 sys
drwxrwxrwt.  9 root root 4.0K Oct 15 16:29 tmp
drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr
drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var
- Once you have completed the inspection, unmount the mount point and release the qemu-nbd device:
root@compute-node:/var/lib/nova/instances/instance-0000274a# umount /mnt
root@compute-node:/var/lib/nova/instances/instance-0000274a# qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected
- Resume the instance using virsh:
root@compute-node:/var/lib/nova/instances/instance-0000274a# virsh list
 Id Name                 State
----------------------------------
  1 instance-00000981    running
  2 instance-000009f5    running
 30 instance-0000274a    paused

root@compute-node:/var/lib/nova/instances/instance-0000274a# virsh resume 30
Domain 30 resumed
Volumes
If the affected instances also had attached volumes, first generate a list of instance and volume UUIDs:
mysql> select nova.instances.uuid as instance_uuid,
       cinder.volumes.id as volume_uuid, cinder.volumes.status,
       cinder.volumes.attach_status, cinder.volumes.mountpoint,
       cinder.volumes.display_name
       from cinder.volumes
       inner join nova.instances on cinder.volumes.instance_uuid=nova.instances.uuid
       where nova.instances.host = 'c01.example.com';
You should see a result like the following:
+--------------+------------+--------+---------------+------------+--------------+
|instance_uuid |volume_uuid | status | attach_status | mountpoint | display_name |
+--------------+------------+--------+---------------+------------+--------------+
|9b969a05      |1f0fbf36    | in-use | attached      | /dev/vdc   | test         |
+--------------+------------+--------+---------------+------------+--------------+
1 row in set (0.00 sec)
Next, manually detach and reattach the volumes:
# nova volume-detach <instance_uuid> <volume_uuid>
# nova volume-attach <instance_uuid> <volume_uuid> /dev/vdX
Where X is the proper mount point. Make sure that the instance has successfully booted and is at a login screen before running the above commands.
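Using the sample row above as an illustration (the shortened UUIDs are only the values shown in that output), the reattachment of the test volume would look roughly like:
# nova volume-detach 9b969a05 1f0fbf36
# nova volume-attach 9b969a05 1f0fbf36 /dev/vdc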
Total Compute Node Failure
If a compute node fails and won’t be fixed for a few hours (or ever), you can relaunch all instances that are hosted on the failed node if you use shared storage for /var/lib/nova/instances.
To do this, generate a list of instance UUIDs that are hosted on the failed node by running the following query on the nova database:
mysql> select uuid from instances where host = 'c01.example.com' and deleted = 0;
Next, tell Nova that all instances that used to be hosted on c01.example.com are now hosted on c02.example.com:
mysql> update instances set host = 'c02.example.com' where host = 'c01.example.com' and deleted = 0;
After that, use the nova command to reboot all instances that were on c01.example.com while regenerating their XML files at the same time:
# nova reboot --hard <uuid>
Finally, re-attach volumes using the same method described in Volumes.
/var/lib/nova/instances
It’s worth mentioning this directory in the context of failed compute nodes. This directory contains the libvirt KVM file-based disk images for the instances that are hosted on that compute node. If you are not running your cloud in a shared storage environment, this directory is unique across all compute nodes.
/var/lib/nova/instances contains two types of directories.
The first is the _base directory. This contains all of the cached base images from glance for each unique image that has been launched on that compute node. Files ending in _20 (or a different number) are the ephemeral base images.
The other directories are titled instance-xxxxxxxx. These directories correspond to instances running on that compute node. The files inside are related to one of the files in the _base directory. They’re essentially differential-based files containing only the changes made from the original _base directory.
All files and directories in /var/lib/nova/instances are uniquely named. The files in _base are uniquely titled for the glance image that they are based on, and the directory names instance-xxxxxxxx are uniquely titled for that particular instance. For example, if you copy all data from /var/lib/nova/instances on one compute node to another, you do not overwrite any files or cause any damage to images that have the same unique name, because they are essentially the same file.
Although this method is not documented or supported, you can use it when your compute node is permanently offline but you have instances locally stored on it.
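A minimal sketch of such a copy, assuming the failed node's disk is still readable and that c02.example.com is the node receiving the data, might be:
# rsync -a /var/lib/nova/instances/ c02.example.com:/var/lib/nova/instances/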
Storage Node Failures and Maintenance
Because of Object Storage’s high redundancy, dealing with object storage node issues is a lot easier than dealing with compute node issues.
Rebooting a Storage Node
If a storage node requires a reboot, simply reboot it. Requests for data hosted on that node are redirected to other copies while the server is rebooting.
Shutting Down a Storage Node
If you need to shut down a storage node for an extended period of time (1+ days), consider removing the node from the storage ring. For example:
# swift-ring-builder account.builder remove <ip address of storage node>
# swift-ring-builder container.builder remove <ip address of storage node>
# swift-ring-builder object.builder remove <ip address of storage node>
# swift-ring-builder account.builder rebalance
# swift-ring-builder container.builder rebalance
# swift-ring-builder object.builder rebalance
Next, redistribute the ring files to the other nodes:
# for i in s01.example.com s02.example.com s03.example.com
> do
>   scp *.ring.gz $i:/etc/swift
> done
These actions effectively take the storage node out of the storage cluster.
When the node is able to rejoin the cluster, just add it back to the ring. The exact syntax for adding a node to your Swift cluster with swift-ring-builder heavily depends on the options you used when you originally created your cluster. Please refer back to those commands.
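For illustration only, re-adding a single device might look like the following; the zone, IP address, port, device name, and weight shown here are placeholders and must match what your cluster was originally built with:
# swift-ring-builder object.builder add z1-10.0.0.5:6000/sdb 100
# swift-ring-builder object.builder rebalance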
Replacing a Swift Disk
If a hard drive fails in an Object Storage node, replacing it is relatively easy. This assumes that your Object Storage environment is configured correctly, so that the data stored on the failed drive is also replicated to other drives in the Object Storage environment.
This example assumes that /dev/sdb has failed.
First, unmount the disk:
# umount /dev/sdb
Next, physically remove the disk from the server and replace it with a working disk.
Ensure that the operating system has recognized the new disk:
# dmesg | tail
You should see a message about /dev/sdb.
Because it is recommended not to use partitions on a swift disk, simply format the disk as a whole:
# mkfs.xfs /dev/sdb
Finally, mount the disk:
# mount -a
Swift should notice the new disk and that no data exists. It then begins replicating the data to the disk from the other existing replicas.
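Note that mount -a only works if /etc/fstab already references the device. A typical entry for a swift device might look like the following; the mount point and options here are assumptions, so keep whatever your cluster was originally deployed with:
# grep sdb /etc/fstab
/dev/sdb /srv/node/sdb xfs noatime,nodiratime,logbufs=8 0 0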
Handling a Complete Failure
A common way of dealing with the recovery from a full system failure, such as a power outage of a data center, is to assign each service a priority and restore them in order.
Priority | Services
---------+--------------------------------------------------------
       1 | Internal network connectivity
       2 | Backing storage services
       3 | Public network connectivity for user virtual machines
       4 | nova-compute, nova-network, cinder hosts
       5 | User virtual machines
      10 | Message queue and database services
      15 | Keystone services
      20 | cinder-scheduler
      21 | Image Catalogue and Delivery services
      22 | nova-scheduler services
      98 | cinder-api
      99 | nova-api services
     100 | Dashboard node
Use this example priority list to ensure that user-affected services are restored as soon as possible, but not before a stable environment is in place. Of course, despite being listed as a single line item, each step requires significant work. For example, just after starting the database, you should check its integrity, or, after starting the nova services, you should verify that the hypervisor matches the database and fix any mismatches (a quick check is sketched below).
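As a rough sketch of that last check, you can compare what each hypervisor reports with what the database believes is running on that host (the hostname is an example):
# virsh list --all
# nova list --host c01.example.com --all-tenants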
Configuration Management
Maintaining an OpenStack cloud requires that you manage multiple physical servers, and this number might grow over time. Because managing nodes manually is error-prone, we strongly recommend that you use a configuration management tool. These tools automate the process of ensuring that all of your nodes are configured properly and encourage you to maintain your configuration information (such as packages and configuration options) in a version controlled repository.
Several configuration management tools are available, and this guide does not recommend a specific one. The two most popular in the OpenStack community are Puppet (https://puppetlabs.com/), with available OpenStack Puppet modules (http://github.com/puppetlabs/puppetlabs-openstack), and Chef (http://opscode.com/chef), with available OpenStack Chef recipes (https://github.com/opscode/openstack-chef-repo). Other newer configuration tools include Juju (https://juju.ubuntu.com/), Ansible (http://ansible.cc), and Salt (http://saltstack.com); more mature configuration management tools include CFEngine (http://cfengine.com) and Bcfg2 (http://bcfg2.org).
Working with Hardware
Similar to your initial deployment, you should ensure that all hardware is appropriately burned in before adding it to production. Run software that uses the hardware to its limits, maxing out RAM, CPU, disk, and network. Many options are available, and they normally double as benchmark software, so you also get a good idea of the performance of your system.
Adding a Compute Node
If you find that you have reached or are reaching the capacity limit of your computing resources, you should plan to add additional compute nodes. Adding more nodes is quite easy. The process for adding nodes is the same as when the initial compute nodes were deployed to your cloud: use an automated deployment system to bootstrap the bare-metal server with the operating system and then have a configuration management system install and configure the OpenStack Compute service. Once the Compute service has been installed and configured in the same way as the other compute nodes, it automatically attaches itself to the cloud. The cloud controller notices the new node(s) and begins scheduling instances to launch there.
If your OpenStack Block Storage nodes are separate from your compute nodes, the same procedure still applies as the same queuing and polling system is used in both services.
We recommend that you use the same hardware for new compute and block storage nodes. At the very least, ensure that the CPUs are similar in the compute nodes, so as not to break live migration.
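One quick, admittedly incomplete, way to compare CPUs across nodes is to check the reported model on each one, for example:
# grep 'model name' /proc/cpuinfo | sort -u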
Adding an Object Storage Node
Adding a new object storage node is different than adding compute or block storage nodes. You still want to initially configure the server by using your automated deployment and configuration management systems. After that is done, you need to add the local disks of the object storage node into the object storage ring. The exact command to do this is the same command that was used to add the initial disks to the ring. Simply re-run this command on the object storage proxy server for all disks on the new object storage node. Once this has been done, rebalance the ring and copy the resulting ring files to the other storage nodes.
If your new object storage node has a different number of disks than the original nodes have, the command to add the new node is different than the original commands. These parameters vary from environment to environment.
Replacing Components
Failures of hardware are common in large-scale deployments such as an infrastructure cloud. Consider your processes and balance time saving against availability. For example, an Object Storage cluster can easily live with dead disks in it for some period of time if it has sufficient capacity. Or, if your compute installation is not full, you could consider live migrating instances off a host with a RAM failure until you have time to deal with the problem.
Databases
Almost all OpenStack components have an underlying database to store persistent information. Usually this database is MySQL. Normal MySQL administration is applicable to these databases; OpenStack does not configure the databases in any unusual way. Basic administration includes performance tweaking, high availability, backup, recovery, and repairing. For more information, see a standard MySQL administration guide.
You can perform a couple of tricks with the database to either more quickly retrieve information or fix a data inconsistency error, for example, an instance that was terminated but whose status was not updated in the database. These tricks are discussed throughout this book.
Database Connectivity
Review each component’s configuration file to see how it accesses its corresponding database. Look for either sql_connection or simply connection:
# grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf /etc/cinder/cinder.conf /etc/keystone/keystone.conf
sql_connection = mysql://nova:[email protected]/nova
sql_connection = mysql://glance:[email protected]/glance
sql_connection = mysql://glance:[email protected]/glance
sql_connection=mysql://cinder:[email protected]/cinder
connection = mysql://keystone_admin:[email protected]/keystone
The connection strings take this format:
mysql://<username>:<password>@<hostname>/<database name>
Performance and Optimizing
As your cloud grows, MySQL is utilized more and more. If you suspect that MySQL might be becoming a bottleneck, you should start researching MySQL optimization. The MySQL manual has an entire section dedicated to this topic: Optimization Overview (http://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html).
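For example, one common first step, described here as a generic MySQL technique rather than anything OpenStack-specific, is to enable the slow query log and see which queries dominate:
mysql> SET GLOBAL slow_query_log = 'ON';
mysql> SET GLOBAL long_query_time = 2;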
HDWMY
Here’s a quick list of various to-do items for each hour, day, week, month, and year. Please note that these tasks are neither required nor definitive but are helpful ideas:
Hourly
- Check your monitoring system for alerts and act on them.
- Check your ticket queue for new tickets.
Daily
- Check for instances in a failed or weird state and investigate why.
- Check for security patches and apply them as needed.
Weekly
- Check cloud usage:
  - User quotas
  - Disk space
  - Image usage
  - Large instances
  - Network usage (bandwidth and IP usage)
- Verify your alert mechanisms are still working.
Monthly
- Check usage and trends over the past month.
- Check for user accounts that should be removed.
- Check for operator accounts that should be removed.
Quarterly
- Review usage and trends over the past quarter.
- Prepare any quarterly reports on usage and statistics.
- Review and plan any necessary cloud additions.
- Review and plan any major OpenStack upgrades.
Semi-Annually
- Upgrade OpenStack.
- Clean up after the OpenStack upgrade (any unused or new services to be aware of?).
Determining which Component Is Broken
OpenStack’s collection of different components interact with each other strongly. For example, uploading an image requires interaction from nova-api, glance-api, glance-registry, Keystone, and potentially swift-proxy. As a result, it is sometimes difficult to determine exactly where problems lie. Assisting in this is the purpose of this section.
Tailing Logs
The first place to look is the log file related to the command you are trying to run. For example, if nova list is failing, try tailing a Nova log file and running the command again:
Terminal 1:
# tail -f /var/log/nova/nova-api.log
Terminal 2:
# nova list
Look for any errors or traces in the log file. For more information, see the chapter on Logging and Monitoring.
If the error indicates that the problem is with another component, switch to tailing that component’s log file. For example, if nova cannot access glance, look at the glance-api log:
Terminal 1:
# tail -f /var/log/glance/api.log
Terminal 2:
# nova list
Wash, rinse, repeat until you find the core cause of the problem.
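If you are unsure which component is at fault, you can also tail several log files at once; the paths below assume the default package locations used throughout this guide:
# tail -f /var/log/nova/*.log /var/log/glance/*.log /var/log/keystone/*.log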
Running Daemons on the CLI
Unfortunately, sometimes the error is not apparent from the log
files. In this case, switch tactics and use a different command, maybe
run the service directly on the command line. For example, if the
glance-api
service refuses to start and stay running, try
launching the daemon from the command line:
# sudo -u glance -H glance-api
This might print the error and cause of the problem.
The -H flag is required when running the daemons with sudo because some daemons will write files relative to the user’s home directory, and this write may fail if -H is left off.
Example of Complexity
One morning, a compute node failed to run any instances. The log files were a bit vague, claiming that a certain instance was unable to be started. This ended up being a red herring because the instance was simply the first instance in alphabetical order, so it was the first instance that nova-compute would touch.
Further troubleshooting showed that libvirt was not running at all. This made more sense. If libvirt wasn’t running, then no instance could be virtualized through KVM. Upon trying to start libvirt, it would silently die immediately. The libvirt logs did not explain why.
Next, the libvirtd daemon was run on the command line. Finally a helpful error message: it could not connect to d-bus. As ridiculous as it sounds, libvirt, and thus nova-compute, relies on d-bus, and somehow d-bus crashed. Simply starting d-bus set the entire chain back on track, and soon everything was back up and running.
Upgrades
With the exception of Object Storage, an upgrade from one version of OpenStack to another is a great deal of work.
The upgrade process generally follows these steps:
- Read the release notes and documentation.
- Find incompatibilities between different versions.
- Plan an upgrade schedule and complete it in order on a test cluster.
- Run the upgrade.
You can perform an upgrade while user instances are running. However, this strategy can be dangerous. Don’t forget to give your users appropriate notice, and take backups.
The general order that seems to be most successful is:
- Upgrade the OpenStack Identity service (keystone).
- Upgrade the OpenStack Image service (glance).
- Upgrade all OpenStack Compute (nova) services.
- Upgrade all OpenStack Block Storage (cinder) services.
For each of these steps, complete the following sub-steps:
- Stop services.
- Create a backup of configuration files and databases.
- Upgrade the packages using your distribution’s package manager.
- Update the configuration files according to the release notes.
- Apply the database upgrades.
- Restart the services.
- Verify that everything is running.
Probably the most important step of all is the pre-upgrade testing. Especially if you are upgrading immediately after release of a new version, undiscovered bugs might hinder your progress. Some deployers prefer to wait until the first point release is announced. However, if you have a significant deployment, you might follow the development and testing of the release, thereby ensuring that bugs for your use cases are fixed.
To complete an upgrade of OpenStack Compute while keeping instances running, you should be able to use live migration to move machines around while performing updates and then move them back afterward; whether this is possible is a property of the hypervisor. However, it is critical to ensure that the database changes are successful; otherwise, an inconsistent cluster state could arise.
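The database upgrades themselves are applied with each project's management utility. As a sketch, and assuming the utilities shipped with the releases this guide covers, the commands look like this; always check the release notes for your target version:
# keystone-manage db_sync
# glance-manage db_sync
# nova-manage db sync
# cinder-manage db sync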
Performing some ‘cleaning’ of the cluster prior to starting the upgrade is also a good idea, to ensure its state is consistent. For example, some operators have reported issues with instances that were not fully removed from the system after deletion. Running a command equivalent to:
$ virsh list --all
to find deleted instances that are still registered in the hypervisor, and removing them prior to running the upgrade, can avoid issues.
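A minimal sketch of that cleanup, once you have confirmed a stray domain no longer exists from Nova's point of view, is to destroy and undefine it on the compute node:
# virsh destroy instance-xxxxxxxx
# virsh undefine instance-xxxxxxxx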
Uninstalling
While we’d always recommend using your automated deployment system to re-install systems from scratch, sometimes you do need to remove OpenStack from a system the hard way. Here’s how:
- Remove all packages.
- Remove remaining files.
- Remove databases.
These steps depend on your underlying distribution, but in general you should be looking for ‘purge’ commands in your package manager, like aptitude purge ~c $package. Following this, you can look for orphaned files in the directories referenced throughout this guide. For uninstalling the database properly, refer to the manual appropriate for the product in use.
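For example, on an Ubuntu system the package and database cleanup might look roughly like the following; the package patterns and database names are illustrative, so adjust them for what is actually installed:
# aptitude purge '~nnova' '~nglance' '~nkeystone' '~ncinder'
mysql> drop database nova;
mysql> drop database glance;
mysql> drop database keystone;
mysql> drop database cinder;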