Chapter 21
Troubleshooting Linux
In any complex operating system, there are a lot of things that can go wrong. You can fail to save a file because you are out of disk space. An application can crash because the system is out of memory. The system can fail to boot up properly for, well, a lot of different reasons.
In Linux, the dedication to openness and the focus on making the software run with maximum efficiency has led to an amazing number of tools you can use to troubleshoot every imaginable problem. In fact, if software isn't working as you would like, you even have the ultimate opportunity to rewrite the code yourself (although we don't cover how to do that here).
This chapter takes on some of the most common problems you can run into on a Linux system and describes the tools and procedures you can use to overcome those problems. Topics are broken down by areas of troubleshooting, such as the boot process, software packages, networking, memory issues, and rescue mode.
Before you can begin troubleshooting a running Linux system itself, that system needs to boot up. For a Linux system to boot up, a series of things has to happen. A Linux system installed directly on a PC architecture computer goes through the following steps to boot up:
The exact activities that occur at each of these points are undergoing a transformation. Boot loaders are changing to accommodate new kinds of hardware. The initialization process is changing so systems can more finely tune the order in which services start and stop.
To help understand the basic steps that occur in the boot process, the next sections follow the boot process for a Red Hat Enterprise Linux 6 system. Although the details of this process are different for the latest Fedora and Ubuntu systems, you will follow the same major steps for troubleshooting the boot process.
The Red Hat Enterprise Linux 6 operating system uses many components to boot the system, most of which have been around for a while. The GRUB boot loader for RHEL 6 has not yet moved to the newer GRUB 2 interface. You can expect a later version of RHEL to move to the newer systemd style of startup that is now in Fedora. For now, the RHEL 6 init process (which uses the newer Upstart init daemon to direct the starting and stopping of services) still supports the older System V style initialization scripts used to start system services.
Troubleshooting the RHEL 6 boot process begins when you turn on your computer and ends when all the services are up and running. At that point, there is typically a graphical or text-based login prompt available from the console, ready for you to log in. Go through the following ordered sections to understand what happens at each stage of the boot process and where you might need to troubleshoot.
BIOS, which stands for Basic Input Output System, is the first code to run when you turn on your PC. Its main job is to initialize the hardware and then hand off control of the boot process to a boot loader. Once an operating system is installed, typically you should just let the BIOS do its work and not interrupt it.
There are, however, occasions when you want to interrupt the BIOS. Right after you turn on the power, you should see a BIOS screen that usually includes a few words noting how to go into Setup mode and change the boot order. If you press the function key noted (often F1, F2, or F12) to choose one of those two items, here's what you can do:
For my Dell workstation, once I see the BIOS screen, I immediately press the F2 function key to go into Setup or F12 to temporarily change the boot order. The next sections explore what you can troubleshoot from the Setup and Boot Order screens.
As I already noted, you can usually let the BIOS start without interruption and have the system boot up to the default boot device (probably the hard drive). However, here are some instances when you may want to go into Setup mode and change something in the BIOS.
Depending on the hardware attached to your computer, a typical boot order might boot a CD/DVD drive first, then a floppy drive, then the hard drive, then a USB device, and finally the network interface card. The BIOS would go to each device, looking for a boot loader. If the BIOS finds a boot loader, it starts it. If no boot loader is located, the BIOS moves on to the next device, until all are tried. If no boot loader is found, the computer fails to boot.
One problem that could occur with the boot order is that the device you want to boot may not appear in the boot order at all. In that case, going to the Setup screen, as described in the previous section, to either enable the device or change a setting to make it bootable, may be the thing to do.
If the device you want to boot from does appear in the boot order, typically you just have to move the arrow key to highlight the device you want and press Enter. The following are reasons for selecting your own device to boot:
Assuming you get past any problems you have with the BIOS, the next step is for the BIOS to start the boot loader.
Typically, the BIOS finds the master boot record on the first hard disk and begins loading that boot loader in stages. Chapter 9, “Installing Linux,” describes the GRUB boot loader that is used with most modern Linux systems, including RHEL, Fedora, and Ubuntu. The GRUB boot loader in RHEL, described here, is an earlier version than the GRUB 2 boot loader included with Fedora and Ubuntu.
In this discussion, I am interested in the boot loader from the perspective of what to do if the boot loader fails or what ways you might want to interrupt the boot loader to change the behavior of the boot process.
Here are a few ways in which the boot loader might fail in RHEL 6 and some ways you can overcome those failures:
If the BIOS finds the boot loader in the master boot record of the disk and that boot loader finds the GRUB configuration files on the disk, the boot loader starts a countdown of about three to five seconds. During that countdown, you can interrupt the boot loader (before it boots the default operating system) by pressing any key.
When you interrupt the boot loader, you should see a menu of available entries to boot. Those entries can represent different available kernels to boot. But they may also represent totally different operating systems (such as Windows, BSD, or Ubuntu).
Here are some reasons to interrupt the boot process from the boot menu to troubleshoot Linux:
Assuming you have selected the kernel you want, the boot loader tries to run the kernel, including the content of the initial RAM disk (which contains drivers and other software needed to boot your particular hardware).
Once the kernel starts, there isn't much to do except watch for potential problems. For RHEL, you will see a Red Hat Enterprise Linux screen with a slow-spinning icon. If you want to watch messages detailing the boot process scroll by, press the Esc key.
At this point, the kernel tries to load the drivers and modules needed to use the hardware on the computer. The main things to look for at this point (although they may scroll by quickly) are hardware failures that may prevent some feature from working properly. Although it is much rarer than it used to be, there may be no driver available for a piece of hardware, or the wrong driver may get loaded and cause errors.
In addition to scrolling past on the screen, messages produced when the kernel boots are copied to the kernel ring buffer. As its name implies, the kernel ring buffer stores kernel messages in a buffer, throwing out older messages once that buffer is full. Once the computer boots up completely, you can log into the system and type the following command to capture these kernel messages in a file (then view them with the less command):
# dmesg > /tmp/kernel_msg.txt
# less /tmp/kernel_msg.txt
I like to direct the kernel messages into a file (choose any name you like) so the messages can be examined later or sent to someone who can help debug any problems. The messages appear as components are detected, such as your CPU, memory, network cards, hard drives, and so on.
What you want to look for are drivers that fail to load or messages that show certain features of the hardware failed to be enabled. For example, I once had a TV tuner card (for watching television on my computer screen) that set the wrong tuner type for the card that was detected. Using information about the TV card's model number and the type of failure, I found that passing an option to the card's driver allowed me to try different settings until I found the one that matched my tuner card.
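To narrow a long kernel log down to the lines worth reading first, you can filter the saved file for words that usually accompany trouble. The following is only a rough sketch: the function name scan_kernel_msgs and the keyword list are my own assumptions, not a standard tool, and a clean result does not prove that every driver loaded correctly.

```shell
# Sketch: show saved kernel messages that hint at driver or hardware trouble.
# The keyword list is an assumption; extend it to suit your hardware.
scan_kernel_msgs() {
    # $1 is a file of saved dmesg output, such as /tmp/kernel_msg.txt
    # -i ignores case, -n prefixes each match with its line number
    grep -inE 'fail|error|timed out|unable to' "$1"
}

# Example use, after saving the ring buffer to a file:
#   dmesg > /tmp/kernel_msg.txt
#   scan_kernel_msgs /tmp/kernel_msg.txt
```

The line numbers in the output make it easy to jump to the surrounding messages in less, where you can see which device was being detected when the failure occurred.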
In describing how to view kernel startup messages, I have gotten ahead of myself a bit. Before you can log in and see the kernel messages, the kernel needs to finish bringing up the system. As soon as the kernel is done initially detecting hardware and loading drivers, it passes off control of everything else that needs to be done to boot the system to the init process.
The init process for Red Hat Enterprise Linux version 6 is currently in transition from using the well-known System V style for starting up services to a newer version of the init process called Upstart. In this section, I'll describe how the older init process works, and then mention how this is different in newer releases of Fedora and probably later versions of RHEL. (Chapter 15, “Starting and Stopping Services,” contains more details on the init process and start-up scripts.)
In RHEL, when the kernel hands off control of the boot process to the init process, the init process checks the /etc/inittab file for directions on how to boot the system. The inittab file tells the init process what the default runlevel is, and then points to files in the /etc/init directory to do such things as remap some keystrokes (such as Ctrl+Alt+Delete to reboot the system), start virtual consoles, and identify the location of the script for initializing basic services on the system: /etc/rc.sysinit.
When you're troubleshooting Linux problems that occur after the init process takes over, two likely culprits are the processing by the rc.sysinit file and the runlevel scripts.
As the name implies, the /etc/rc.sysinit script initializes many basic features on the system. When that file is run by init, rc.sysinit sets the system's hostname, sets up the /proc and /sys filesystems, sets up SELinux, sets kernel parameters, and performs dozens of other actions.
One of the most critical functions of rc.sysinit is to get the storage set up on the system. In fact, if the boot process fails during processing of rc.sysinit, in all likelihood, the script was unable to find, mount, or decrypt the local or remote storage devices needed for the system to run.
The following is a list of some common failures that can occur from tasks run from the rc.sysinit file and ways of dealing with those failures.
# mount -o remount,rw /
# vim /etc/fstab
# mount -a
# reboot
Other features are set up by the rc.sysinit file as well. The rc.sysinit script sets the SELinux mode and loads hardware modules. The script constructs software RAID arrays and sets up Logical Volume Management volume groups and volumes. Troubles occurring in any of these areas are reflected in error messages that appear on the screen after the kernel boots and before runlevel processes start up.
In Red Hat Enterprise Linux 6.x and earlier, when the system first comes up, services are started based on the default runlevel. There are seven different runlevels, from 0 to 6. The default runlevel is typically 3 (for a server) or 5 (for a desktop). Here are descriptions of the runlevels in Linux systems up to RHEL 6:
Runlevels are meant to set the level of activity on a Linux system. A default runlevel is set in the /etc/inittab file, but you can change the runlevel any time you like using the init command. For example, as root you might type init 0 to shut down, init 3 if you want to kill the graphical interface (from runlevel 5) but leave all other services up, or init 6 to reboot.
Normal default runlevels (in other words, the runlevel you boot to) are 3 (for a server) and 5 (for a desktop). Often servers don't have desktops installed and boot to runlevel 3 because they don't want to incur the processing overhead or the added security risks for having a desktop running on their web servers or file servers.
You can go either up or down with runlevels. For example, an administrator doing maintenance on a system may boot to runlevel 1 and then type init 3 to boot up to the full services needed on a server. Someone debugging a desktop may boot to runlevel 5 and then go down to runlevel 3 to try to fix the desktop (such as install a new driver or change screen resolution) before typing init 5 to return to the desktop.
The level of services at each runlevel is determined by the runlevel scripts that are set to start. There are rc directories for each runlevel: /etc/rc0.d/, /etc/rc1.d/, /etc/rc2.d/, /etc/rc3.d/, and so on. When an application has a start-up script associated with it, that script is placed in the /etc/init.d/ directory and then symbolically linked to a file in each /etc/rc?.d/ directory.
Scripts linked to each /etc/rc?.d directory begin with either the letter K or S, followed by two numbers and the service name. A script beginning with K indicates that the service should be stopped, while one beginning with an S indicates it should be started. The two numbers that follow indicate the order in which the service is started. Here are a few files you might find in the /etc/rc3.d/ directory, which are set to start up (with a description of each to the right):
This example of a few services started from the /etc/rc3.d directory should give you a sense of the order in which processes boot up when you enter runlevel 3. Notice that the sysstat service (which gathers system statistics) and the iptables service (which creates the system's firewall) are both started before the networking interfaces are started. Those are followed by rsyslog (system logging service) and then the various networked services.
By the time the runlevel scripts start, you should already have a system that is basically up and running. Unlike some other Linux systems that start all the scripts for runlevel 1, then 2, then 3, and so on, RHEL goes right to the directory that represents the runlevel, first stopping all services that begin with K and starting all those that begin with S in that directory.
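The K and S naming convention is easy to decode with a few lines of shell. The sketch below splits a link name into its action, order number, and service name. The function name describe_rc_link is hypothetical, and the only link name taken from this chapter is S99local; the order numbers on your own system may differ.

```shell
# Sketch: split an rc link name such as S99local into its three parts.
describe_rc_link() {
    name=$1
    case $name in
        S[0-9][0-9]?*) action=start ;;   # S means the service is started
        K[0-9][0-9]?*) action=stop ;;    # K means the service is stopped
        *) echo "$name: not an rc link name" >&2; return 1 ;;
    esac
    order=$(printf '%s' "$name" | cut -c2-3)    # two-digit ordering number
    service=$(printf '%s' "$name" | cut -c4-)   # rest of the name is the service
    echo "$action $order $service"
}

# Example use: describe every link in the runlevel 3 directory.
#   for f in /etc/rc3.d/*; do describe_rc_link "$(basename "$f")"; done
```

Because the shell expands /etc/rc3.d/* in sorted order, the loop in the comment lists the K links first and then the S links in ascending order number, which matches the stop-then-start sequence described above.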
As each S script runs, you should see a message saying whether or not the service started. Here are some things that might go wrong during this phase of system startup:
If you cannot get past a hanging service, you can reboot into an interactive startup mode, where you are prompted before starting each service. To enter interactive startup mode in RHEL, reboot and interrupt the boot loader (press any key when you see the 5 second countdown). Highlight the entry you want to boot and type e. Highlight the kernel line and type e. Then add the word confirm to the end of the kernel line, press Enter, and type b to boot the new kernel.
Figure 21.1 shows an example of the messages that appear when RHEL boots up in interactive startup mode.
Most messages shown in Figure 21.1 are generated from rc.sysinit.
After the Welcome message, udev starts (to watch for new hardware that is attached to the system and load drivers as needed). The hostname is set, Logical Volume Management (LVM) volumes are activated, all filesystems are checked (with the added LVM volumes), any filesystems not yet mounted are mounted, the root filesystem is remounted read-write, and any LVM swaps are enabled. Refer to Chapter 12 for further information on LVM and other partition and file system types.
The last “Entering interactive startup” message tells you that rc.sysinit is done and the services for the selected runlevel are ready to start. Because the system is in interactive mode, a message appears asking if you want to start the first service (sysstat). Type Y to start that service and go to the next one.
Once you see the broken service requesting to start, type N to keep that service from starting. If, at some point, you feel the rest of the services are safe to start, type C to continue starting the rest of the services. Once your system comes up, with the broken services not started, you can go back and try to debug those individual services.
One last comment about start-up scripts: The /etc/rc.local file is one of the last scripts to run at each runlevel. As an example, in runlevel 5 it is linked to /etc/rc5.d/S99local. Any command you want to run every time your system starts up can be put in the rc.local file. You might use rc.local to send an e-mail message or run a quick iptables firewall rule when the system starts. In general, it's better to use an existing startup script or create a new one yourself (so you can manage the command or commands as a service). Still, the rc.local file is a quick and easy way to get some commands to run each time the system boots.
After the last service starts, you are presented with a login screen. Your system should be fully operational. Start using the system or continue troubleshooting as needed. The next section describes how to troubleshoot issues that can arise with your software packages.
Software packaging facilities (such as yum for RPM and apt-get for DEB packages) are designed to make it easier for you to manage your system software. (See Chapter 10, “Getting and Managing Software,” for the basics on how to manage software packages.) Despite efforts to make it all work, however, sometimes software packaging can break.
The following sections describe some common problems you can encounter with RPM packages on a RHEL or Fedora system and how you can overcome those problems.
You try to install or upgrade a package using the yum command, and error messages tell you that the dependent packages you need to do the installation you want are not available. This can happen on a small scale (when you try to install one package) or a grand scale (where you are trying to update or upgrade your entire system).
Because of the short release cycles and larger repositories of Fedora and Ubuntu, inconsistencies in package dependencies are more likely to occur than they are in smaller, more stable repositories (such as those offered by Red Hat Enterprise Linux). To avoid dependency failures, here are a few good practices you can follow:
Once you encounter a dependency problem, there are a few things you can do to try to resolve the problem:
# yum -y --exclude=somepackage update
# min hour day/month month day/week command
59 23 * * * yum -y update | mail root@localhost
Information about all the RPM packages on your system is stored in your local RPM database. Although it happens much less often than it did with earlier releases of Fedora and RHEL, it is possible for the RPM database to become corrupted. This stops you from installing, removing, or listing RPM packages.
If you find that your rpm and yum commands are hanging or failing and returning an rpmdb open fails message, you can try rebuilding the RPM database. To verify that there is a problem in your RPM database, you can run the yum check command. Here is an example of what the output of that command looks like with a corrupted database:
# yum check
error: db4 error(11) from dbenv->open: Resource temporarily unavailable
error: cannot open Packages index using db4 - Resource temporarily unavailable (11)
error cannot open Packages database in /var/lib/rpm
CRITICAL:yum.main:
Error: rpmdb open fails
The RPM database and other information about your installed RPM packages are stored in the /var/lib/rpm directory. You can remove the database files that begin with __db* and rebuild them from the metadata stored in other files in that directory.
Before you start, it's a good idea to back up the /var/lib/rpm directory. Then you need to remove the old __db* files and rebuild them. Type the following commands to do that:
# cp -r /var/lib/rpm /tmp
# cd /var/lib/rpm
# rm __db*
# rpm --rebuilddb
New __db* files should appear after a few seconds in that directory. Try a simple rpm or yum command to make sure the databases are now in order.
Just as RPM has databases of locally installed packages, the Yum facility stores information associated with Yum repositories in the local /var/cache/yum directory. Cached data includes metadata, headers, packages, and yum plug-in data.
If there is ever a problem with the data cached by yum, you can clean it out. The next time you run a yum command, necessary data is downloaded again. Here are some reasons for cleaning out your yum cache:
# yum clean all
At this point, your system will start picking up up-to-date information from repositories the next time a yum command is run.
The next section covers information about network troubleshooting.
With more and more of the information, images, video, and other content we use every day now available outside of our local computers, a working network connection is required on almost every computer system. So, if you drop your network connection or can't reach the systems you want to communicate with, it's good to know that there are many tools in Linux for looking at the problem.
For client computers (laptops, desktops, and handheld devices), you want to connect to the network to reach other computer systems. On a server, you want your clients to be able to reach you. The following sections describe different tools for troubleshooting network connectivity for Linux client and server systems.
You open your web browser, but are unable to get to any website. You suspect that you are not connected to the network. Maybe the problem is with name resolution, but it may be with the connection outside of your local network.
To check if your outgoing network connections are working, you can use many of the commands described in Chapter 14, “Administering Networking.” You can test connectivity using a simple ping command. To see if name-to-address resolution is working, use host and dig.
The following sections cover problems you can encounter with network connectivity for outgoing connections and what tools to use to uncover the problems.
To see the status of your network interfaces, use the ip command. The following output shows that the loopback interface (lo) is up (so you can run network commands on your local system), but eth0 (your first wired network card) is down (state DOWN). If the interface had been up, there would be an inet line showing the IP address of the interface. Here, only the loopback interface has an inet address (127.0.0.1).
# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 state DOWN qlen 1000
    link/ether f0:de:f1:28:46:d9 brd ff:ff:ff:ff:ff:ff
For a wired connection, make sure your computer is plugged into the port on your network switch. If you have multiple NICs, make sure the cable is plugged into the correct one. If you know the name of a network interface (eth0, p4p1, or other), to find which NIC is associated with the interface, type ethtool -p eth0 from the command line and look behind your computer to see which NIC is blinking (Ctrl+C stops the blinking). Plug the cable into the correct port.
If, instead of seeing an interface that is down, the ip command shows no interface at all, check that the hardware isn't disabled. For a wired NIC, the card may not be fully seated in its slot or the NIC may have been disabled in the BIOS.
On a wireless connection, you may click on the NetworkManager icon and not see an available wireless interface. Again, it could be disabled in the BIOS. However, on a laptop, check to see if there is a tiny switch that disables the NIC. I've seen several people shred their networking configurations only to find that this tiny switch on the front or side of their laptops had been switched to the off position.
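On a system with several interfaces, it can help to have the shell pick out the ones that are down rather than reading the full listing. Here is a minimal sketch that reads ip addr show output on standard input and prints the name of each interface whose state is DOWN. The function name list_down_interfaces is my own, and the sketch assumes the output layout shown earlier in this section.

```shell
# Sketch: print the names of interfaces reported as "state DOWN".
# Reads the output of "ip addr show" on standard input.
list_down_interfaces() {
    awk '/^[0-9]+: / {
        name = $2
        sub(/:$/, "", name)            # strip the trailing colon from "eth0:"
        for (i = 3; i <= NF; i++)
            if ($i == "state" && $(i + 1) == "DOWN")
                print name
    }'
}

# Example use:
#   ip addr show | list_down_interfaces
```

If the function prints nothing, either every interface is up or the interface you expected is missing entirely, in which case the hardware checks described above are the next thing to try.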
If your network interface is up, but you still can't reach the host you want to reach, try checking the route to that host. Start by checking your default route. Then try to reach the local network's gateway device to the next network. Finally, try to ping a system somewhere on the Internet:
# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.0.0     *               255.255.255.0   U     2      0        0 eth0
default         192.168.0.1     0.0.0.0         UG    0      0        0 eth0
The default line shows that the default gateway (UG) is at address 192.168.0.1 and that the address can be reached over the eth0 card. Because there is only the eth0 interface here and only a route to the 192.168.0.0 network is shown, all communication not addressed to a host on the 192.168.0.0/24 network is sent through the default gateway (192.168.0.1). The default gateway is more properly referred to as a router.
To make sure you can reach your router, try to ping it. For example:
# ping -c 2 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(84) bytes of data.
From 192.168.0.105 icmp_seq=1 Destination Host Unreachable
From 192.168.0.105 icmp_seq=2 Destination Host Unreachable
--- 192.168.0.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss
The “Destination Host Unreachable” message tells you that the router is either turned off or not physically connected to you (maybe the router isn't connected to the switch you share). If the ping succeeds and you can reach the router, the next step is to try an address beyond your router.
Try to ping a widely accessible IP address. For example, the IP address for the Google public DNS server is 8.8.8.8. Try to ping that (ping -c2 8.8.8.8). If that ping succeeds, your network is probably fine, and it is most likely your hostname-to-address resolution that is not working properly.
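The three checks just described (default route, gateway, outside address) can be strung together in a short script. What follows is a sketch under a few assumptions: the function names default_gateway and check_path are hypothetical, the route command from net-tools is installed, and 8.8.8.8 is used only because it is the example address from the text.

```shell
# Sketch: pull the default gateway address out of "route" or "route -n" output.
default_gateway() {
    # The default route has "default" (or 0.0.0.0 with -n) as its destination
    # and a G in the Flags column; the gateway is the second field.
    awk '($1 == "default" || $1 == "0.0.0.0") && $4 ~ /G/ { print $2; exit }'
}

# Sketch: walk outward one step at a time and stop at the first failure.
check_path() {
    gw=$(route -n | default_gateway)
    [ -n "$gw" ] || { echo "no default route found"; return 1; }
    ping -c 2 "$gw" > /dev/null || { echo "cannot reach gateway $gw"; return 1; }
    ping -c 2 8.8.8.8 > /dev/null || { echo "gateway OK, but 8.8.8.8 unreachable"; return 1; }
    echo "routing looks fine; check name resolution next"
}

# Example use (as root or a regular user):
#   check_path
```

Stopping at the first failed step tells you how far your packets get, which is usually enough to decide whether the problem is on your machine, on your local network, or beyond your router.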
If you can reach a remote system, but the connection is very slow, you can use the traceroute command to follow the route to the remote host. For example, this command shows each hop taken en route to www.google.com:
# traceroute www.google.com
The output will show the time taken to make each hop along the way to the Google site. Instead of traceroute, you can use the mtr command (yum install mtr) to watch the route taken to a host. With mtr, the route is queried continuously, so you can watch the performance of each leg of the journey over time.
If you cannot reach remote hosts by name, but you can reach them by pinging IP addresses, your system is having a problem with hostname resolution. Systems connected to the Internet do name-to-address resolution by communicating to a domain name system (DNS) server that can provide them with the IP addresses of the requested hosts.
The DNS server your system uses can be entered manually or picked up automatically from a DHCP server when you start your network interfaces. In either case, the names and IP addresses of one or more DNS servers will end up in your /etc/resolv.conf file. Here is an example of that file:
search example.com
nameserver 192.168.0.254
nameserver 192.168.0.253
When you ask to connect to a hostname in Fedora or Red Hat Enterprise Linux, the /etc/hosts file is searched; then the first name server entry in resolv.conf is queried; then each subsequent name server is queried. If a hostname you ask for is not found, all those locations are checked before you get some sort of “Host Not Found” message. Here are some ways of debugging name-to-address resolution:
# host www.google.com 192.168.0.254
# dig @192.168.0.254 www.google.com
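Because the name servers in resolv.conf are tried in order, a single broken entry near the top can slow down or break every lookup. A quick way to find it is to query each listed server directly. The sketch below does that; the function names list_nameservers and query_each_nameserver are my own inventions, and the only dig option used is +short, which trims the answer down to the addresses returned.

```shell
# Sketch: print the name server addresses listed in a resolv.conf file.
list_nameservers() {
    # $1 is the file to read; it defaults to /etc/resolv.conf
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Sketch: ask each listed name server, in order, to resolve one hostname.
query_each_nameserver() {
    # $1 = hostname to look up, $2 = optional path to a resolv.conf file
    for ns in $(list_nameservers "$2"); do
        echo "== $ns =="
        dig +short "@$ns" "$1"
    done
}

# Example use:
#   query_each_nameserver www.google.com
```

A server that prints no addresses under its heading, or that makes dig pause and time out, is the one to remove from resolv.conf or report to whoever runs it.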
The procedures just described for checking your outgoing network connectivity apply to any type of system, whether it is a laptop, desktop, or server. For the most part, incoming connections are not an issue with laptops or desktops because most requests are simply denied. However, for servers, the next section describes ways of making your server accessible if clients are having trouble reaching the services you provide from that server.
If you are troubleshooting network interfaces on a server, there are different considerations than on a desktop system. Because most Linux systems are configured as servers, you should know how to troubleshoot problems encountered by those who are trying to reach your Linux servers.
I'll start with the idea of having an Apache web server (httpd) running on your Linux system, but no web clients are able to reach it. The following sections describe the things you can try to see where the problem is.
To be a public server, your system's hostname should be resolvable so any client on the Internet can reach it. That means locking down your system to a particular, public IP address and registering that address with a public DNS server. You can use a domain registrar (such as http://www.networksolutions.com) to do that.
When clients cannot reach your website by name from their web browsers, if the client is a Linux system, you can go through ping, host, traceroute, and other commands described in the previous section to track down the connectivity problem. Windows systems have their own version of ping that you can use from those systems.
If the name-to-address resolution is working to reach your system and you are able to ping your server from the outside, the next thing to try is the availability of the service.
From a Linux client, you can check if the service you are looking for (in this case httpd) is available from the server. One way to do that is using the nmap command.
The nmap command is a favorite tool for system administrators checking for various kinds of information on networks. However, it is a favorite cracker tool as well because it can scan servers, looking for potential vulnerabilities. So, it is fine to use nmap to scan your own systems to check for problems. But know that using nmap on another system is like checking the doors and windows on someone's house to see if you can get in. You will look like an intruder.
Checking your own system to see what ports to your server are open to the outside world (essentially, checking what services are running) is perfectly legitimate and easy to do. Once nmap is installed (yum install nmap), use your system hostname or IP address to use nmap to scan your system to see what is running on common ports:
# nmap 192.168.0.119
Starting Nmap 5.21 ( http://nmap.org ) at 2012-06-16 08:27 EDT
Nmap scan report for spike (192.168.0.119)
Host is up (0.0037s latency).
Not shown: 995 filtered ports
PORT    STATE SERVICE
21/tcp  open  ftp
22/tcp  open  ssh
80/tcp  open  http
443/tcp open  https
631/tcp open  ipp
MAC Address: 00:1B:21:0A:E8:5E (Intel Corporate)
Nmap done: 1 IP address (1 host up) scanned in 4.77 seconds
The preceding output shows that TCP ports are open to the regular (http) and secure (https) web services. When you see that the state is open, that indicates there is a service listening on the port as well. If you get to this point, it means your network connection is fine and that you should direct your troubleshooting efforts to how the service itself is configured (for example, you might look in /etc/httpd/conf/httpd.conf to see if there are specific hosts allowed or denied access).
If TCP ports 80 and/or 443 are not shown, it means they are being filtered. You need to check if your firewall is blocking (not accepting packets to) those ports. If the port is not filtered, but the state is closed, it means that the httpd service either isn't running or is not listening on those ports. The next step is to log into the server and check those issues.
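If you scan a server regularly, you may want the shell to report the state of one port instead of reading the whole table each time. This sketch reads nmap output on standard input and prints the state of the TCP port you name. The function name port_state is hypothetical, and "not shown" is my own label for a port that is missing from the table.

```shell
# Sketch: report the state nmap gives for one TCP port.
# $1 is the port number; nmap output arrives on standard input.
port_state() {
    awk -v port="$1" '
        $1 == (port "/tcp") { print $2; found = 1 }   # e.g. "80/tcp open http"
        END { if (!found) print "not shown" }         # port absent from table
    '
}

# Example use, scanning your own server only:
#   nmap 192.168.0.119 | port_state 80
```

A result of open points you toward the service's own configuration, closed means nothing is listening on that port, and not shown means you should read the "Not shown" summary line in the scan (filtered, in the earlier example) and then look at the firewall.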
From your server, you can use the iptables command to list the filter table rules that are in place. Here is an example:
# iptables -vnL
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target  prot opt in  out  source      destination
...
    0     0 ACCEPT  tcp  --  *   *    0.0.0.0/0   0.0.0.0/0    state NEW tcp dpt:80
    0     0 ACCEPT  tcp  --  *   *    0.0.0.0/0   0.0.0.0/0    state NEW tcp dpt:443
...
There should be firewall rules like the two shown in the preceding code among your other rules. If there aren't, add those rules to the /etc/sysconfig/iptables file. Here are examples of what those rules might look like:
-A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 443 -j ACCEPT
With the rules added to the file, clear out all of your firewall rules (systemctl stop iptables.service or service iptables stop), and then start them again (systemctl start iptables.service or service iptables start).
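After restarting the firewall, you may want to confirm that the rules really are in place. The following is a rough sketch rather than a complete check: has_accept_rule is a hypothetical helper that reads iptables -vnL output and looks only for a single-port ACCEPT rule of the kind shown above. It does not consider rule order, port ranges, or multiport rules.

```shell
#!/bin/sh
# has_accept_rule (hypothetical helper): read `iptables -vnL` output on stdin
# and succeed if a TCP ACCEPT rule for destination port $1 is listed.
has_accept_rule() {
    awk -v want="dpt:$1" '
        # In -vnL output, column 3 is the target and column 4 the protocol.
        $3 == "ACCEPT" && $4 == "tcp" {
            for (i = 5; i <= NF; i++) if ($i == want) found = 1
        }
        END { exit !found }
    '
}

# Sample data modeled on the listing shown earlier in this section.
sample='Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source    destination
    0     0 ACCEPT tcp  --  *  *   0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:80
    0     0 ACCEPT tcp  --  *  *   0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:443'

for port in 80 443 8080; do
    if printf '%s\n' "$sample" | has_accept_rule "$port"; then
        echo "port $port: ACCEPT rule found"
    else
        echo "port $port: no ACCEPT rule"
    fi
done
```

As root on the server itself, the same helper could read live rules with iptables -vnL INPUT | has_accept_rule 80.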
If the firewall is still blocking client access to the web server ports, here are a few things to check in your firewall:
If the port is no longer filtered but its state is closed, check that the service itself is running and listening on the appropriate interfaces.
If there seems to be nothing blocking client access to your server through the actual ports providing the service you want to share, it is time to check the service itself. Assuming the service is running (type service httpd status to check), the next thing to check is that it is listening on the proper ports and network interfaces.
The netstat command is a great general-purpose tool for checking network services. The following command lists the names and process IDs (p) of all processes that are listening (l) for TCP (t) and UDP (u) services, showing addresses and port numbers in numeric form (n). The output is piped through grep to show only the lines associated with the httpd process:
# netstat -tupln | grep httpd
tcp   0   0 :::80     :::*     LISTEN   2567/httpd
tcp   0   0 :::443    :::*     LISTEN   2567/httpd
The previous example shows that the httpd process is listening on ports 80 and 443 for all interfaces. It is possible for the httpd process to listen on selected interfaces only. For example, if the httpd process were listening only on the local interface (127.0.0.1) for HTTP requests (port 80), the entry would look as follows:
tcp   0   0 127.0.0.1:80   0.0.0.0:*   LISTEN   2567/httpd
For httpd, as well as for other network services that listen for requests on network interfaces, you can edit the service's main configuration file (in this case, /etc/httpd/conf/httpd.conf) to tell it to listen on port 80 for all addresses (Listen 80) or a specific address (Listen 192.168.0.100:80).
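To make the difference between listening on all interfaces and on one address easier to spot, the check can be sketched in shell. This is a minimal sketch under stated assumptions: listen_scope is a hypothetical helper, it expects netstat -tupln style lines on stdin, and it looks only at TCP sockets in the LISTEN state.

```shell
#!/bin/sh
# listen_scope (hypothetical helper): read `netstat -tupln` output on stdin and
# report which local address each listener on TCP port $1 is bound to.
listen_scope() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            n = split($4, parts, ":")      # column 4 is the local address
            if (parts[n] == port) {
                # Strip the trailing ":port" to leave just the address.
                addr = substr($4, 1, length($4) - length(parts[n]) - 1)
                if (addr == "::" || addr == "0.0.0.0")
                    print "all interfaces"
                else
                    print addr " only"
            }
        }
    '
}

# Sample data copied from the netstat output shown earlier in this section.
sample='tcp 0 0 :::80  :::* LISTEN 2567/httpd
tcp 0 0 :::443 :::* LISTEN 2567/httpd'

printf '%s\n' "$sample" | listen_scope 80
printf 'tcp 0 0 127.0.0.1:80 0.0.0.0:* LISTEN 2567/httpd\n' | listen_scope 80
```

With the sample data, the first call reports all interfaces and the second reports 127.0.0.1 only. On a live server you would run netstat -tupln | listen_scope 80 as root.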
Troubleshooting performance problems on your computer is one of the most important, although often elusive, tasks you need to complete. Maybe you have a system that was working fine, but begins to slow down to a point where it is practically unusable. Maybe applications begin to just crash for no apparent reason. Finding and fixing the problem may take some detective work.
Linux comes with many tools for watching activities on your system and figuring out what is happening. Using a variety of Linux utilities, you can do such things as find out which processes are consuming large amounts of memory or placing high demands on your processors, disks, or network bandwidth. Solutions can include:
To troubleshoot performance problems in Linux, you use some of the basic tools for watching and manipulating processes running on your system. Refer to Chapter 6, “Managing Running Processes,” if you need details on commands such as ps, top, kill, and killall. In this section, I add commands such as memstat to dig a little deeper into what processes are doing and where things are going wrong.
The most complex area of troubleshooting in Linux relates to managing virtual memory. The next sections describe how to view and manage virtual memory.
Computers have ways of storing data permanently (hard disks) and temporarily (Random Access Memory, or RAM, and swap space). Think of yourself as a CPU, working at a desk trying to get your work done. You would put data that you want to keep permanently in a filing cabinet across the room (that's like hard disk storage). You would keep information that you are currently working with on your desk (that's like RAM on a computer).
Swap space is a way of extending RAM. It is really just a place to put temporary data that doesn't fit in RAM but is expected to be needed by the CPU at some point. Although swap space is on the hard disk, it is not a regular Linux filesystem in which data is stored permanently. Think of swap space as a file cabinet drawer in which information is held in a miscellaneous bin, where it can either be sorted into the permanent file cabinets or brought back to be used on the desk.
Compared to disk storage, Random Access Memory has the following attributes:
It is important to understand the difference between temporary (RAM) and permanent (hard disk) storage, but that doesn't tell the whole story. If the demand for memory exceeds the supply of RAM, the kernel can temporarily move data out of RAM to an area called swap space.
If we revisit the desk analogy, this would be like saying, “There is no room left on my desk, yet I have to add more papers to it for the projects I'm currently working on. Instead of storing papers I'll need soon in a permanent file cabinet, I'll have one special file cabinet (like a desk drawer) to hold those papers I'm still working with, but that I'm not ready to store permanently or throw away.”
Refer to Chapter 12, “Managing Disks and Filesystems,” for more information on swap files and partitions and how to create them. For the moment, however, there are a few things you should know about these kinds of swap areas and when they are used:
The rule of thumb has always been that swapping is bad and should be avoided. However, there are some who would argue that in certain cases, more aggressive swapping can actually improve performance.
Think of the case where you open a document in a text editor and then minimize it on your desktop for several days as you work on different tasks. If data from that document were swapped out to disk, more RAM would be available for more active applications that could put that space to better use. The performance hit would come the next time you needed to access the data from the edited document and the data was swapped in from disk to RAM. The settings that relate to how aggressively a system will swap are referred to as swappiness.
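On Linux, the current swappiness value can be read from /proc/sys/vm/swappiness (the same setting that sysctl vm.swappiness reports), with higher values making the kernel more willing to swap. The following sketch simply prints that value. The show_swappiness name and its optional file argument are hypothetical conveniences, not part of any standard tool.

```shell
#!/bin/sh
# show_swappiness (hypothetical helper): print the current swappiness setting.
# The optional argument names a different file to read, which is mainly
# useful for trying the function out without touching /proc.
show_swappiness() {
    file=${1:-/proc/sys/vm/swappiness}
    if [ -r "$file" ]; then
        echo "vm.swappiness = $(cat "$file")"
    else
        echo "cannot read $file" >&2
        return 1
    fi
}

# Print the live value; do not abort if /proc is unavailable.
show_swappiness || true
```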
As much as possible, Linux wants to make everything that an open application needs immediately available. So, using the desk analogy, if I am working on nine active projects and there is space on the desk to hold the information I need for all nine projects, why not leave them all within reach on the desk? Following that same way of thinking, the kernel sometimes keeps libraries and other content in RAM that it thinks you might eventually need, even if a process is not looking for it immediately.
The fact that the kernel is inclined to store information in RAM that it expects may be needed soon (even if it is not needed now) can cause an inexperienced system administrator to think that the system is almost out of RAM and that processes are about to start failing. That is why it is important to know the different kinds of information being held in memory—so you can tell when real out-of-memory situations can occur. The problem is not just running out of RAM; it is running out of RAM when only non-swappable data is left.
Keep this general overview of virtual memory (RAM and swap) in mind, as the next section describes ways to go about troubleshooting issues related to virtual memory.
Let's say that you are logged into a Linux desktop, with lots of applications running, and everything begins to slow down. To find out if the performance problems have occurred because you have run out of memory, you can try commands such as top and ps to begin looking for memory consumption on your system.
To run the top command to watch for memory consumption, type top and then type a capital M. Here is an example:
# top
top - 22:48:24 up 3:59, 2 users, load average: 1.51, 1.37, 1.15
Tasks: 281 total, 2 running, 279 sleeping, 0 stopped, 0 zombie
Cpu(s): 16.6%us, 3.0%sy, 0.0%ni, 80.3%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem:  3716196k total, 2684924k used, 1031272k free, 146172k buffers
Swap: 4194296k total,       0k used, 4194296k free, 784176k cached
  PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 6679 cnegus  20  0 1665m 937m  32m S  7.0 25.8  1:07.95 firefox
 6794 cnegus  20  0  743m 181m  30m R 64.8  5.0  1:22.82 npviewer.bin
 3327 cnegus  20  0 1145m 116m  66m S  0.0  3.2  0:39.25 soffice.bin
 6939 cnegus  20  0  145m  71m  23m S  0.0  2.0  0:00.97 acroread
 2440 root    20  0  183m  37m  26m S  1.3  1.0  1:04.81 Xorg
 2795 cnegus  20  0 1056m  22m  14m S  0.0  0.6  0:01.55 nautilus
There are two lines (Mem and Swap) and four columns of information (VIRT, RES, SHR, and %MEM) relating to memory in the top output. In this example, the Mem line shows that RAM is not exhausted (only 2684924k of 3716196k is used), and the Swap line shows that nothing is being swapped to disk (0k used).
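Because the kernel keeps buffers and cached data in RAM that it can mostly give back when applications need the space, it can help to see how much memory is in use once those are set aside. The arithmetic below is a sketch that uses the figures from the Mem and Swap lines of the top output above. Treating all of the buffers and cached amounts as reclaimable is a simplifying assumption.

```shell
#!/bin/sh
# Figures copied from the Mem and Swap lines of the top output (in KB).
total=3716196
used=2684924
buffers=146172
cached=784176

# Memory in use once buffers and cache are set aside
# (assumes, for simplicity, that all of it could be reclaimed).
app_used=$((used - buffers - cached))

echo "used including buffers/cache: ${used}k ($((used * 100 / total))%)"
echo "used excluding buffers/cache: ${app_used}k ($((app_used * 100 / total))%)"
```

With these numbers, roughly 72 percent of RAM is in use overall, but only about 47 percent once buffers and cache are excluded.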
However, adding up just these first six lines of output in the VIRT column, you would see that 4937MB of memory has been allocated for those applications, which exceeds the 3629MB of total RAM (3716196k) that is available. That's because the VIRT column shows only the amount of memory that has been promised to the application. The RES column shows the amount of non-swapped physical memory that is actually being used, which totals only 1364MB.
Notice that, when you ask to sort by memory usage by typing a capital M, top orders processes by the memory they actually have resident (the %MEM column, which follows the RES column). The SHR column shows memory that could potentially be shared by other applications (such as libraries), and %MEM shows the percentage of total memory consumed by each application.
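The VIRT and RES totals quoted above can be reproduced with a short awk sketch. The sum_column helper is hypothetical, the sample lines are copied from the top output shown earlier, and the helper assumes every value in the chosen column carries an m (megabyte) suffix, as it does in this sample.

```shell
#!/bin/sh
# sum_column (hypothetical helper): add up one column of top's process lines.
# Assumes each value ends in "m" (megabytes), as in the sample below.
sum_column() {
    awk -v col="$1" '
        { v = $col; sub(/m$/, "", v); total += v }
        END { print total + 0 }
    '
}

# Process lines copied from the top output shown earlier in this section.
sample='6679 cnegus 20 0 1665m 937m 32m S 7.0 25.8 1:07.95 firefox
6794 cnegus 20 0 743m 181m 30m R 64.8 5.0 1:22.82 npviewer.bin
3327 cnegus 20 0 1145m 116m 66m S 0.0 3.2 0:39.25 soffice.bin
6939 cnegus 20 0 145m 71m 23m S 0.0 2.0 0:00.97 acroread
2440 root 20 0 183m 37m 26m S 1.3 1.0 1:04.81 Xorg
2795 cnegus 20 0 1056m 22m 14m S 0.0 0.6 0:01.55 nautilus'

echo "VIRT total (MB): $(printf '%s\n' "$sample" | sum_column 5)"
echo "RES total (MB):  $(printf '%s\n' "$sample" | sum_column 6)"
echo "Total RAM (MB):  $((3716196 / 1024))"
```

The script prints 4937 for VIRT, 1364 for RES, and 3629 for total RAM, matching the figures in the text.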
If you think that the system is reaching an out-of-memory state, here are a few things to look for:
In the short term, there are several things you can do to deal with this out-of-memory condition:
# echo 3 > /proc/sys/vm/drop_caches
If your Linux system becomes unbootable, your best option for fixing it is probably to go into rescue mode. To go into rescue mode, you bypass the Linux system installed on your hard disk and boot some rescue medium (such as a bootable USB key or boot CD). After the rescue medium boots, it tries to mount any filesystems it can find from your Linux system so you can repair any problems.
For many Linux distributions, the installation CD or DVD can serve as boot media for going into rescue mode. Here's an example of how to use a Fedora installation DVD to go into rescue mode to fix a broken Linux system:
Once you are in rescue mode, the portion of your filesystem that is not damaged will be mounted under the /mnt/sysimage directory. Change to that directory (cd /mnt/sysimage) and type ls to check that the files and directories from the hard disk are there.
Right now, the root of the filesystem (/) is from the filesystem that comes on the rescue medium. To troubleshoot your installed Linux system, however, you can type the following command:
# chroot /mnt/sysimage
Now the /mnt/sysimage directory becomes the root of your filesystem (/) so that it looks like the filesystem installed on your hard disk. Here are some things you can do to repair your system while you are in rescue mode:
When you are done fixing your system, type exit to leave the chroot environment and return to the filesystem layout that the rescue medium sees. If you are completely done, type reboot to restart your system. Be sure to remove the medium before the system restarts.
Troubleshooting problems in Linux can start from the moment you turn on your computer. Problems can occur with your computer BIOS, boot loader, or other parts of the boot process that you can correct by intercepting them at different stages of the boot process.
After the system has started, you can troubleshoot problems with software packages, network interfaces, or memory exhaustion. Linux comes with many tools for finding and correcting any part of the Linux system that might break down and need fixing.
The next chapter covers the topic of Linux security. Using the tools described in that chapter, you can provide access to those services you and your users need, while blocking access to system resources that you want to protect from harm.
The exercises in this section enable you to try out useful troubleshooting techniques in Linux. Because some of the techniques described here can potentially damage your system, I recommend you do not use a production system that you cannot risk damaging. See Appendix B for suggested solutions.
These exercises relate to troubleshooting topics in Linux. To do these exercises, you need to be able to reboot your computer and interrupt any work it may be doing.