Chapter 24
Troubleshooting Application and Hardware Issues

  • Objective 4.4: Given a scenario, analyze and troubleshoot application and hardware issues

A Linux system’s primary purpose is to serve. However, if one of its applications or the hardware it uses is not functioning properly, the system cannot fulfill its duty. Understanding common, and a few uncommon, problems with both applications and hardware will help you quickly resolve any issues.

Dealing with Storage Problems

Troubleshooting storage issues ranges from the easy-to-check items all the way to the strange and obscure. For example, if you just installed a drive, test connections to ensure they are tight. Disks that were previously working fine may suffer from degrading storage. These issues and more are covered in the following sections.

Exploring Common Issues

If you are fairly new to Linux system administration, most likely you are unaware of common storage problems. The following can help you prepare.

Degraded Storage/Mode Degraded storage refers to the storage medium’s gradual decay due to time or improper use, which causes data degeneration or loss. For example, an SSD has limited endurance due to its finite number of program/erase (PE) cycles. Thus, employing an SSD in your swap space is unwise.

Degraded mode refers to a situation in which one or more disks in a RAID array have failed. In this case, troubleshooting efforts require you to employ the mdadm -D command to view a particular array’s detailed status. If the state contains the word degraded, add another partition to the array so it can recover.
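The degraded-mode check described above can be sketched as a small shell helper. It reads mdadm -D output on standard input and succeeds when the State line mentions degraded. The function name array_is_degraded and the /dev/md0 array in the usage comment are illustrative assumptions, not names taken from this chapter.

```shell
# Succeeds (exit status 0) when the "State :" line of `mdadm -D` output,
# supplied on standard input, contains the word "degraded".
array_is_degraded() {
    grep -E '^[[:space:]]*State[[:space:]]*:' | grep -q 'degraded'
}

# Example use (hypothetical array name; needs super user privileges):
#   mdadm -D /dev/md0 | array_is_degraded && echo "array is degraded"
```

Reading from standard input keeps the helper usable on saved output as well as on a live array.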

Missing Devices Storage devices can go “missing” on Linux, but the cause varies. If it is network attached storage (NAS), check your network first (see Chapter 20).

If it is a locally attached device and other utilities, such as lsblk, are not displaying it, use super user privileges and try the lspci -M command. This command will perform a thorough scan of all PCI attached devices.

The conduit to Linux devices is through the device files, such as /dev/sdb. Ensure that the particular partition’s device file is available and not corrupted. If needed, rebuild it via the mknod command.

Check that you (or the utility configuration) are using the correct device file name. A whole disk is referred to by the device file name with no numbers, such as /dev/sdc. A disk partition is specified by the device file name and its number, like /dev/sdc2. When using an NVMe SSD, the device file name, such as /dev/nvme0n1p1, has extra items, including the namespace.
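To make the naming rules above concrete, here is a minimal sketch that classifies a device file name as a whole disk or a partition. The function name device_kind is hypothetical, and the patterns cover only the /dev/sd* and /dev/nvme* forms mentioned in this section; other device types would need their own patterns.

```shell
# Classify a device file name using the naming conventions described above.
# Only /dev/sd* and /dev/nvme* names are recognized in this sketch.
device_kind() {
    case "$1" in
        /dev/nvme*n*p[0-9]*) echo partition ;;  # e.g. /dev/nvme0n1p1
        /dev/nvme*n[0-9]*)   echo disk ;;       # e.g. /dev/nvme0n1
        /dev/sd*[0-9])       echo partition ;;  # e.g. /dev/sdc2
        /dev/sd*)            echo disk ;;       # e.g. /dev/sdc
        *)                   echo unknown ;;
    esac
}
```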

Missing Volumes Another form of a lost device is a missing volume. If you perform a pvscan on the physical devices that make up a logical volume and get a “Couldn’t find device” message, you’ve got a missing volume. Typically, the cause is a failed or unintentionally removed disk.

If a disk that was part of a logical volume’s group has failed, the missing disk’s UUID will display in the pvscan message. You can replace the failed volume (pvcreate), restore the group’s metadata (vgcfgrestore), recover the group (vgscan), and then activate it (vgchange) via LVM tools.

Missing Mount Points A “Mount point does not exist” error message implies the obvious—the directory on which you are attempting to mount the filesystem does not exist. It either was deleted or never created in the first place. Simply make it with appropriate privileges via the mkdir command.

However, this error message can also be generated for a not-so-obvious problem. It centers on employing the bind option, either at the command line via the mount command or in the /etc/fstab file. This option makes an already available directory tree accessible at a second mount point, so the same contents can be reached from both places. If the directory tree is not already available somewhere, you’ll get a “Mount point does not exist” error message.

Before removing a directory, check if it is a mount point. You can do that quickly by employing the mountpoint directory-name command.
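The two mount point habits above, creating a missing directory before mounting and checking a directory before removing it, might be wrapped up as follows. This is only a sketch; the function names ensure_mount_dir and remove_if_not_mountpoint are made up for illustration.

```shell
# Create the mount point directory if it does not exist yet.
ensure_mount_dir() {
    [ -d "$1" ] || mkdir -p "$1"
}

# Remove an empty directory only when it is not currently a mount point.
remove_if_not_mountpoint() {
    if mountpoint -q "$1"; then
        echo "$1 is a mount point; not removing it" >&2
        return 1
    fi
    rmdir "$1"
}
```

Using rmdir rather than rm -r means the helper also refuses to delete a directory that still contains files.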

Storage Integrity A bad block (also called a bad sector) is a small chunk of a disk drive that will not respond to I/O requests due to corruption or physical damage. A random bad block does not indicate a drive is failing, but these storage devices need monitoring, because increasing bad sectors indicate it needs replacing.

Besides using the fsck command (covered in Chapter 11), you can employ the badblocks utility to monitor a drive. It is different from fsck in that it focuses on a particular partition and does not perform any repairs. It is wise to back up and unmount a partition prior to checking for bad sectors. Use the non-destructive test, by issuing the badblocks -nsv partition-device-file command. The utility provides progress as it runs, and when the tests are complete, it issues a final bad blocks status.
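As a hedged sketch of the advice above, the following wrapper refuses to start the non-destructive badblocks test while the partition still appears in the mount table. The function names and the MOUNTS_FILE variable are assumptions introduced for this example, and the check only matches the exact device name listed in /proc/mounts, so a partition mounted under another name (for instance, a device mapper name) would not be caught.

```shell
# File listing mounted filesystems; overridable so the check can be tried
# against a sample file (the variable name is an assumption of this sketch).
MOUNTS_FILE=${MOUNTS_FILE:-/proc/mounts}

# Succeeds when the given device file appears in the first column.
partition_is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "$MOUNTS_FILE"
}

# Run the non-destructive read-write test only on an unmounted partition.
scan_partition() {
    if partition_is_mounted "$1"; then
        echo "$1 is mounted; unmount it before scanning" >&2
        return 1
    fi
    badblocks -nsv "$1"
}
```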

In addition, a disk’s manufacturer often provides its own set of testing programs. Typically these programs let you know whether or not to replace the drive but do not provide detailed data on bad sectors.

The dmesg command displays the kernel ring buffer, which can contain messages such as disk I/O errors. These are indicators of potential problems.

Performance Issues Poor storage performance adversely affects applications. Besides using utilities such as iostat, ioping, iotop, and sar (covered in Chapter 20) to monitor storage performance problems, you can also employ hdparm to determine a drive’s read speeds. This utility is useful for PATA or SATA drives. SCSI drives that have SCSI/ATA command translation are also supported.

The dstat utility is similar to iostat but provides additional helpful data for troubleshooting storage performance problems. For example, this tool displays throughput stats associated with network use or per individual LV drives.

Another handy utility that works specifically with logical volumes is the dmstats utility. This tool allows the setup and management of statistics for any devices charted by the device mapper. You can determine device mapper file names associated with logical volumes via the lsblk -p utility.

A GUI tool that gauges disk performance is the gnome-disks utility. However, back up any of the disk’s data prior to performing a write benchmark.

Resource Exhaustion Resource exhaustion is a situation in which a system’s finite resources are committed and unavailable to others. Running out of inode numbers or disk space (covered in Chapter 20) are two examples.

Threat agents can engage in a storage resource exhaustion attack via file descriptor leaks. A file descriptor is commonly used in programming languages to access a file, pipe, or network socket. You can prevent this attack type by setting the PAM nofile limit in the /etc/security/limits.conf file. (PAM was covered in Chapter 16.)
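To see what the nofile limit currently is for your shell, a helper along these lines may be useful. The function name show_nofile_limits, the appuser account, and the numeric values in the commented limits.conf entries are all hypothetical and serve only as an illustration.

```shell
# Print the soft and hard open-file (nofile) limits for the current shell.
show_nofile_limits() {
    echo "soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
}

# Illustrative /etc/security/limits.conf entries (hypothetical account
# name and values; the columns are domain, type, item, and value):
#   appuser  soft  nofile  1024
#   appuser  hard  nofile  4096
```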

Dealing with Specialized Issues

One of the first things you should check for an older storage device experiencing problems is whether or not the device’s manufacturer has a new driver or firmware available. Often this can resolve a tricky issue.

Another item to check is the device’s Linux module (driver). If it is not loaded or built into the Linux kernel, your device will not function. Start with the dmesg utility to gain some clues. A snipped example is shown in Listing 24.1.

Listing 24.1: Looking up disk information via the dmesg command

# dmesg | grep sde
[…]
 [    5.566479] sd 6:0:0:0: [sde] Attached SCSI disk

The dmesg utility’s output is searched using grep to find information concerning the sde disk (/dev/sde). The important clue here is that the disk is an attached SCSI disk.

The available SCSI disk driver information is stored within a /sys/ directory as shown in Listing 24.2.

Listing 24.2: Determining the driver via the ls and udevadm commands

# ls /sys/bus/scsi/drivers
sd  sr
#
# udevadm info -an /dev/sde | grep DRIVERS | grep sd
    DRIVERS=="sd"

Notice that the sd and sr drivers are used for SCSI devices. The udevadm command confirms which one is employed for the /dev/sde disk.

After the driver (module) is determined, use the lsmod command to see if it is currently loaded into the kernel. A snipped example is shown in Listing 24.3.

Listing 24.3: Determining if the module is loaded via the lsmod command

# lsmod | grep sd
sd_mod                 46322  5
[…]
# modinfo sd_mod
filename:       /lib/modules/3.10.0-862.11.6.el7.x86_64/kernel/drivers/scsi/sd_mod.ko.xz
[…]
description:    SCSI disk (sd) driver
[…]

If the module is not loaded, it may be built into the kernel. You can check this by looking at the modules.builtin file as shown snipped in Listing 24.4.

Listing 24.4: Determining if the module is built in via the cat command

$ cat /lib/modules/$(uname -r)/modules.builtin | grep sd_mod
kernel/drivers/scsi/sd_mod.ko

If the module is not loaded nor built into the kernel, dynamically load it using super user privileges and the modprobe command (Chapter 14).
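The loaded, built-in, or missing decision walked through in Listings 24.3 and 24.4 could be combined into one helper, roughly as follows. The function name module_status and its optional file arguments are assumptions made for this sketch. It reads /proc/modules, which is the same data lsmod displays, and it matches the module name exactly as given, so a name spelled with hyphens in one file and underscores in the other would be missed.

```shell
# Report whether a module is loaded, built into the kernel, or missing.
# Arguments: module name, optional loaded-modules file, optional
# modules.builtin file (defaults point at the running system).
module_status() {
    mod=$1
    loaded_list=${2:-/proc/modules}
    builtin_list=${3:-/lib/modules/$(uname -r)/modules.builtin}
    if grep -q "^${mod} " "$loaded_list" 2>/dev/null; then
        echo loaded
    elif grep -q "/${mod}\.ko" "$builtin_list" 2>/dev/null; then
        echo builtin
    else
        echo missing
    fi
}

# Example use: load the module only when it is missing (needs super user
# privileges):
#   [ "$(module_status sd_mod)" = missing ] && modprobe sd_mod
```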

Seeking SATA

An adapter is a piece of hardware that may have one or more software interfaces. The various storage interfaces, such as SATA drives, can have unique problems. On Linux, SATA drives are self-configuring. They are typically connected to the SCSI bus and are denoted by the /dev/sd* device files.

If you are using a Linux distro with a kernel version prior to 2.6.16 (released March 2006), be aware that SATA suspend and resume is not supported. The system will hang when the device is accessed after a resume operation. Fix this problem by adding SATA power management support via a kernel patch.

On Linux, some SATA devices may fail earlier than others due to frequent head loads and unloads. Often this is due to aggressive power management.

You can check for this situation on a SATA drive, if it uses self-monitoring analysis and reporting technology (SMART), via the smartctl -a command. Look at the Start_Stop_Count, which is the number of loads and unloads. For a particular disk, a high count compared to other drives is indicative of this problem. Double-check it via the hdparm -B command on the drive. If the command returns a low number, such as 1, then aggressive power management is confirmed. You can modify this by using super user privileges and typing in hdparm -B 127 device-filename, which not only removes the aggressive power management but also typically improves performance and extends the drive’s life.
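When comparing several drives, it can help to pull just the Start_Stop_Count raw value out of the smartctl -a output. The sketch below assumes the usual ten-column SMART attribute table, in which the raw value is the last column; the function name start_stop_count and the device names in the usage comment are illustrative.

```shell
# Print the raw Start_Stop_Count value from `smartctl -a` output supplied
# on standard input (assumes the raw value is the tenth column).
start_stop_count() {
    awk '$2 == "Start_Stop_Count" { print $10 }'
}

# Example use (hypothetical device names; needs super user privileges):
#   for d in /dev/sda /dev/sdb; do
#       echo "$d: $(smartctl -a "$d" | start_stop_count)"
#   done
```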

If you are using a virtual machine, the smartctl command will fail. This is due to virtualized disks not supporting SMART.

Comprehending SCSI

On Linux, the SCSI framework consists of three integral parts:

  • Upper: The device driver (for example, disk driver) layer
  • Middle: The SCSI routing layer
  • Lower: The host bus adapter (HBA) driver layer

The upper layer is closest to the application or user command, while the lower SCSI layer is right next to the actual hardware. The HBA is either a circuit board or an integrated circuit adapter, which connects to the disk drive. Just like device drivers, the HBA driver is either loaded or built into the kernel.

Problems can occur if either the HBA or device driver is not loaded or built into the kernel. Earlier, in Listing 24.3 and Listing 24.4, a check was done for a SCSI upper-layer driver. In Listing 24.5, a snipped example shows looking for the HBA driver (module) and checking whether or not it is loaded or built in.

Listing 24.5: Determining a module name and if it is loaded

# udevadm info -an /dev/sda | grep -i drivers
    DRIVERS=="sd"
[…]
    DRIVERS=="ahci"
[…]
#
# lsmod | grep ahci
ahci                   34056  3
[…]
#
# modinfo ahci
[…]
description:    AHCI SATA low-level driver
[…]

Notice that the HBA driver is the Advanced Host Controller Interface (AHCI) driver, and it is loaded into the kernel. This particular driver allows you to hot-plug SATA drives, which are treated as SCSI devices. In other words, the SATA drives are attached to the SCSI framework.

When you attach a SATA drive as a hot-plug SCSI device, you will need to enable it. This is accomplished by either rebooting the system or modifying the /sys/class/scsi_host/host#/scan file. The # is the drive’s SCSI host number. An example of determining the appropriate host number and modifying the file is shown snipped in Listing 24.6.

Listing 24.6: Enabling a hot-plugged SATA drive

# lsblk -S
NAME HCTL       TYPE VENDOR   MODEL             REV TRAN
[…]
sde  6:0:0:0    disk ATA      VBOX HARDDISK    1.0  sata
[…]
#
# echo '- - -' > /sys/class/scsi_host/host6/scan

In this example, the lsblk -S command only shows attached SCSI framework devices, and the SATA drive is sde (/dev/sde). The HCTL column in the output shows the device’s host number (the first number, prior to the first colon). In this case, the host number is 6. After the disk’s host number is determined, the characters '- - -' are echoed into the appropriate file (note the required spaces between each dash). This action forces the system to scan the device attached to the SCSI framework at that host number, which enables the drive.
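The two steps in Listing 24.6, extracting the host number from the HCTL value and writing to that host’s scan file, can be sketched as a pair of helpers. The function names and the SCSI_HOST_DIR variable are assumptions introduced here so the example can be tried against a scratch directory instead of the real /sys tree.

```shell
# Directory holding the per-host scan files; overridable for dry runs
# (the variable name is an assumption of this sketch).
SCSI_HOST_DIR=${SCSI_HOST_DIR:-/sys/class/scsi_host}

# Print the host number, which is the part of an HCTL value before the
# first colon.
hctl_host() {
    echo "${1%%:*}"
}

# Ask the given SCSI host to rescan its attached devices
# (needs super user privileges on a real system).
rescan_scsi_host() {
    echo '- - -' > "$SCSI_HOST_DIR/host$1/scan"
}

# Example use with the HCTL value shown in Listing 24.6:
#   rescan_scsi_host "$(hctl_host 6:0:0:0)"
```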

Moderating RAID

A Linux system can employ software and hardware RAID. Software RAID arrays are implemented through the Multiple Devices (md) driver. Check the status of your software RAID array by viewing the /proc/mdstat file.
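A quick way to scan /proc/mdstat for trouble is to look for an underscore in an array’s member status, such as [U_], which marks a missing member. The following is a minimal sketch under that assumption; the function name degraded_md_arrays is hypothetical.

```shell
# List md arrays whose member status (for example, [U_]) shows a missing
# device. Reads /proc/mdstat unless another file is given.
degraded_md_arrays() {
    awk '
        /^md/ { name = $1 }                 # remember the current array
        /\[[U_]*_[U_]*\]/ { print name }    # status contains an underscore
    ' "${1:-/proc/mdstat}"
}
```

An empty result suggests every listed array has all of its members; follow up with mdadm -D on any array that is printed.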

If it is SATA based and a drive goes offline, a software RAID array can hang. This occurs if the HBA does not handle hot-plug action. Thus, it is wise to check if your lower SCSI framework layer’s driver supports hot-plugging. If your HBA uses the AHCI module (driver), hot-plugging is allowed.

Hardware-based RAID arrays are managed via a hardware device connected to the Linux SCSI framework. A hardware RAID controller’s data, such as the manufacturer and model number, are obtained using super user privileges and entering lspci -knn | grep "RAID bus controller" at the command line. This is useful if you inherited a Linux system and need to obtain manufacturer utilities to troubleshoot and monitor a hardware-based RAID array.

Uncovering Application Permission Issues

A user notifies you that an application has issued an I/O error when they attempt to run it. The problem is possibly a permission issue. You will need to gather some information prior to starting your troubleshooting:

  1. Determine which account runs the application and the account’s name.
  2. Discover the specific program action that raised the error.
  3. Obtain a full directory reference for any files on which the application was attempting to perform reads/writes or for any files it was attempting to create.
  4. Record, if applicable, any additional applications it was trying to launch.
  5. Document, if applicable, any local or remote services the application is attempting to employ, such as NTP or a file server (Chapter 2).

If the application uses services, such as OpenSSH and/or an authentication server, it is important to know what service accounts are involved. Record this information as well.

When you have these details, you are ready to proceed in your troubleshooting process.

Ownership Look at the various application files involved using the ls -l command. Determine what username owns the files and the permissions granted to those owners. Don’t forget to look at the directory permissions as well. You’ll need to know the entire directory tree’s owners and permissions.

If the application is not run under a username that owns the file or the directory tree, you’ll need to go on to group memberships and possibly other permissions. File and directory permission troubleshooting was covered in Chapter 22, if you need a refresher.

Group Memberships Uncover the groups to which the end user running the application belongs. If the application is run under a different account, check that account’s group memberships.

When you have that information, you can check the application files involved. Identify the group permissions of those files as well as the directory tree to uncover any potential problems.

Executables If the application cannot be run by a particular account, check the execute privileges. Keep in mind that if the application kicks off additional programs, you will need to check the privileges for those as well.

If you are using a script that changes its present working directory and it fails, then check the directory tree it is trying to access. The execute privilege must be granted on every single directory within the tree in order for an account to change its present working directory to that particular location.
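To find where a directory tree blocks a change of working directory, you could walk the path and test each level, roughly as follows. The function name first_blocked_dir is made up for this sketch. It tests with the privileges of the account running it, it also reports a directory that simply does not exist, and the root account normally passes every check.

```shell
# Print the topmost directory in the given path that the current account
# cannot enter (execute privilege missing or directory absent).
# Succeeds silently when every level is accessible.
first_blocked_dir() {
    dir=$1
    blocked=""
    while [ "$dir" != "/" ] && [ "$dir" != "." ]; do
        [ -x "$dir" ] || blocked=$dir   # later (higher) levels overwrite
        dir=$(dirname "$dir")
    done
    [ -z "$blocked" ] && return 0
    echo "$blocked"
    return 1
}
```

Run it as the account that launches the script, for example through sudo -u with that account’s name.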

Inheritance If the application is creating files in a particular directory and can no longer access those files, check for forced inheritance via ACLs (covered in Chapter 15). If the directory has a default ACL, any files created within that directory that do not have ACLs set specifically for them will inherit their ACL from the directory. You can view default directory ACLs via the getfacl -d or --default command.

If you find that a directory’s default ACL is behind the problem, consider removing the default ACL and defining the needed ACL on the directory. Another alternative is to explicitly set the application file’s ACL, which will override the inherited directory default ACL. Employ the setfacl utility to enact these changes.
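A small check for whether a directory carries a default ACL can be built on the getfacl output, whose default entries begin with default:. The helper name has_default_acl, the /srv/app/data path, and the appuser account below are illustrative assumptions.

```shell
# Succeeds when getfacl output on standard input contains default ACL
# entries (lines that begin with "default:").
has_default_acl() {
    grep -q '^default:'
}

# Example use (hypothetical path and account):
#   getfacl /srv/app/data | has_default_acl && echo "default ACL present"
#   setfacl -k /srv/app/data                        # remove the default ACL
#   setfacl -m u:appuser:rw /srv/app/data/file.txt  # set an explicit ACL
```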

Exercise 24.1 Troubleshooting application permission issues

You can try out basic application permission problem troubleshooting via this simple activity.

  1. Log into your Linux system via a tty terminal, using a non-root account that can access super user privileges via the sudo command.
  2. At the command-line prompt, type touch /tmp/fileA.txt and press the Enter key. The touch command was covered in Chapter 3.
  3. Change the newly created file’s owner and group to root by typing in sudo chown root:root /tmp/fileA.txt and press Enter. If requested, enter the account’s sudo password. The sudo and chown commands were covered in Chapter 15.
  4. Next you will create a small application using the nano text editor (covered in Chapter 4). Type nano activity.sh and press Enter. This will put you into the nano text editor.
  5. Type in the following, pressing the Enter key as needed.

#!/bin/bash
echo "Creating file /tmp/Activity.txt…"
echo "Hello World" > /tmp/Activity.txt
echo "Removing file…"
rm -ir /tmp/*.*
exit

  6. Press Ctrl+O and then the Enter key to write out the text editor’s buffer to the activity.sh file.
  7. Press Ctrl+X to leave the text editor.
  8. Run the application by typing bash activity.sh and pressing the Enter key.
  9. When asked a question, type y at the prompt and press Enter. You should receive at least one error message relating to the attempt to delete the /tmp/fileA.txt file, but you may receive more error messages depending upon what files and directories are currently located in the /tmp/ directory.
  10. Now you can begin the troubleshooting process. Since you ran the application, record your user account’s name.
  11. Document the action that causes the error to occur. (Hint: It was associated with the /tmp/fileA.txt file.)
  12. Record the problem file’s full directory reference. (Hint: Look at the previous step.)
  13. Document that the application is not trying to launch any additional applications nor employing local or remote services.
  14. Display the problem file’s directory’s ownership, group membership, and various permissions by typing ls -ld /tmp and pressing Enter.
  15. Record the directory’s owner and its associated permissions.
  16. Document the directory’s group and its associated permissions.
  17. Record the directory’s other permissions. Note that if you see a t in the permissions, this refers to the sticky bit (covered in Chapter 15).
  18. Determine which of the three directory permission sets (owner, group, or other) would apply to your user account and record it.
  19. Using Table 22.1 in Chapter 22 and the information you uncovered in the last several steps, discover the cause of this application problem. Record your theory.
  20. The application’s problem was caused by using the /tmp/*.* file wildcard designation as the rm -ir command’s argument. If the sticky bit is set on the /tmp/ directory, your account can delete only the files in that directory that you own. Therefore, to fix the problem, if desired, change the rm -ir /tmp/*.* line in the activity.sh application to rm -i /tmp/Activity.txt instead.

Analyzing Application Dependencies

In Chapter 13, we covered using package management commands, such as apt-cache depends and yum deplist to display a repository-managed application’s dependencies. There are some special problems you can run into with programs and their various dependencies. The more common ones associated with the certification exam are examined here.

Versioning

Typically, application software programs and operating systems are continually updated. These updates may improve performance or add additional functionality. To keep track of the various application updates, a technique called versioning is employed. Versioning is the management of multiple application software updates through a numbering process. Different versions (releases) of an application have different numbers, which typically increase for newer application updates.

For example, the Linux kernel version 2.6.0 was released in December 2003. Current kernel versions can be found at www.kernel.org and have numerically higher numbers compared to the 2.6.0 version. These higher numbers indicate newer releases.

You can use versioning to determine if application updates or patches have been released. This is helpful when troubleshooting application software issues.

Updating Issues

If an application is experiencing problems, check for a new software update. If the application is available through a repository, use your distro’s particular package management to check for a new version (see Chapter 13).

Consider setting up a test system with the application environment and apply application updates to it. You can conduct thorough planned tests prior to updating production applications. Before applying any production application updates, ensure that you have an excellent backup. These two items should protect you from most bad situations.

If an application begins experiencing problems after a recent update, review the distro’s package management history information. For example, on a system using RPM, using super user privileges you can issue the command rpm -q package-name --last to see the latest update history for the package-name package. On a Debian system, check the /var/log/apt/history.log file. You’ll need to uninstall any installed packages or libraries causing the problem using the appropriate package management utility (see Chapter 13).

On modern Ubuntu distro versions, unattended upgrades are configured. This allows automatic security upgrades to software and requires no human intervention. If you desire to turn this off, change the APT::Periodic::Update-Package-Lists directive in the /etc/apt/apt.conf.d/10periodic file from 1 to 0. Find out more about this feature by typing in man unattended-upgrade at the command line.

When one software package depends upon another package or library to operate properly, it is called a dependency. A broken dependency (also called an unmet dependency) is an undesirable situation, where a software package has been installed but one or more of its needed packages or libraries are not installed. Sometimes a package upgrade can break dependencies, resulting in what is called a broken package.

To check for broken dependencies, on a Debian package management system, use the apt-get check command. The YUM package manager will not update programs that cause a broken dependency, but if you installed a program using the rpm utility that caused problems, you can issue the rpm -aV command on any Red Hat package managed system to see damaged software packages.

Patching

A patch refers to program changes or configuration file updates for a particular application or system service. Patches may correct serious problems or fix security vulnerabilities and are often issued out of the normal software update cycle. Patching is the act of applying a patch. It does not necessarily involve updating all your system’s software. There are many conflicting theories on patch management, but at the heart of the issue is keeping your applications and Linux system running smoothly, safely, and effectively for your users.

A kernel patch release is a little different. It is a special source code package that only contains the changes applied to the major kernel source code release. You just download the patch source code package, use the Linux patch command to apply the patch updates to the existing kernel source code files on your system, and then recompile the kernel. Typically, your package manager handles this for you when you update packages (software) on the system.

Dealing with Libraries

Application functions are often split into separate library files (shared libraries) so that multiple applications that use the same functions can share these library files. Libraries were first covered in Chapter 13.

If an application begins experiencing problems after a software upgrade, it may be related to a recently upgraded shared library the application employs. You can check which libraries a program uses by typing in ldd program-name at the command line. It is helpful to redirect this command’s output into another file. Use the grep command to search package management log files to determine if one of the application’s libraries was recently updated. An example is shown snipped in Listing 24.7.

Listing 24.7: Using ldd and grep to discover a recently upgraded library

$ which ssh
/usr/bin/ssh
$
$ ldd /usr/bin/ssh > lib.txt
$
$ cat lib.txt
[…]
        libk5crypto.so.3 […]
[…]
$
$ grep -B 2 -A 2 libk5crypto /var/log/apt/history.log
Start-Date: 2019-01-22  13:37:45
Commandline: /usr/bin/unattended-upgrade
Upgrade: libk5crypto3:amd64 (1.16-2build1, 1.16-2ubuntu0.1)
End-Date: 2019-01-22  13:37:49

$

The application used in this example is the OpenSSH command. Notice one of its libraries was recently upgraded. (The -B and -A options on the grep command allow you to pull additional lines before and after the found content, respectively.) If it began experiencing problems shortly after this library upgrade, you have a probable cause. Check to see if a new upgrade or patch is available for this library. If not, you may have to uninstall it and install an earlier version.
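The manual search in Listing 24.7 could be automated for every library a program uses, along these lines. The function names lib_names and upgraded_libs are hypothetical, and the sketch assumes the package name in the log contains the library’s base name (as libk5crypto3 does for libk5crypto.so.3), which will not hold for every package.

```shell
# Reduce ldd output on standard input to library base names,
# for example "libk5crypto.so.3 => ..." becomes "libk5crypto".
lib_names() {
    awk '{ print $1 }' | sed -e 's|.*/||' -e 's/\.so.*$//' | sort -u
}

# From the names on standard input, print those that appear on an
# "Upgrade:" line of the given package management log file.
upgraded_libs() {
    log=$1
    while IFS= read -r name; do
        if grep -q "^Upgrade:.*${name}" "$log"; then
            echo "$name"
        fi
    done
}

# Example use on a Debian-based system:
#   ldd /usr/bin/ssh | lib_names | upgraded_libs /var/log/apt/history.log
```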

Exploring Environment Variable Issues

If you have a newly installed application that is not executing, check the PATH environment variable. This particular variable determines what directories are searched for a program that the Bash shell does not directly handle. An example of displaying the variable’s contents is shown in Listing 24.8.

Listing 24.8: Viewing the PATH environment variable

$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
$

If you need to modify this parameter for everyone, create a Bash script file (Bash scripts are covered in Chapter 25) in the /etc/profile.d/ directory. Be sure to use the .sh file extension. The file must be owned by root and belong to the root group. Set the file’s other (world) permissions to r so that all users can read the file. The script is read by the /etc/profile or /etc/bashrc file, depending on your distribution, when a user logs into the system or starts a new shell.

If only certain users need this particular PATH modification, make it in their ~/.profile, ~/.bash_profile, or ~/.bash_login file. Environment files were discussed in Chapter 10.
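A script placed in /etc/profile.d/ or in a user’s environment file might add a directory to PATH only when it is not already there, as in this sketch. The function names and the /opt/app/bin directory are assumptions chosen for illustration.

```shell
# Succeeds when the given directory is already an entry in PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Append a directory to PATH unless it is already present.
add_to_path() {
    path_contains "$1" || PATH="$PATH:$1"
}

# Example use (hypothetical application directory):
#   add_to_path /opt/app/bin
#   export PATH
```

Guarding the append keeps PATH from growing each time a new shell reads the file.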

Gaining GCC Compatibility

The most common tool used for compiling programs in Linux is the GNU Compiler Collection (GCC). If you have problems compiling an application on Linux with GCC, there are several potential causes. They are as follows:

  • GCC uses the system C library, which might not be compliant with the ISO C standard.
  • There are several notable incompatibilities between GNU C and non-ISO versions of C.
  • GCC uses corrected versions of system header files, which can cause issues.

Note that besides these issues, you might be using an older version of gcc and need to update it. For example, if your system distro is CentOS 7 and you are using gcc v4.4.*, you need to upgrade the GCC package.

You can find detailed documentation on the GCC compiler at its website— https://gcc.gnu.org/. This site includes FAQ and other useful information.

Perusing Repository Problems

The very first thing to check when you get an odd error message concerning a package that cannot be found, updated, or installed is your network connection. Often a system that is not network-connected causes this problem. However, various package repositories (also called repos) can become corrupted.

On a system using a Debian package manager, such as Ubuntu, if you get a message saying it cannot download repository information or something similar, use the apt-get clean command. This command cleans up the database and any temporary download files. After that, try to update the local repository with apt-get update, which attempts to retrieve updated information about packages in the repository.

On a Debian package management system, consider using the apt-get dist-upgrade command instead of apt-get upgrade to update all the system’s packages. The dist-upgrade prevents any software from being upgraded that will break a dependent package.

On a system using a Red Hat package manager, such as CentOS, you can employ the yum clean all or zypper clean -a command, depending upon your distro. Next update the local repository with the yum check-update or zypper refresh command.

If you are attempting to install or update packages from a non-standard repository, you may need to enable that repo on your system. To see a list of the enabled repositories on your system, use the yum repolist or zypper repos command on Red Hat package systems. For Debian package systems, you’ll need to issue the grep -v "#" /etc/apt/sources.list command to see the enabled repositories.
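On Debian package systems, a slightly stricter filter than grep -v "#" lists only the active repository lines and skips blank lines. This is a sketch; the function name enabled_apt_sources is hypothetical, and it assumes the one-line deb and deb-src entry style in the file it is given.

```shell
# Print the active (uncommented) deb and deb-src lines from a sources
# file. Reads /etc/apt/sources.list unless another file is given.
enabled_apt_sources() {
    grep -E '^[[:space:]]*deb(-src)?[[:space:]]' "${1:-/etc/apt/sources.list}"
}
```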

Before you add any additional non-standard repositories, back up the repository file(s), such as these:

  • /etc/apt/sources.list
  • /etc/yum.repos.d/*.*
  • /etc/zypp/repos.d/*.*

You have to manually edit the sources.list file to add and enable a new repository. To add and enable a new repo with YUM, use the yum-config-manager --add-repo repository-url command. The zypper command is similar but also requires a repository name alias—zypper addrepo repository-url alias.

Keep in mind that we have only touched on a few of the more common problems you can run into with programs and their various dependencies. Be sure to employ your distribution’s man pages for additional help.

Looking at SELinux Context Violations

Application issues can be caused by your system’s Linux kernel security module, such as SELinux (covered in Chapter 15). An incorrect policy configuration, which triggers a violation, can prevent applications from serving their purpose. Check the audit log file using the sealert command first. If this tool is not available, you can install it via the setroubleshoot package.

A mislabeled file can cause problems, such as access being denied to applications. Use the ls -Z command to view a file’s SELinux context. If it or its parent directory needs to have their context changed, use the chcon utility to modify it, the semanage command to make it permanent, and restorecon to fix the labels. Don’t forget to employ all three of those commands or you won’t resolve the problem.
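When comparing labels, the type field is usually the part of the context that matters. The helper below pulls it out of a context string such as one shown by ls -Z, and the commented commands sketch the three-step relabeling described above. The function name context_type, the /srv/app path, and the httpd_sys_content_t type are illustrative assumptions.

```shell
# Print the type field (third colon-separated field) of an SELinux
# context string in the form user:role:type:level.
context_type() {
    echo "$1" | cut -d: -f3
}

# Illustrative relabeling sequence (hypothetical path and type; needs
# super user privileges on an SELinux-enabled system):
#   chcon -t httpd_sys_content_t /srv/app/index.html
#   semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
#   restorecon -Rv /srv/app
```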

You can change the mode for SELinux temporarily from enforcing to permissive via the setenforce permissive command with super user privileges. This allows you to make context changes and see if it triggers any violations without actually blocking access. Once you’ve got the correct SELinux policies in place, put it back into enforcing mode via the setenforce enforcing command.

An application that is confined by SELinux needs the proper Booleans set to allow appropriate access. The getsebool command will allow you to review the application’s Booleans. If you need to change them, employ super user privileges and the setsebool command.
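As a small sketch of that review, the hypothetical sebool_state helper below reads getsebool-style output (name --> on or off) and reports a single Boolean. The Boolean name in the example is an assumption used only for illustration.

```shell
# Hypothetical helper: read "getsebool -a"-style lines on stdin
# (name --> on|off) and print the state of one Boolean.
sebool_state() {
    awk -v name="$1" '$1 == name && $2 == "-->" { print $3 }'
}

# Example (httpd_can_network_connect is only an illustration):
#   getsebool -a | sebool_state httpd_can_network_connect
#   sudo setsebool -P httpd_can_network_connect on   # -P keeps the setting across reboots
```

Without the -P option, a setsebool change lasts only until the next reboot.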

If you are seeing a great deal of SELinux context violations in your log or journal files and have not had application problems in the past, it is possible that your system has an intruder. Use an intrusion detection tool to confirm.

Exploring Firewall Blockages

If an application is experiencing problems over the network and there are no network issues, you may want to check the local and remote systems’ firewalls. Any application updates or firewall modifications can trigger this problem. Firewalls were covered in Chapter 18.

Unrestricting ACLs

A firewall ACL identifies a network packet by reviewing its control information along with other network data. Therefore, when troubleshooting an application issue related to a firewall, you’ll need to gather the following information for the application packets traveling back and forth:

  • Source address or hostname
  • Destination address or hostname
  • Network protocol(s) used
  • Inbound port(s) used
  • Outbound port(s) used

You also need to know both your source and destination systems’ firewall application in use. When you’ve gathered this information, you can review the firewall settings on both sides to determine if the ACLs are overly restrictive.

For example, if you are using firewalld on your application’s host system, you can quickly check the current default zone, as shown in Listing 24.9.

Listing 24.9: Viewing the default zone with the firewall-cmd command

$ firewall-cmd --get-default-zone
drop
$

Notice that this system has its firewalld default set to the drop zone. This means all incoming network packets are dropped and only outbound network connections are allowed. If the application receives data or connections from other systems, then this firewall ACL setting is overly restrictive.
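A quick check along these lines can be scripted. In the sketch below, zone_refuses_inbound is a hypothetical helper. It relies on the behavior of firewalld's predefined drop and block zones and knows nothing about zones that have been customized locally.

```shell
# Hypothetical helper: succeed (exit 0) when the named firewalld zone refuses
# unsolicited inbound connections. Among firewalld's predefined zones, "drop"
# silently discards them and "block" rejects them.
zone_refuses_inbound() {
    case "$1" in
        drop|block) return 0 ;;
        *)          return 1 ;;
    esac
}

# Example:
#   zone=$(firewall-cmd --get-default-zone)
#   zone_refuses_inbound "$zone" && echo "Default zone '$zone' refuses inbound traffic"
#   sudo firewall-cmd --zone="$zone" --list-all
```

The --list-all output shows the services and ports the zone currently allows, which is the next thing to compare against the application's needs.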

Unblocking Ports

If your application relies on another system service (daemon), you’ll want to check rules related to the service port. Blocking a port needed by the external service would adversely affect the application. If your application is designed to use a port that is not dedicated to a well-known service, check it as well.

For example, if you are providing public web services on your system, you need to allow incoming and outbound packets associated with the HTTPS protocol port 443. If your system is using the iptables firewall software, you can view the current ACL rules via the iptables -L command. If the packet filter is blocking port 443 via a particular rule or policy, you can modify the chain via a command similar to the one in Listing 24.10, which opens up port 443.

Listing 24.10: Modifying the firewall with the iptables command

# iptables -A INPUT -p tcp --dport 443 -j ACCEPT
#

Keep in mind you also need to modify the OUTPUT chain rules to allow your web server to establish connections. In addition, if your application allows HTTP traffic, you must modify rules for port 80 as well.
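To keep the INPUT and OUTPUT rules for several ports consistent, you could generate them for review before applying anything. The print_web_rules helper below is hypothetical, and the form of the OUTPUT rule (matching on the source port of replies) is an assumption about a simple stateless rule set, not the only way to write it.

```shell
# Hypothetical helper: print (do not run) iptables commands that accept
# inbound TCP traffic to each listed port and the replies sent from it.
print_web_rules() {
    local port
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
        echo "iptables -A OUTPUT -p tcp --sport $port -j ACCEPT"
    done
}

# Example: review the output, then run the lines with super user privileges.
#   print_web_rules 80 443
```

Because -A appends to the end of a chain, an earlier rule that drops the traffic still matches first. In that case, iptables -I, which inserts at the top of the chain by default, may be the better choice.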

View firewall log file entries as you investigate application problems. If needed, you can often increase the amount of information that is logged. For example, the UFW firewall has a full setting, which logs everything.

Unblocking Protocols

Besides ports, be aware of the various protocols, such as UDP, TCP, and ICMP, that your application employs. If it uses another system service, you must know the protocols it uses as well. The /etc/services file can help.

For example, suppose an application is working with a DNS caching server on the local network. The DNS protocol uses port 53. Check the /etc/services (well-known ports) file to find the transport protocols it employs, as shown in the snipped output in Listing 24.11.

Listing 24.11: Checking DNS’s protocols in the /etc/services file

$ grep 53 /etc/services
domain          53/tcp                       # Domain Name Server
domain          53/udp
[…]
$

Unblock port 53 on the DNS server system for both TCP and UDP, since DNS listens for requests using those two transport protocols. Also, unblock them for both inbound and outbound packets.
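A plain grep 53 also matches unrelated entries such as port 5353, which is why the listing output had to be snipped. As a minimal sketch, the hypothetical protocols_for_port helper below matches the port number exactly and prints only the transport protocols.

```shell
# Hypothetical helper: list the transport protocols registered for an exact
# port number in a services file (default /etc/services).
protocols_for_port() {
    local port="$1" file="${2:-/etc/services}"
    awk -v port="$port" '
        /^[[:space:]]*#/ { next }              # skip comment lines
        {
            split($2, parts, "/")              # field 2 looks like 53/tcp
            if (parts[1] == port) print parts[2]
        }
    ' "$file" | sort -u
}

# Example:
#   protocols_for_port 53
```

Each protocol printed is one you need to allow through the firewall for that service.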

Troubleshooting Additional Hardware Issues

Linux requires hardware to operate. When hardware stops working correctly, Linux does not function properly. Thus, understanding how to troubleshoot all hardware is an essential skill as a Linux system administrator.

Looking at Helpful Hardware Commands

When you are troubleshooting hardware problems, there are many Linux command-line tools that can help. The lspci, lsusb, and lsdev commands are a few, which were introduced in Chapter 23. We’ll cover a couple more great utilities, dmidecode and lshw, here.

Understanding the dmidecode Utility

Before looking at the dmidecode command, you need to know about the Distributed Management Task Force (DMTF) and its standards. The DMTF is a nonprofit organization whose goal is to simplify the management of network-accessible technologies, like servers, through standards. In essence, it helps to make system administration easier.

DMTF created the Desktop Management Interface (DMI) and System Management BIOS (SMBIOS) standards. The DMI specification consists of four components, which provide information about the hardware being used on a computer as well as some additional helpful data. The SMBIOS standard consists of items, such as data structures, used to read management information produced by a computer’s BIOS. These two standards interact with each other and are widely adopted by hardware manufacturers.

To use these standards, you need two things—a DMI/SMBIOS compliant computer and a software interface to their data structures. The software interface on a Linux system is the dmidecode utility.

The dmidecode utility pulls its information, by default, from the sysfs filesystem and specifically from tables in the /sys/firmware/dmi/tables/ directory. You can check if those tables exist on your system by using the command in Listing 24.12. Notice on this system that the tables exist.

Listing 24.12: Checking for tables in the sysfs filesystem

# ls /sys/firmware/dmi/tables
DMI  smbios_entry_point
#

The help option on the dmidecode utility describes the various options you can use to uncover information in your hardware troubleshooting process. While you must use super user privileges with the command for extracting table information, you don’t have to do so for getting help. An example is shown in Listing 24.13.

Listing 24.13: Looking at the dmidecode help facility

$ dmidecode -h
Usage: dmidecode [OPTIONS]
Options are:
 -d, --dev-mem FILE     Read memory from device FILE (default: /dev/mem)
 -h, --help             Display this help text and exit
 -q, --quiet            Less verbose output
 -s, --string KEYWORD   Only display the value of the given DMI string
 -t, --type TYPE        Only display the entries of given type
 -u, --dump             Do not decode the entries
     --dump-bin FILE    Dump the DMI data to a binary file
     --from-dump FILE   Read the DMI data from a binary file
     --no-sysfs         Do not attempt to read DMI data from sysfs files
 -V, --version          Display the version and exit
$

Of the various options, the most useful for troubleshooting is the -t, or --type, switch. This allows you to pull specified information from the DMI/SMBIOS tables via providing an argument, which is either a number or a keyword. The keyword argument can be one of the following:

  • baseboard
  • bios
  • cache
  • chassis
  • connector
  • memory
  • processor
  • slot
  • system

If the tables do not contain the needed information, you will only receive a message about where the utility attempted to extract data and possibly DMI and/or SMBIOS standard versions supported. Two examples are shown in Listing 24.14. Notice that there is no memory information available in the tables, but some system data is displayed.

Listing 24.14: Looking at dmidecode table data

# dmidecode -t memory
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.5 present.

#
# dmidecode -t system
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.5 present.

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: innotek GmbH
        Product Name: VirtualBox
        Version: 1.2
        Serial Number: 0
        UUID: 3909BE96-5CA6-4801-8236-D6113BB5D2CF
        Wake-up Type: Power Switch
        SKU Number: Not Specified
        Family: Virtual Machine
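When you only need one value from output like this, a small filter can help. The dmi_field helper below is a hypothetical sketch that reads dmidecode output on standard input. It assumes the indented "Label: value" layout shown in the listing.

```shell
# Hypothetical helper: read dmidecode output on stdin and print the value of
# one "Label: value" field, such as "Product Name".
dmi_field() {
    awk -v label="$1" '
        {
            line = $0
            sub(/^[[:space:]]+/, "", line)           # drop indentation
            prefix = label ": "
            if (index(line, prefix) == 1)
                print substr(line, length(prefix) + 1)
        }
    '
}

# Example (super user privileges are needed to read the tables):
#   sudo dmidecode -t system | dmi_field "Product Name"
#   sudo dmidecode -s system-product-name    # built-in single-string lookup
```

For the strings that dmidecode already knows by keyword, the -s option shown in the help output is the simpler route. The filter is mainly useful for fields that have no keyword.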

If you are using a virtualized Linux system, the information from the dmidecode utility is suspect. Also, do not rely on this utility alone for hardware information. Its man page even states, “More often than not, information contained in the DMI tables is inaccurate, incomplete, or simply wrong.”

Understanding the lshw Utility

Hardware information is stored in various /proc/ directory files on your system. While you could go rooting around and dig it out yourself, the lshw utility does it for you. It provides data on your system’s processor(s), memory, NIC(s), USB controller(s), disk(s), and so on. It is typically installed by default on most distributions or available in a standard repository (Chapter 13 covered installing software packages).

Two helpful options are -short, which produces a nice table-formatted hardware data display, and -businfo, which shows information associated with SCSI, USB, IDE, and PCI devices. An example of using the -short option is shown snipped in Listing 24.15.

Listing 24.15: Using the -short option with the lshw command

# lshw -short
H/W path          Device      Class       Description
=====================================================
                              system      VirtualBox
/0                            bus         VirtualBox
/0/0                          memory      128KiB BIOS
/0/1                          memory      4GiB System memory
/0/2                          processor   Intel(R) Core(TM) […]
/0/100                        bridge      440FX - 82441FX PMC [Natoma]
/0/100/1                      bridge      82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1        scsi1       storage     82371AB/EB/MB PIIX4 IDE
/0/100/1.1/0.0.0  /dev/cdrom  disk        CD-ROM
/0/100/2                      display     VirtualBox Graphics Adapter
/0/100/3          enp0s3      network     82540EM Gigabit Ethernet Controller
/0/100/4                      generic     VirtualBox Guest Service
/0/100/5                      multimedia  82801AA AC'97 Audio Controller
/0/100/6                      bus         KeyLargo/Intrepid USB
/0/100/6/1        usb1        bus         OHCI PCI host controller
/0/100/7                      bridge      82371AB/EB/MB PIIX4 ACPI
/0/100/8          enp0s8      network     82540EM Gigabit Ethernet Controller
/0/100/d          scsi2       storage     82801HM/HEM (ICH8M/ICH8M-E) SATA […]
/0/100/d/0        /dev/sda    disk        16GB VBOX HARDDISK
/0/100/d/0/1                  volume      1GiB Linux filesystem partition
/0/100/d/0/2      /dev/sda2   volume      13GiB Linux LVM Physical Volume partition
[…]
/0/4                          input       PnP device PNP0f03
/1                virbr0-nic  network     Ethernet interface
[…]
#

You can also employ the -class option with the lshw utility. This provides detailed information concerning a particular hardware component. The different classes available are displayed in the lshw -short command’s output. A snipped example of using the -class option is shown in Listing 24.16.

Listing 24.16: Using the -class option with the lshw command

$ sudo lshw -class display
  *-display
       description: VGA compatible controller
       product: VirtualBox Graphics Adapter
       vendor: InnoTek Systemberatung GmbH
[…]
       configuration: driver=vboxvideo latency=0
[…]
$

Another nice utility that can provide hardware information is the hwinfo command. It provides additional data for your troubleshooting process. If it is not installed by default on your distro, consider manually installing it.

Investigating Other Hardware Problems

Occasionally you have a hardware problem that is uncommon. Being able to quickly address these unique issues will make you stand out from your peers.

Memory Physical problems with RAM are tricky to diagnose. Symptoms include performance that slows over time, a system that appears to hang when a memory-intensive application is running or at boot, intermittent kernel panics or segmentation faults, sporadically corrupted files, and failed program installations.

First make sure it is not a memory capacity issue, which often shows symptoms similar to hardware problems. Check via the free and vmstat utilities.
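For a scripted version of that capacity check, you could read /proc/meminfo directly instead of parsing the output of free. The mem_available_pct helper below is a hypothetical sketch. It assumes a kernel recent enough to report the MemAvailable line.

```shell
# Hypothetical helper: print the percentage of RAM the kernel reports as
# available, using the MemTotal and MemAvailable lines of a meminfo file.
mem_available_pct() {
    local file="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%d\n", (avail * 100) / total
            else exit 1
        }
    ' "$file"
}

# Example:
#   mem_available_pct        # percentage of RAM currently available
#   free -h                  # human-readable summary
#   vmstat 5 3               # three samples, five seconds apart
```

A persistently low percentage, together with nonzero si and so columns in the vmstat output, points toward a capacity shortfall rather than faulty hardware.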

You can quickly determine hardware information concerning your RAM via the lshw utility. Just issue the lshw -class memory command.

If you recently added new memory, most likely you obtained a faulty component. Damage can also be done to RAM via power spikes or outages. In any case, you’ll want to conduct a test on the memory. Typically you can conduct such a test via a system reboot and accessing the memtest or memtest86+ option in the server’s boot menu. If this option is not available, you can employ the memtester utility. This command-line utility is typically not installed by default, but it is available either in your distribution’s repository or as an RPM or dpkg file (see Chapter 13). When using this utility, you’ll need to shut down any production applications and test the memory in chunks.

Printers External hardware devices are typically plug and play for Linux, but odd problems do arise. When dealing with printers, the issue typically comes down to either an outdated/incorrect driver (PPD) or a bad connection. Start by checking the kernel ring buffer with dmesg and taking a look at the printer error log files, such as /var/log/cups/error_log.

Don’t buy a doorstop. Make sure the printer your company is interested in purchasing is already supported by Linux. There are several websites that can help, such as www.openprinting.org/printers and www.linux-drivers.org/printer_scanner.html. In addition, use your favorite search engine and enter Linux Compatible Printers to find more.

If the printer was recently installed, check its configuration. You can do this via a web browser, if available, on the system by entering 127.0.0.1:631 in the address bar. If your system does not have a GUI, you can look through the /etc/cups/printers.conf file to review the printer’s configuration.

Determine how the printer connects to the system. Is it a network printer? Does it attach via a USB cable? Is the printer directly connected into a parallel port? If it is a network-connected printer, first check that the network is operational. If the printer is attached via a USB cable, start with troubleshooting the USB connections (covered later in this chapter). If it uses a parallel port on your system, it may be a bad adapter. If possible, consider switching it to a different connection type, such as USB, or obtaining a new printer with a more modern configuration.

Check if the printer’s PostScript Printer Description (PPD) file or driver needs updating. Go to the manufacturer or open-source driver website to determine if an update is available. You can view all the currently available printer drivers on your system via the lpinfo -m command. Keep in mind the problem may involve a needed printer firmware update, so check for those as well.

Some manufacturers provide their own Linux tools to assist in printer troubleshooting. For example, Hewlett Packard offers the hp-info and hp-toolbox utilities to help in managing and problem-solving their printers’ issues.

Video Hardware issues with video show up in sluggish displays, audio lag, glitches on the screen, and so on. You may even see a black screen or receive no audio output. Some problems can even cause the system to crash or hang.

As with many other hardware problems, first check the kernel ring buffer (dmesg) and video log files. If your system is still using X11, check the journal file or the /var/log/Xorg.0.log file. If you are employing Wayland, check the journal file via the journalctl command.
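When scanning an Xorg log, the error lines are the ones marked (EE). The xorg_errors helper below is a hypothetical sketch. It anchors the match at the start of the line, after an optional timestamp, so that the marker legend near the top of the log is not reported as an error.

```shell
# Hypothetical helper: print error lines from an Xorg log. Xorg marks errors
# with (EE), optionally preceded by a [timestamp] column.
xorg_errors() {
    local file="${1:-/var/log/Xorg.0.log}"
    grep -E '^(\[[ 0-9.]+\] )?\(EE\)' "$file"
}

# Example:
#   xorg_errors
#   journalctl -b -p err     # error-level journal entries from the current boot
```

On systems where the display server logs to the journal rather than to a file, the journalctl line is the equivalent starting point.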

A graphics processing unit (GPU) exists on either a graphics card or on a motherboard. It is an assembly that performs some simple processing in order to relieve the CPU of such duties. Often a graphics card is called a GPU or GPU card.

To find out what graphics card driver your system is using, type lspci -vnn at the command line and redirect STDOUT to a file. Search the file for the word VGA to find your graphics card driver data. You can also employ lshw -class display (or video) and look for the driver information there. When you have the driver’s name, find out additional details through the modinfo driver-name command.
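Picking the driver name out of that output can be scripted. The driver_names helper below is a hypothetical sketch that understands two layouts: the driver= entry on lshw's configuration line and the "Kernel driver in use:" line that lspci -v prints.

```shell
# Hypothetical helper: read lshw or lspci -v style text on stdin and print
# each kernel driver name it mentions.
driver_names() {
    sed -n \
        -e 's/.*driver=\([^ ]*\).*/\1/p' \
        -e 's/.*Kernel driver in use: \(.*\)$/\1/p'
}

# Example:
#   sudo lshw -class display | driver_names
#   modinfo "$(sudo lshw -class display | driver_names | head -n 1)"
```

If nothing is printed, no driver is bound to the device, which is itself a useful finding.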

Check the manufacturer’s or open-source site to see if there is an updated graphics card driver available. If not, try testing the card on another system to see if you need to replace it.

Some manufacturers provide their own utilities to manage their GPU cards. For example, Nvidia provides the nvidia-smi and nvidia-settings commands for their graphics cards.

Communications Ports A communications port is a serial communications port. Though a rarity nowadays, it is often used to connect hardware such as point-of-sale devices. The device files that represent these serial ports are /dev/ttyS#.

When experiencing problems with a serial communications device, start the troubleshooting process by issuing dmesg | grep ttyS to find the device file name in use. When you have the full device file name, employ the setserial utility. This will provide detailed information on the serial device. Use super user privileges, and type in setserial -a device-file-name at the command line. Look for the interrupt request (IRQ) number in the output.

When you know the IRQ of the serial device, you can check the interrupts file. If you do not find the IRQ number in the /proc/interrupts file, this indicates that the appropriate driver for the serial device is not loaded.
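That lookup is easy to get wrong by eye, because a search for IRQ 4 also turns up 14 and 24. The irq_registered helper below is a hypothetical sketch that matches the IRQ number exactly at the start of a line.

```shell
# Hypothetical helper: succeed when the given IRQ number has a line in an
# interrupts file (default /proc/interrupts).
irq_registered() {
    local irq="$1" file="${2:-/proc/interrupts}"
    grep -Eq "^[[:space:]]*${irq}:" "$file"
}

# Example (assumes setserial reported IRQ 4 for /dev/ttyS0):
#   sudo setserial -a /dev/ttyS0
#   irq_registered 4 || echo "IRQ 4 not in use: check that the serial driver is loaded"
```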

If the driver is loaded, check the manufacturer’s website for a newly updated driver. Also, check the serial device’s recommended configuration and make any modifications needed via the setserial utility.

USB If you have a USB device, such as a printer, directly attached to your system and problems occur, there are some simple troubleshooting techniques you can employ. First ensure that the USB module (driver) is loaded into the kernel by using super user privileges and typing in lsmod | grep usb at the command line. If you get a response, it is loaded. If you just get a prompt back, then employ the modprobe command to load the module (Chapter 14).

If the driver is already loaded, try detaching the device’s USB cable from the system. Watch the journal file via the journalctl -f command. If you are on an older Linux system, use the tail -f command on the appropriate log file, such as /var/log/syslog or /var/log/messages. After the watch is in place, plug the USB device’s cable back in and see what log messages are generated. If the USB device is a printer, also check the /var/log/cups/error_log for any pertinent information. You may uncover some important details here.

When you have completed that activity, employ the lsusb -v command to see if the device is showing up on the USB bus. If you see the device’s manufacturer and product information, then Linux can see the device. If the lsusb utility is not installed on your system, look through the kernel ring buffer via dmesg.

Check the USB’s device files for corruption. This topic was covered earlier in the chapter in the Missing Devices list item in the section “Exploring Common Issues.”

If your USB device is still not working, try attaching it to a different USB port. However, prior to doing so, put another watch on the appropriate journal or log files. You may also want to try switching out your USB cable to see if that resolves the issue(s).

Keyboard Mapping If you press a key on your keyboard and a different letter appears on the screen, most likely you have a keyboard mapping issue. The fix depends on the particular distribution you are using.

For Red Hat–based distros, type in localectl with no options and your current key map will display. To see the list of available key maps, enter localectl list-keymaps and a list of available key mappings will display. This list can be rather large, so you might want to pipe STDOUT to the less utility for your perusal. When you find the appropriate key mapping name, permanently set it by typing in localectl set-keymap keymap-name at the command line.
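The same steps can be sketched in script form. The current_vc_keymap helper below is hypothetical. It assumes the "VC Keymap:" line that localectl prints in its status output, and the de key map in the example is only an illustration.

```shell
# Hypothetical helper: read "localectl status" output on stdin and print the
# console (virtual console) keymap.
current_vc_keymap() {
    awk -F': *' '/VC Keymap/ { print $2 }'
}

# Example:
#   localectl status | current_vc_keymap
#   localectl list-keymaps | less
#   sudo localectl set-keymap de     # "de" is only an illustration
```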

You may wonder how you will enter these commands if your keyboard is not properly mapped. Write the commands down, and then try the various keyboard keys until you find each key that corresponds with every command letter or symbol and record it. Now use the recorded keyboard keys to enter the commands. Ta-da!

For Debian-based distros, use super user privileges and enter the dpkg-reconfigure keyboard-configuration command. This will enter you into a text-based menu system where you can select the appropriate keyboard mapping.

Hardware or Software Compatibility Issues Before you purchase any new hardware (or software for that matter), make sure it will work with your Linux distribution. Keep in mind that while Linux is the number-one operating system kernel for super computers and a strong contender in the server world, it does not always get the attention it deserves from hardware manufacturers. Therefore, often drivers are not available for brand-new devices, or you may end up with a manufacturer’s poorly written device driver. Check with the Linux community to find well-developed drivers and hardware device recommendations. You’ll save yourself a lot of trouble and headaches.

Summary

From application directory and file permissions to overly restrictive firewall ACLs and incorrect SELinux contexts, there are many issues that can cause an application to not function properly. In addition, hardware problems such as bad disk sectors, memory module corruption, flaky USB cables, and device drivers that need updating all require a knowledgeable troubleshooter. Having a firm grasp on this chapter’s concepts will help you achieve that distinction.

Exam Essentials

Summarize application permission issues. When an application throws an error relating to either I/O or an attempt to launch another executable, it can be due to an incorrect file or directory permission. Determine what user account the application is running under as well as any files it is attempting to access and their residing directories. With that information in hand, gather file ownership and group membership. Looking at the various permissions associated with each of the three permission classifications (owner, group, other) will begin to uncover the core problem. Include directory permissions and default ACLs as well in the investigation.

Describe storage problems. Common storage issues involve degraded storage, missing devices and/or volumes, absent mount points, and performance issues. They also may include storage integrity problems and/or resource exhaustion. The dmesg utility is essential for its use in uncovering root causes of problems with SATA and SCSI drives as well as the HBA. Uncovering and fixing RAID issues also requires the use of the Multiple Devices (md) utility and /proc/mdstat file.

Explain application dependencies. Using the appropriate utility and checking an application’s version as well as available package versions will allow you to uncover whether or not a poor performing application’s software has an upgrade available for it. Updating software packages, however, is not without problems. A software update may not properly update a package’s dependencies or libraries, resulting in a broken application. If the new update needs to be compiled, issues with the GCC can cause complications. The system’s package repository can have uncovered troubles, which prevent a software update from occurring.

Detail restrictive firewall ACLs. Applications that communicate with data, services, or end users over a network may run into problems with overly restrictive firewall settings. Gather together the source address (or host), destination address, and network protocols employed as well as the inbound and outbound ports used on both the client and server side. Using this basic information, review the firewall’s various ACLs. If a firewall setting is blocking this needed access, review the potential needed changes prior to enacting them.

Summarize uncommon hardware issues. RAM, printers, video apparatus, serial ports, USB devices, and keyboards can provide interesting problems to troubleshoot. Employing the dmidecode and lshw utilities as well as the dmesg, lspci, lsusb, and lsdev commands provides assistance in uncovering the root causes. Missing or outdated modules (drivers), faulty cables, corrupt devices files, and incorrect key maps are some of the problem sources. You can save yourself some time and avoid issues in the first place if you ensure that your hardware and software are compatible prior to installing them.

Review Questions

  1. Peter’s system has a memory-intensive application running on it continually. To help improve performance, he has replaced the old hard drives with solid-state drives instead of increasing RAM. Which of the following is most likely true about this situation?

    1. The SSD for application data will enter into a degraded mode.
    2. The SSD for swap will become degraded storage.
    3. The SSD will need a namespace in its device file name.
    4. The SSD will end up a missing volume.
    5. The SSD will experience resource exhaustion.
  2. Mary adds the first SCSI disk to a Linux system that currently has only IDE drives. The system is not recognizing the new disk. Which of the following commands should she employ to troubleshoot the problem? (Choose all that apply.)

    1. ls /sys/bus/scsi/drivers
    2. pvscan /dev/vg00/lvol0
    3. lsmod | grep module-name
    4. hdparm -B 127 device-filename
    5. smartctl -a
  3. The system administrator, Norman, runs a Python program and receives an IO Error: [Error 13] Permission denied message. What information should he gather (or know) to start troubleshooting this issue? (Choose all that apply.)

    1. The disk type, where the program resides
    2. His user account name
    3. The program action that raised the error
    4. File name and directory location of the program’s I/O files
    5. The program’s name
  4. Harry has modified an application to create a file in a directory and then write data to it. The program creates the file with no problems but cannot write data to it and receives a permission error. Which of the following is most likely the issue?

    1. Directory ownership
    2. File ownership
    3. File group membership
    4. Permission inheritance
    5. Executable privileges
  5. Ben updates his Ubuntu system’s packages using the sudo apt-get upgrade command, and now the Apache Web service is not working properly. What command should he run?

    1. sudo apt-get clean
    2. sudo zypper clean -a
    3. sudo ldd /usr/sbin/apache2
    4. sudo rpm -aV
    5. sudo apt-get check
  6. Peter writes a new C++ application to use for managing his older Linux server. The new app contains no programming or logic errors. However, when he tries to compile it, it does not work. Which of the following is most likely the issue?

    1. An incorrect application permission
    2. An incorrect file permission
    3. A missing or outdated GCC
    4. A missing or outdated device
    5. A repository problem
  7. Mary confirms via the sealert utility that her application cannot access the file flash.txt. What command should she use next?

    1. ls -l flash.txt
    2. ls -Z flash.txt
    3. ls -l flash.txt-directory
    4. setroubleshoot
    5. restorecon
  8. A clock-in/out application, which uses an NTP server on the local network, is throwing an error concerning reaching the server. There are currently no network problems. Which of the following are steps in the troubleshooting process for this issue? (Choose all that apply.)

    1. Check the firewall ACLs on the NTP server.
    2. Check the firewall ACLs on the application server.
    3. Use the firewall-cmd --get-default-zone command.
    4. Check the /etc/services file for NTP ports and transport protocols.
    5. View firewall log entries.
  9. Your system administrator team member Norman tells you the device located at the communications port is not working. What command should you issue to start the troubleshooting process?

    1. dmesg | grep -i COM
    2. dmesg | grep -i ttys
    3. sudo setserial -a /dev/COM1
    4. sudo setserial -a /dev/ttyS0
    5. cat /proc/interrupts
  10. Harry’s newly installed USB printer is not working. The system employs CUPS. Which of the following are steps that may be included in the troubleshooting process? (Choose all that apply.)

    1. Issue the less /etc/printcap command.
    2. Use the lpinfo -m command to view available USB ports.
    3. Put a watch on the appropriate log file and plug in the USB cable.
    4. Use the dmesg and grep utilities to find printer information.
    5. Use the lsusb -v command to see if the device is on the USB bus.