Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 20
Analyzing System Properties and Remediation

Objective 4.1: Given a scenario, analyze system properties and remediate accordingly

images Even well-maintained Linux systems run into problems. New or modified applications introduce different performance variables, unforeseen incidents cause outages, and aging hardware components may fail. Minimizing their effects requires understanding troubleshooting techniques and tools as well as the interactions between various system components.

Troubleshooting the Network

When network problems occur (and they will), devise a troubleshooting plan. First identify symptoms, review recent network configuration changes, and formulate potential problem cause theories. Next, using the Open Systems Interconnection (OSI) model as a guide, look at hardware items (for example, cables), proceed to the Data Link layer (for example, network card drivers), continue to the Network layer (for example, routers), and so on.

Exploring Network Issues

In order to properly create a troubleshooting plan, you need to understand various network configuration and performance components. Understanding these elements assists in creating theories about problem causes as well as helps your exploration process through the OSI model.

Speeding Things Up

Familiarity with a few network terms and technologies will help in troubleshooting network problems and improving its performance.

Bandwidth Bandwidth is a measurement of the maximum data amount that can be transferred between two network points over a period of time. This measurement is typically represented by the number of bytes per second.

As an example, think about road design. Some roads are designed to handle cars traveling at 65 mph (~105 kph) safely. Other roads can only deal with traffic moving at around 35 mph (~56 kph).

Throughput Throughput is a measurement of the actual data amount that is transferred between two network points over a period of time. It is different from bandwidth in that bandwidth is the maximum rate and throughput is the actual rate.

Throughput may not reach the maximum bandwidth rate due to items such as a failing NIC or simply the protocol in use. Returning to the roadway analogy, though some roads can handle cars traveling at 65 mph safely, some cars may travel slower due to potholes on the road.

Saturation Network saturation, also called bandwidth saturation, occurs when network traffic exceeds capacity. In this case, a towel analogy is helpful. Imagine you have a towel that has absorbed as much water as it can. Once it can no longer absorb any more water, it has become saturated.

Saturation is also sometimes called congestion. Using our traffic analogy, when too many cars are on a particular roadway, congestion occurs and traffic slows.

Latency Latency is the time between a source sending a packet and the packet’s destination receiving it. Thus, high latency is slow, which is typically a problem, and low latency is fast, which is often desired.

High latency is often caused by low bandwidth or saturation. In addition, routers overloaded by network traffic may cause high network latency.

Jitter is a term used to indicate high deviations from a network’s average latency. For streaming services such as video, jitter can have a serious negative impact.

Routing Because a network is broken up into segments, you need routing to get packets from point A to point B through the network’s various segments. Routers connect these network segments and forward IP packets to the appropriate network segment toward their ultimate destination.

Routers contain buffers that allow them to hold onto network packets when their outbound queues become too long. However, if the router cannot forward its IP packets in a reasonable time frame, it will drop packets located in its buffer. This condition often transpires when network bandwidth saturation is occurring.

Some router manufacturers attempt to avoid packet loss by increasing their routers’ buffer size. This leads to a condition called bufferbloat, which increases network latency in congested segments due to packets staying too long in the router’s buffer. You can find out more information about bufferbloat, how to test your routers for it, as well as resolutions at www.bufferbloat.net.

Dealing with Timeouts and Losses

A packet drop, also called packet loss, occurs when a network packet fails to reach its destination. Unreliable network cables, failing adapters, network traffic congestion, and underperforming devices are the main culprits of packet drop.

UDP does not guarantee packet delivery. Therefore, in services like VoIP that employ UDP, minor packet loss does not cause any problems. Most VoIP software compensates for these packet drops. You may hear what sounds like choppiness in a person’s voice as you speak over a VoIP connection when it is experiencing minor packet drops.

TCP guarantees packet delivery and will retransmit any lost packets. Thus, if network packet loss occurs for services employing TCP, it will experience delays. If the packet drops are due to network traffic congestion, in some cases, the TCP packet retransmission only makes matter worse. Keep in mind that IP allows routers to drop packets if their buffer is full and they cannot send out their buffered packets fast enough. This will also cause TCP to retransmit packets.

Packet drops on a router can also be caused by a DOS attack, called a packet drop or black hole attack. A malicious user manually or through software gains unauthorized access to a router. The router is then configured to drop packets instead of forwarding them.

In network communication, timeouts are typically preset time periods for handling unplanned events. For example, you open a web browser and proceed to enter the address to a site you wish to visit. The website is down, but still the browser attempts a connection. After a predetermined amount of time (the designated timeout period), the browser stops trying to connect and issues an error message.

You may experience network communication timeouts for a number of reasons:

A system is down.
An incorrect IP address was used.
A service is not running or not offered on that system.
A firewall is blocking the traffic.
Network traffic is congested, causing packet loss.

Each of these items is worth exploring, if you or your system’s processes are experiencing timeouts related to network communications.

Resolving the Names

The process of translating between a system’s fully qualified domain name (FQDN) and its IP address is called name resolution. The Domain Name System (DNS) is a network protocol that uses a distributed database to provide the needed name resolutions.

Most systems simply use client-side DNS, which means they ask other servers for name resolution information. Using the /etc/resolv.conf and /etc/hosts files for configuring client-side DNS was covered in Chapter 7. However, there are a few additional items concerning name resolution problems and performance that you need to know.

Name Server Location With client-side DNS, when it comes to name server selection, location matters. If the name server you have chosen to set in the /etc/resolv.conf file is halfway around the world, your system’s name resolutions are slower than if you chose a physically closer name server.

Consider Cache A caching-only name server holds recent name resolution query results in its memory. If you are only employing client-side DNS, consider configuring your system to be a caching-only server, using software such as dnsmasq. By caching name resolutions, resolving speeds can improve significantly.

Secondary Server If you are managing a DNS server for your company, besides a primary (master) server, consider configuring a secondary (slave) server. This name server receives its information from the primary server and can increase name resolution performance by offloading the primary server’s burden.

Configuring It Right

Network configuration was covered in Chapter 7. However, there are a few additional special topics that may help you with troubleshooting.

Interface Configurations Being able to view a NIC’s configuration and status is important in the troubleshooting process. You may need to view its IP address, its MAC address, subnet configuration, error rates, and so on. In addition, understanding configuration items such as whether or not a NIC has a static or DHCP-provided IP address is part of this process.

Be aware that if you use NetworkManager on a system with firewalld as its firewall service, when a new network device is added, it will automatically be added to the firewalld default zone. If the default zone is set to public, the network device will only accept selected incoming network connections. See Chapter 18 for more details on firewalld.

Ports and Sockets Ports and sockets are important structures in Linux networking. Understanding the difference between the two will help in the troubleshooting process.

A port is a number used by protocols, such as TCP and UDP, to identify which service or application is transmitting data. For example, port 22 is a well-known port designated for OpenSSH, and DNS listens on port 53. TCP and UDP packets contain both the packet’s source and destination ports in their headers.

A program connection to a port is a socket. A network socket is a single endpoint of a network connection’s two endpoints. That single endpoint is on the local system and bound to a particular port. Thus, a network socket uses a combination of an IP address (the local system) and a port number.

Localhost vs. a Unix Socket The localhost designation and a Unix socket are often used for services, such as SQL. Being able to differentiate between the two is helpful.

Localhost is the host name for the local loopback interface, which was first described in Chapter 7. Localhost uses the IPv4 address of 127.0.0.1 and the IPv6 address of ::1. Basically, it allows programs on the current system to test or implement networking services via TCP without needing to employ external networking structures.

Unix sockets, also called Unix domain sockets, are endpoints similar to network sockets. Instead of between systems over a network, these endpoint sockets are between processes on your local system. Your system’s Unix domain sockets perform inter-process communications (IPC), which operates in a manner similar to a TCP/IP network. Thus, these sockets are also called IPC sockets.

If you have a service configuration choice between the two, typically a Unix socket will provide better performance than the localhost. This is due to the system employing normal networking behavior that consumes resources, such as performing data checksums and TCP handshakes when using localhost. In addition, due to special Unix socket files and the fact that Unix sockets understand file permissions, you can easily employ access control via setting access rights to these files.

Adapters Network adapters are system hardware that allows network communications. These communications can be wired or wireless. Adapters also come in USB form factors but are not typically used in enterprise server environments.

Common problems that arise with network adapters are faulty or failing hardware and incorrect or inefficient drivers. In regard to faulty hardware, error rates on adapters generally should not exceed 0.01% of the adapter’s bps throughput rate.

Though a network interface card (NIC) is an adapter, an adapter is not always a NIC. For example, a USB network adapter is not a NIC.

RDMA (Remote Direct Access Memory) A technology to consider, if your system’s network needs low latency, is RDMA. It allows direct access between a client’s and server’s memory. The results are significantly reduced network latency, higher bandwidth, and the side benefit of reducing the server’s CPU overhead.

Unfortunately, this technology requires special hardware. To use it on a standard Linux system Ethernet NIC, you’ll need to employ soft-RoCE. This software provides RDMA features over converged Ethernet (RoCE). What is really nice about soft-RoCE is that its driver is part of the Linux kernel, starting at v4.8.

Viewing Network Performance

Starting the troubleshooting process requires knowledge of the various tools to use. Here we provide a few tables to assist in your tool selection.

Since high latency (slowness) and network saturation tend to occur together, Table 20.1 shows tools you should use to tackle or monitor for these problems. Keep in mind that you should already know the bandwidth of the network segment(s) you are troubleshooting prior to using these tools.

Table 20.1 Commands to check for high latency, saturation

Command	Description
`iperf`, `iperf3`	Perform network throughput tests. The `iperf` command is version two of the utility, while `iperf3` is version three.
`iftop -i` `adapter`	Displays network bandwidth usage (throughput) for `adapter` in a continuous graph format.
`mtr`	Displays approximate travel times and packet loss percentages between the first 10 routers in the path from the source to the destination in a continuous graph or report format.
`nc`	Performs network throughput tests. (Called Netcat)
`netstat -s`	Displays summary statistics that are broken down by protocol and contain packet rates, but not throughput. This command is deprecated.
`ping`, `ping6`	Perform simple ICMP packet throughput tests and displays statistics on items such as round-trip times.
`ss -s`	Displays summary statistics that are broken down by socket type and contain packet rates but not throughput.
`tracepath`, `tracepath6`	Display approximate travel times between each router from the source to the destination, discovering the maximum transition unit (MTU) along the way.
`traceroute`, `traceroute6`	Display approximate travel times between each router from the source to the destination.

Some of these tools are not installed by default. Also, they may not be in a distribution’s standard repositories. See Chapter 13 for details on how to install software packages.

To employ the iperf utility for testing throughput, you’ll need two systems—one to act as the server and the other as a client. The utility must be installed on both systems, and you’ll need to allow access to its default port 5001 (port 5201 for iperf3) through their firewalls. A snipped example of setting up and starting the iperf server on an Ubuntu system is shown in Listing 20.1.

Listing 20.1: Setting up the iperf server

$ sudo ufw allow 5001
Rule added
Rule added (v6)
$
$ iperf -s -t 120
------------------------------------------------------------
Server listening on TCP port 5001
[…]

The iperf command’s -s option tells it to run as a server. The -t option is handy, because the service will stop after the designated number of seconds. This helps to avoid using Ctrl+C to stop the server.

Once you have the server side ready, configure the client side and perform a throughput test. A snipped example of setting up and starting an iperf client on a Fedora system is shown in Listing 20.2. Though the last output summary lists a Bandwidth column header, it is really showing you the achieved throughput.

Listing 20.2: Setting up the iperf client and conducting a throughput test

$ sudo firewall-cmd --add-port=5001/udp
success
$ sudo firewall-cmd --add-port=5001/tcp
success
$
$ iperf -c 192.168.0.104 -b 90Kb -d -P 5 -e -i 10
------------------------------------------------------------
Server listening on TCP port 5001 with pid 3857
[…]
Client connecting to 192.168.0.104, TCP port 5001 with pid 3857
[…]
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry    Cwnd/RTT
[…]
[SUM] 0.00-10.04 sec   640 KBytes   522 Kbits/sec  5/0         2
[..]
[SUM] 0.00-11.40 sec  1.19 MBytes   873 Kbits/sec  850    5:5:0:0:0:0:0:835
$

Notice that a firewall rule was added for UDP traffic as well as TCP. This is necessary due to the use of the -b switch on the iperf client, which requires UDP. There are many options available with the iperf and iperf3 utilities. The ones used in Listing 20.2 are described in Table 20.2.

Table 20.2 Basic iperf client-side options

Option	Description
`-c` `server-address`	Creates a client that connects to the server located at `server-address`.
`-b` `size`	Sets target bandwidth to `size` bits/sec (default is 1Mb).
`-d`	Performs a bidirectional test between client and server.
`-P` `n`	Creates and runs `n` parallel client threads.
`-e`	Provides enhanced output. Not available on older utility versions.
`-i` `n`	Pauses between periodic bandwidth reports for `n` seconds.

Another handy utility to test throughput is the Netcat utility, whose command name is nc. Like iperf, you need to set up both a server and a client to perform a test. Listing 20.3 shows an example of setting up a Netcat server on an Ubuntu system. Notice how the firewall is modified to allow traffic to the port selected (8001) for the Netcat server.

Listing 20.3: Setting up the nc server

$ sudo ufw allow 8001
Rule added
Rule added (v6)
$
$ nc -l 8001 > /dev/null

The -l option on the nc command tells it to go into listening mode and act as a server. The 8001 argument tells Netcat to listen on port 8001. Because Netcat is being used to test throughput, there is no need to display any received data. Thus, the data received is thrown into the black hole file (/dev/null).

To conduct a test with Netcat, employ a utility that will send packets to the server and allow you to see the throughput rate. The dd command (covered in Chapter 12) works well, and an example of conducting a test on a Fedora system is shown in Listing 20.4.

Listing 20.4: Setting up the nc client and conducting a throughput test

$ dd if=/dev/zero bs=1M count=2 | nc 192.168.0.104 8001 -i 2
2+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.10808 s, 19.4 MB/s
Ncat: Idle timeout expired (2000 ms).
$

Notice for the dd command, no output file (of) switch is used. This forces the command’s output to go to STDOUT, which is then redirected via the pipe (|) into the nc command. The nc command sends the output as packets through the network to the Netcat server’s IP address (192.168.0.104) at port 8001, where the listening server receives the packets. The -i 2 option tells Netcat to quit and return to the command line after 2 seconds of idle time.

The throughput rate is displayed when dd completes its operation. You can increase the test traffic sent by increasing the data amount designated by the dd command’s bs option and the number of times it is sent via the count switch. In the Listing 20.4 example, only 1Mb was sent two times.

On some distributions, you will find a netcat command. This command is simply a softlink to the nc utility and provided for convenience.

High latency is sometimes caused by overloaded routers. Faulty hardware or improper configurations, such as the router’s MTU being set too low, also contribute to the problem. The tracepath and traceroute utilities not only display what path a packet takes through the network’s routers but can provide throughput information as well allow you to pinpoint potential problem areas.

The mtr utility can provide a nice graphical display or a simple report showing a packet’s path, travel times, and items such as jitter. Listing 20.5 shows an example of using mtr to produce a static report.

Listing 20.5: Producing router performance reports with the mtr utility

$ mtr -o "L D A J" -c 20 -r 96.120.112.205
Start: 2018-12-21T13:47:39-0500
HOST: localhost.localdomain       Loss%  Drop    Avg  Jttr
  1.|-- _gateway                   0.0%     0    0.8   0.1
  2.|-- _gateway                   0.0%     0    3.8   7.1
  3.|-- 96.120.112.205             0.0%     0   15.5   1.8
$

The mtr command’s -o option allows you to specify what statistics to view. In this example, packet loss (L), drop (D), travel time average (A), and jitter (J) are chosen. The -c switch lets you set the number of times a packet is sent through the routers, and the -r option designates that you want a static report. To show a continuous graphical display, leave off the -c and -r options. The last mtr argument is the destination IP address.

A faulty or failing adapter also can contribute to high latency. If you suspect that errors, packet drops, or timeouts are causing network problems, try employing the utilities in Table 20.3 to display these statistics.

Table 20.3 Commands to find failing/faulty network adapters

Command	Description
`ethtool -S` `adapter`	Shows `adapter` summary statistics.
`ifconfig` `adapter`	Shows `adapter` summary statistics. This command is deprecated.
`ip -s link show` `adapter`	Shows `adapter` summary statistics.
`netstat -i` `adapter`	Shows `adapter` summary statistics. To view over time, add the `-c` `#` switch, where `#` is the number of seconds between displays. This command is deprecated.

Using the ip utility is shown snipped in Listing 20.6. Notice that even though no packets have been dropped, it does show error rates hovering around 0.05% (RX or TX packets / RX or TX errors). Any rate over 0.01% is enough to consider the adapter faulty.

Listing 20.6: Viewing network adapter statistics with the ip utility

$ ip -s link show enp0s8
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 […]
[…]
    RX: bytes  packets  errors  dropped overrun mcast
    201885893  3023900  1510    0       0       318
    TX: bytes  packets  errors  dropped carrier collsns
    24443239380 15852137 7922   0       0       0
$

The ping and ping6 utilities, covered earlier in Table 20.1, are helpful in discovering packet loss and timeout issues. In fact, they are often the first tools employed when such problems are occurring.

If you have a rather tricky network problem, it may be worth your while to directly look at the packets traveling over it. The tools to do so go by a variety of names, such as network sniffers and packet analyzers. Three popular ones are Wireshark, tshark, and tcpdump.

Wireshark (a GUI program) and tshark (also called terminal-based Wireshark) are closely linked and that causes confusion when it comes to their installation. Table 20.4 lists the packages names needed to obtain the correct tool for each distribution covered by this book.

Table 20.4 Listing Wireshark GUI and tshark package names

Distribution	Wireshark GUI Package	`tshark` Package
CentOS 7	`wireshark-gnome`	`wireshark`
Fedora 28	`wireshark`	`wireshark`
OpenSUSE Leap 15	`wireshark`	`wireshark`
Ubuntu 18.04 LTS	`wireshark`	`tshark`

Once you have the proper package installed, you can employ the tool to analyze packets. A simple snipped example of using tshark on an Ubuntu system is shown in Listing 20.7. The tshark command’s -i option allows you to specify the interface from which to sniff packets. The -c switch lets you specify the number of packets to capture.

Listing 20.7: Using tshark to view packets

$ sudo tshark -i enp0s8 -c 10
[…]
Capturing on 'enp0s8'
    1 0.000000000 192.168.0.100 → 239.255.255.250 […]
    2 0.493150205 fe80::597e:c86f:d3ec:8901 → ff02[…]
    3 0.683985479 192.168.0.104 → 192.168.0.101 SSH[…]
    4 0.684261795 192.168.0.104 → 192.168.0.101 SSH[…]
    5 0.684586349 192.168.0.101 → 192.168.0.104 TCP[…]
[…]
   10 1.198757076 192.168.0.104 → 192.168.0.101 SSH[…]
198 Server: Encrypted packet (len=144)
10 packets captured
$

Both tshark and tcpdump allow you to store the sniffed data into a file. Later the packet information can be viewed using the Wireshark utility, if you prefer viewing data via a GUI application.

Reviewing the Network’s Configuration

In the network troubleshooting process, you might want to check your various network configurations. Network adapter configurations were covered in Chapter 7. You can use the tools listed earlier in Table 20.3 as well as the nmcli utility to review adapter settings.

Within a local network segment, routers do not use an IP address to locate systems. Instead they use the system’s network adapter’s media access control (MAC) address. MACs are mapped to their server’s IPv4 address via the Address Resolution Protocol (ARP) table or IPv6 address Neighborhood Discovery (NDISC) table. An incorrect mapping or duplicate MAC address can wreak havoc in your network. Table 20.5 has the commands to use to investigate this issue.

Table 20.5 Commands to check for incorrect MAC mappings or duplicates

Command	Description
`arp`	Displays the ARP table for the network’s neighborhood. Checks for incorrect or duplicate MAC addresses. This command is obsolete.
`ip neigh`	Displays the ARP and NDISC tables for the network’s neighborhood. Checks for incorrect or duplicate MAC addresses.

A misconfigured routing table can also cause problems. Double-check your system’s routing table via the route command (deprecated) or the ip route show command.

Incorrect DNS information for your own servers is troublesome. Also, if you are considering changing your client-side DNS configuration, there are some utilities that can help you investigate slow query responses. The commands in Table 20.6 are good utilities to guide your investigations.

Table 20.6 Commands to research name server responses

Command	Description
`host` `FQDN`	Queries the DNS server for the `FQDN` and displays its IP address. Check the returned IP address for correctness.
`dig` `FQDN`	Performs queries on the DNS server for the `FQDN` and displays all DNS records associated with it. Check the returned information for correctness.
`nslookup`	Executes various DNS queries in an interactive or noninteractive mode. Check the returned information for correctness.
`whois`	Performs queries of Whois servers and displays `FQDN` information stored there. Check the returned information for correctness.

The nslookup utility is very handy for testing DNS lookup speeds. You follow the command with the FQDN to look up and then the DNS server you desire to test. Use it along with the time command to gather lookup time estimates as shown snipped in Listing 20.8.

Listing 20.8: Testing DNS lookup speeds with the nslookup utility and time command

$ time nslookup www.linux.org 8.8.8.8
[…]
Name:   www.linux.org
Address: 104.27.166.219
[…]
real    0m0.099s
[…]
$ time nslookup www.linux.org 9.9.9.9
[…]
Name:   www.linux.org
Address: 104.27.167.219
[…]
real    0m0.173s
[…]
$

If your system employs IPsets in its firewall or other configurations, you may want to review those as well. Use super user privileges and type in ipset list to see the various IPsets and then review their use within configuration files.

The Network Mapper (nmap) utility is often used for pen testing. However, it is also very useful for network troubleshooting. Though it’s typically not installed by default, most distros have the nmap package in their standard repositories.

There are a number of different scans you can run with nmap. The snipped example in Listing 20.9 shows using nmap inside the system’s firewall to see what ports are offering which services via the -sT options.

Listing 20.9: Viewing TCP ports and services using the nmap utility

$ nmap -sT 127.0.0.1
[…]
PORT    STATE SERVICE
22/tcp  open  ssh
631/tcp open  ipp
[…]
$

You can use nmap to scan entire network segments and ask for the mapper to fingerprint each system in order to identify the operating system running there via the -O option. To perform this scan, you need super user privileges as shown in snipped Listing 20.10.

Listing 20.10: Viewing network segment systems’ OSs using the nmap utility

$ sudo nmap -O 192.168.0.*
[…]
Nmap scan report for 192.168.0.102
[…]
Running: Linux 3.X|4.X
[…]
Nmap scan report for Ubuntu1804 (192.168.0.104)
[…]
Running: Linux 3.X|4.X
[…]

Do not run the network mapper utility outside your home network without permission. For more information, read the nmap utility’s legal issue guide at https://nmap.org/book/legal-issues.html.

Troubleshooting Storage Issues

Data storage is one of the areas where your systems can encounter problems. Trouble with storage tends to focus on failing hardware, disk I/O latency, and exhausted disk space. We’ll focus on those three issues in the following sections.

Running Out of Filesystem Space

Nothing can ruin system uptime statistics like application crashes due to drained disk space. Two utilities that assist in troubleshooting and monitoring filesystem space are the du and df commands, which were covered in Chapter 11.

The df utility allows you to view overall space usage. In the example in Listing 20.11, only the ext4 filesystems are viewed via the -t option, and the results are displayed in human-readable format (-h), providing a succinct display.

Listing 20.11: Viewing filesystem space totals using the df utility

$ df -ht ext4
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       9.8G  7.3G  2.0G  79% /
$

If you see a filesystem whose usage is above desired percentages, locate potential problem areas on the filesystem via the du command. First obtain a summary by viewing the filesystem’s mount point directory and only display space used by first-level subdirectories via the -d 1 option. An example is shown snipped in Listing 20.12 for a filesystem’s whose mount point is the / directory.

Listing 20.12: Viewing subdirectory space summaries using the du utility

$ sudo du -d 1 /
[…]
2150868 /var
[…]
48840   /home
{…]

After you find potential problem subdirectories, start digging down into them via du to find potential space hogs, as shown snipped in Listing 20.13.

Listing 20.13: Finding potential space hogs using the du utility

$ sudo du /var/log
[…]
499876  /var/log/journal/e9af6ca5a8fb4a70b2ddec4b1894014d
[…]

If you find that the filesystem actually needs the disk space it is using, the only choice is to add more space. If you set up the original filesystem on a logical volume, adding space via LVM tools is fairly simple (creating a logical volume was covered in Chapter 11).

If you don’t have an extra physical volume in your volume group to add to the filesystem volume needing disk space, do the following:

Add a spare drive to the system, if needed.
Create a physical volume with the pvcreate command.
Add the new physical volume to the group with vgextend.
Increase the logical volume size by using the lvextend command.

Waiting on Disk I/O

If a disk is experiencing I/O beyond what it can reasonably handle, it can slow down the entire system. You can troubleshoot this issue by using a utility that displays I/O wait times, such as the iostat command. I/O wait is a performance statistic that shows the amount of time a processor must wait on disk I/O.

If you find that the iostat utility is not installed on your system, install the sysstat package to obtain it. Package installation was covered in Chapter 13.

The syntax for the iostat command is as follows:

iostat [OPTION] [INTERVAL] [COUNT]

If you simply enter iostat at the command line and press Enter, you’ll see a static summary of CPU, filesystem, and partition statistics since the system booted. However, in troubleshooting situations, this is not worthwhile. There are a few useful iostat options to use for troubleshooting:

-y: Do not include the initial “since system booted” statistics.
-N: Display registered device mapper names for logical volumes.
-z: Do not show devices experiencing no activity.
-p device: Only show information regarding this device.

The iostat command’s two arguments allow viewing of the statistics over time. The [INTERVAL] argument specifies how many seconds between each display, and [COUNT] sets the number of times to display. Keep in mind that if you use the -y option, you will not see the first statistics until after the set interval.

An example using iostat with appropriate options for a troubleshooting situation is shown snipped in Listing 20.14. In this case, only two statistics are shown 5 seconds apart. You can see that there is a rather high I/O wait (%iowait column) percentage indicating a potential problem.

Listing 20.14: Troubleshooting I/O wait using the iostat utility

$ iostat -yNz 5 2
[…]
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          26.80    0.00   42.27   30.93    0.00    0.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             409.90         3.30      9720.72         16      47145
[…]
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.67    0.00   65.79   11.54    0.00    0.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             535.83         0.00      9772.77          0      48277
[…]

To locate the application or process causing high I/O, employ the iotop utility. It is typically not installed by default but is available in the iotop package.

For problems with high I/O, besides employing different disk technologies, you will want to review the Linux kernel’s defined I/O scheduling. I/O scheduling is a series of kernel actions that handle I/O requests and their associated activities. How these various operations proceed is guided by selecting a particular I/O scheduler, shown in Table 20.7, within a configuration file.

Table 20.7 I/O schedulers

Name	Description
`cfq`	Creates queues for each process and handles the various queues in a loop while providing read request priority over write requests. This scheduler is good for situations where more balance I/O handling is needed and/or the system has a multiprocessor.
`deadline`	Batches disk I/O requests and attempts to handle each request by a specified time. This scheduler is good for situations where increased database I/O and overall reduced I/O latency are needed, and/or an SSD is employed, and/or a real-time application is in use.
`noop`	Places all I/O requests into a single FIFO queue and handles them in order. This scheduler is good for situations where less CPU usage is needed and/or an SSD is employed.

The configuration file used for determining which I/O scheduler to use is in a directory associated with each disk. Listing 20.15 shows how to find the various disk directories and their associated scheduler file on a CentOS distribution.

Listing 20.15: Locating a disk’s scheduler file

# ls /sys/block
dm-0  dm-1  sda  sdb  sdc  sdd  sr0
#
# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
#

The scheduler used for the sda disk, deadline, is in brackets. To change the current I/O scheduler, you simply employ super user privileges and the echo command as shown in Listing 20.16.

Listing 20.16: Changing a disk’s scheduler file temporarily

# echo "cfq" > /sys/block/sda/queue/scheduler
#
# cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
#

If you determine that due to hardware limitations, a new and different hard drive is needed to handle the required I/O levels, the ioping utility can help you in the testing process. The ioping utility is typically not installed by default, but it is commonly available in a distribution’s standard repositories.

The ioping utility can destroy data on your disk! Be sure to thoroughly understand the command’s options before employing it. If you desire a safer alternative, take a look at the stress-ng tool. This utility allows you to conduct stress tests for disks, your network, a system’s CPUs, and so on.

Via the ioping command, you can test disk I/O latency, seek rates, and sequential speeds. You can also try out asynchronous, cache, and direct I/O rates.

A snipped example in Listing 20.17 shows a simple test that reads random data chunks (non-cached) from a temporary file using the ioping command.

Listing 20.17: Conducting a non-cached read test using the ioping utility

# ioping -c 3 /dev/sda
4 KiB <<< /dev/sda […]: request=1 time=20.7 ms (warmup)
4 KiB <<< /dev/sda […]: request=2 time=32.9 ms
4 KiB <<< /dev/sda […]: request=3 time=25.5 ms

--- /dev/sda (block device 15 GiB) ioping statistics ---
2 requests completed in 58.4 ms, 8 KiB read, 34 iops, 137.0 KiB/s
generated 3 requests in 2.03 s, 12 KiB, 1 iops, 5.92 KiB/s
min/avg/max/mdev = 25.5 ms / 29.2 ms / 32.9 ms / 3.72 ms
#

The added -c 3 option specifies three tests. More thorough ioping tests help to determine if a disk will work for a particular application’s needs.

Failing Disks

If a small chunk of an HDD or SSD will not respond to I/O requests, the disk controller marks it as a bad sector. When a bad sector is marked, typically the controller’s firmware will attempt to move the data from the marked sector to a new location and remap the logical sector to the new sector. Thus, the data is safe.

A random bad sector does not indicate a drive is failing. However, if you are seeing bad sectors more and more on your disk, then it needs to be replaced. Thus, you should monitor this situation.

If the drive has self-monitoring analysis and reporting technology (SMART), you can employ the smartctl utility to check on its health.

Occasionally a file on the drive loses its matching inode number (covered in Chapter 3), or some other corruption occurs. This leaves the data in place, but nothing can access it, and the problem must be repaired manually.

One utility that will allow you to check and repair an ext2, ext3, or ext4 filesystem is the fsck command (covered in Chapter 11). The disk partition must be unmounted before you can run the utility on it.

For a btrfs filesystem, use the btrfs check command to check and/or repair an unmounted btrfs drive. If you have an XFS filesystem, use the xfs_check utility to check the disk and xfs_repair to check and repair the drive.

Physical damage or wear can sometimes cause unusual sounds from a drive. You may hear clicking, grinding, or scratching noises. These indicate a drive is failing and should be replaced as soon as possible.

In the cases where you do need to replace a drive and rebuild your partition(s), keep in mind that you can use the partprobe command. This nice utility allows your system to reread a disk’s partition table without rebooting the system. You do need to use super user privileges to invoke it.

Troubleshooting the CPU

You need to correctly size your CPU(s) for your server application needs. An undersized processor will force you to obtain a new one, and an oversized processor will not be used to its full potential. Both waste money.

For troubleshooting, you need to understand your CPU(s) hardware—the number of cores, whether or not hyper-threading is used, cache sizes, and so on. You can easily view your system’s current processors’ information. Use the less utility and pass it the /proc/cpuinfo filename. The first processor listed in this file is shown as processor 0.

To look at CPU usage, you can employ the uptime command. It shows how long the system has been up and running, but even more important it displays CPU load averages. Load averages are the average amount of processes waiting for or using the CPU. For example, if you have single core processor, then a 2 load average would mean that typically a process is using the CPU while another process waits.

The uptime utility displays three load average numbers — a 1-, 5-, and 15-minute average. An example is shown in Listing 20.18.

Listing 20.18: Displaying load averages with the uptime utility

$ uptime
 15:12:41 up 54 min,  2 users,  load average: 0.95, 0.93, 0.90
$

This single-core CPU system has rather high load averages, which indicates a potential serious problem. First, check for a runaway process. If there is not one, you will want to investigate items such as interrupts from network and disks. The top utility can help here.

For a single-core CPU, a consistent load average above 0.70 indicates a problem. Consistent load averages of 1.00 are at emergency levels.

If you need to view CPU performance over time, the sar (acronym for system activity reporter) utility is useful. It’s typically installed by default on most distributions, but if you need to install it, use the sysstat package.

The sar utility uses data stored by the sadc program in the /var/log/sa/ directory, which contains up to a month’s worth of data. By default, it displays data from the current file. Used without any options or arguments, sar will display today’s stored CPU usage information in 10-minute intervals, as shown snipped in Listing 20.19.

Listing 20.19: Displaying CPU usage with the sar utility

$ sar -u
[…]
03:20:28 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
03:30:18 PM     all     32.15      0.00     67.85      0.00      0.00      0.00
03:40:01 PM     all     19.07      0.00     26.88      0.00      0.00     54.05
[…]
Average:        all     20.85      0.00     24.51      0.01      0.00     54.64
[…]

If your server is running multiple virtual machines, the %steal column of the sar utility output is handy. This column shows how much CPU is being utilized by virtual machines on the system.

You may be able to improve CPU performance by modifying certain kernel parameters via the sysctl utility. For example, if your server has multiple processors, they may experience jitter (similar to network jitter), causing spikes in their performance and resulting in application slowness (high latency). If you set it to 1, the skew_tick parameter can reduce this jitter.

Troubleshooting Memory

Processes use random access memory (RAM) to temporarily store data because it is faster to access than data stored on a disk. One form of this is disk buffering, which improves disk read performance. Data is read from the disk and stored for a period of time in a memory location called a buffer cache. Subsequent accesses of that data are read from memory rather than disk, which significantly improves performance.

The speed that memory provides to processes is so valuable that the Linux kernel maintains and administers shared memory areas. These shared segments allow multiple running programs to read/write from/to a common shared memory data area, which considerably speeds up process interactions.

You can see detailed information concerning your system’s RAM by viewing the /proc/meminfo file. To view shared memory segments, use the ipcs -m command. You can view memory statistics on a system using command-line tools such as free, sar, and vmstat.

Be aware that RAM bottlenecks often keep CPU usage artificially low. If you increase the RAM on your system, your processor loads may also increase.

Swapping

Memory is divided up into chunks called pages. When the system needs more memory, using a memory management scheme, it takes an idle process’s memory pages and copies them to disk. This disk location is a special partition called swap space or swap or virtual memory. If the idle process is no longer idle, its memory pages are copied back into memory. This process of copying memory pages to/from the disk swap space is called swapping.

If your system does not have properly sized memory, you should see high RAM usage via the free command. In addition, the system will increase swapping, which results in increased disk I/O. The vmstat tool is handy in this case because it will allow you to view disk I/O specific to swapping as well as total blocks in and blocks out to the device. An example of using the vmstat utility is shown in Listing 20.20.

Listing 20.20: Displaying virtual memory statistics with the vmstat utility

$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 3149092   3220 355180    0    0  1978    43  812  783 27 17 26 29  0
$

On a Linux system, swap space is either a disk partition or a file. A swap partition is a special disk partition that acts as the system’s swap space.

It is generally recommended that you do not store your swap partitions/files on SSDs. This is due to their limited life span and wear-leveling. Heavy swapping could cause an early SSD death.

A useful utility for viewing memory and determining if swap is a file or a partition is the swapon -s command. If it’s available, you can obtain the same information from the /proc/swaps file. An example of using the swapon -s command on a CentOS distribution is shown in Listing 20.21

Listing 20.21: Displaying a swap partition with the swapon utility

$ swapon -s
Filename         Type            Size    Used    Priority
/dev/dm-1        partition       1572860 0       -1
$

Notice that on this CentOS distro, the swap space is a partition. The priority column within the preceding example’s swap space statistics is shown as a negative one (-1). If there are multiple swap spaces, this priority number determines which swap is used first.

Another swapon -s example is enacted on an Ubuntu distribution and shown in Listing 20.22. Notice that in this case, the swap space is a file.

Listing 20.22: Displaying a swap file with the swapon utility

$ swapon -s
Filename        Type            Size    Used    Priority
/swapfile       file            483800  0       -2
$

During the Linux OS installation, typically a swap partition or file is created and added to the /etc/fstab configuration file. However, you may need to create additional swap partitions/files—for example, if you increase your system’s RAM.

If your swap partition is on a logical volume, add additional space via the LVM tools. Simply follow the steps covered in the section “Troubleshooting Storage Issues” earlier in this chapter as well as the following section. However, you can have more than one system swap space. In fact, in many cases it is desirable to do so for performance reasons.

For a new partition swap space, once you’ve created a new disk partition, use the mkswap command to “format” the partition into a swap partition. You use the same command for a new swap file and a logical volume. An example on a CentOS system, using super user privileges, is shown in Listing 20.23.

Listing 20.23: Making a swap partition with the mkswap command

# mkswap /dev/sde1
Setting up swapspace version 1, size = 1048572 KiB
no label, UUID=e5bd150a-2f06-42ed-a0a9-a7372abd9dee
#
# blkid /dev/sde1
/dev/sde1: UUID="e5bd150a-2f06-42ed-a0a9-a7372abd9dee" TYPE="swap"
#

Now that the swap partition or file has been properly prepared, activate it using the swapon command. The free command is very useful here because it provides a simple view of your current free and used memory. An example of these two commands is shown in Listing 20.24.

Listing 20.24: Activating a swap partition with the swapon command

# free -h
              total        used        free      shared  buff/cache   available
Mem:           3.7G        359M        3.0G        9.3M        352M        3.1G
Swap:          1.5G          0B        1.5G
#
# swapon /dev/sde1
#
# swapon -s
Filename           Type            Size    Used    Priority
/dev/dm-1          partition       1572860 0       -1
/dev/sde1          partition       1048572 0       -2
#
# free -h
              total        used        free      shared  buff/cache   available
Mem:           3.7G        366M        3.0G        9.3M        352M        3.1G
Swap:          2.5G          0B        2.5G
#

You can see that the swap space size has increased 1GB by adding a second swap partition. The -h option on the free command displays memory information in a more human-readable format.

The free command’s buffer cache output is displayed in the buff/cache column. Older versions of this command would show two columns for this data — buffers and cache. Linux divides up its buffer cache into distinct categories. Buffers are memory used by kernel buffers, while cache is memory used by process page caches and slabs (contiguous memory pages set aside for individual caches). The buff/cache column in the modern free command’s output is simply a summation of these two memory use categories.

If desired, change the new swap partition’s priority from its current negative two (-2) to a higher priority, using the swapon command as shown in Listing 20.25. A higher number designates that the swap partition is used before other partitions for swap.

Listing 20.25: Changing a swap priority with the swapon command

# swapoff /dev/sde1
#
# swapon -p 0 /dev/sde1
#
# swapon -s
Filename        Type            Size    Used    Priority
/dev/dm-1       partition       1572860 0       -1
/dev/sde1       partition       1048572 0       0
#

You must first use the swapoff command on the swap partition to disengage it from swap space. After that the swapon -p priority is used to change the preference priority. You can set priority to any number between 0 and 32767.

If you want to move your system to a new swap partition or file, do not use the swapoff command on a current swap partition/file until your new swap partition is added to swap space. Otherwise, you may end up with a hung system.

If all is well with the new swap partition, add it to the /etc/fstab file so it is persistent through system reboots. You can closely mimic the current swap file record’s settings, but be sure to change the name to your new swap partition/file.

Running Out of Memory

By default, the Linux kernel allows itself to overcommit memory to various processes. This is done for efficiency and performance. However, due to this allowance, the system can become very low on free memory. In a critical low-memory situation, Linux first reclaims old memory pages. If it doesn’t reclaim enough RAM to come out of a critical status, it employs the out of memory killer (also called the OOM killer).

When triggered, the OOM killer scans through the various processes using memory and creates a score. The score is based on the total memory a process (and its child processes) is using and the smallest number of processes that can be killed to come out of a critical low-memory status. The kernel, root, and crucial system processes are automatically given low scores. If a process has a high score, it is killed off. The OOM killer scans and kills off high scoring processes until the system is back to normal memory status.

If you want to modify the behavior of the OOM killer, you can do so via the following kernel parameters with the sysctl command: vm.panic_on_oom and kernel.panic.

You can force the kernel to prevent memory overcommit via the sysctl command, changing the vm.overcommit_memory kernel parameter from its default of 0 to 1. However, this may not be the best solution. In many systems, it is better to fine-tune the memory overcommit by setting vm.overcommit_memory to 2. This allows you to allocate as much memory as defined in another kernel parameter, the overcommit_ratio. In this case, when a process requests memory that causes system to exceed the set overcommit ratio, the allocation fails.

While you can modify kernel parameters with the sysctl command, they are not persistent. To make the settings persistent, make the appropriate edits to /etc/sysctl.conf or your distribution’s applicable sysctl configuration file.

Surviving a Lost root Password

Forgetting the root account’s password is troublesome for many reasons. The quick fix is to reset it via the passwd command using your own account’s super user privileges. However, if you were using the root account to gain super user privileges (which is a bad practice) or your privileges do not allow you to change the root password, you are in trouble. But all hope is not lost.

On older Linux distros and a few modern ones (Ubuntu), booting the system into single user mode will allow you to access the root account and change its password via the passwd command. To do so, follow these steps:

Boot (or reboot) the system. When the boot process reaches the boot menu, press the E key on the boot menu line you wish to edit (the kernel version the system typically runs).
Find the line that contains linux or linux16 via an arrow key.
Go to the line’s end, press the spacebar once, and type in 1.
Press Ctrl+X to boot the system.
Once the system is booted, press the Enter key if it states you are in Emergency or Rescue mode.
Change the root account’s password via the passwd command.
Reboot the system via the reboot command.

On some modern Linux systems, such as CentOS and Fedora distros, you’ll need a slightly different approach. Follow these steps:

Boot (or reboot) the system. When the boot process reaches the boot menu, press the E key on the boot menu line you wish to edit (the kernel version the system typically runs).
Find the line that contains linux or linux16 via an arrow key.
On that line, find ro. This is somewhere in the line’s middle.
Replace ro with rw init=/sysroot/bin/sh. Don’t replace anything else in that line but ro.
Press Ctrl+X to boot the system.
Once the system is booted, press the Enter key if it states you are in Emergency or Rescue mode.
Type chroot /sysroot to set up a jailed root environment.
Change the root account’s password via the passwd command.
If the system uses SELinux (CentOS and Fedora typically do), force SELinux to automatically relabel the system on the next boot by typing the touch /.autorelabel command.
Reboot the system via the reboot command. (Note: You may need to type exit and press Enter before trying to reboot your system.)
You’ll have to wait a while for the system to go through its SELinux relabel process.

Summary

Troubleshooting Linux performance issues requires planning ahead for adverse incidents. In addition, you must understand the interactions between the various system properties, such as processors, disks, networks, and memory. Properly sizing system components and configuring Linux will provide a more trouble-free environment.

Exam Essentials

Describe network troubleshooting tools. If your network is experiencing high latency, the tools to help troubleshoot this include iperf, iperf3, iftop, mtr, nc (Netcat), ping, ping6, ss, tracepath, tracepath6, traceroute, and traceroute6. These utilities also assist in detecting network saturation problems. If failing or faulty adapters are a problem, the tools to diagnose this issue are ethtool, ifconfig, ip, and netstat. These utilities along with nmcli also help with NIC configuration problems. For incorrect or duplicate MAC addresses in a router, employ the arp or ip neigh commands. To research slow or incorrect name server responses, the host, dig, nslookup, and whois utilities help.

Summarize potential disk problems and solutions. The du and df utilities help in preventing the system from running out of filesystem space and with troubleshooting when it does. If it is a logical volume, employ the LVM tools to add additional space when needed. I/O wait times, which may slow overall system performance, are seen with the iostat command. Changing a system’s I/O scheduler may help relieve this problem. The ioping utility tests a disk to determine if it is usable for a particular application. To repair an ext* filesystem, use the fsck command. The partprobe command works well for newly created partitions, in that it forces a re-read of a disk’s partition table without rebooting the system.

Clarify CPU troubleshooting procedures. It is important to understand your system’s current processors’ information, which you can find in the /proc/cpuinfo file. To view CPU usage, employ the uptime and/or the sar commands. If needed and appropriate, you can tweak kernel parameters related to processor handling via the sysctl utility.

Explain memory problems and solutions. To view detailed system RAM information, look at the /proc/meminfo file’s contents. If your system does not have properly sized memory, you can see high RAM usage via the free command. In addition, the vmstat tool allows you to view disk I/O specific to swapping, which increases when RAM is improperly sized. If you need to add additional swap space, the mkswap utility will “format” a partition/file into swap, and the swapon command will put it into swap space. If you need to uncouple a partition/file from swap space, use the swapoff utility. If memory use hits critical levels, the kernel releases the OOM killer, which kills off particular processes to bring memory usage back to reasonable levels. Memory management can be modified via certain kernel parameters and the sysctl tool.

Review Questions

Which of the following is true concerning network sockets? (Choose all that apply.)
1. Numbers used to identify which service is transmitting data
2. A single endpoint of a network connection’s two endpoints
3. Uses a combination of an IP address and a port number
4. Endpoints between processes on a local system
5. Provides better IPC than localhost
The system administrator, Preston, has noticed that the IPv4 network seems sluggish. He decides to run some tests to check for high latency. Which of the following utilities should he use? (Choose all that apply.)
1. iperf
2. ping
3. ip neigh
4. dig
5. traceroute
Mr. Scott has formulated a problem cause theory that routers are saturated with traffic and dropping TCP packets from their queues. Which of the following tools should he employ to test this theory? (Choose all that apply.)
1. mtr
2. ifconfig
3. ethtool -s
4. tracepath
5. traceroute
The network engineer, Keenser, believes the choices of name servers in the system’s /etc/resolv.conf file are inefficient. Which of the following tools can he employ to test new server choices?
1. dnsmasq
2. whois
3. nmap
4. nslookup
5. ipset list
Mera, a Linux system admin, believes a new application on her system is producing too much I/O for a particular partition, causing the system’s processor to appear sluggish. Which tool should she use to test her problem cause theory?
1. iostat
2. ioping
3. du
4. df
5. iotop
From analysis, Arthur believes the system’s I/O throughput will improve by changing the I/O scheduler. On his system is a real-time application, which uses a database located on a solid-state drive. Which I/O scheduler should Arthur choose?
1. scheduler
2. deadline
3. queue
4. cfq
5. noop
Using the uptime command, you will see CPU load averages in what increments? (Choose all that apply.)
1. 1 minute
2. 5 minutes
3. 10 minutes
4. 15 minutes
5. 20 minutes
Mary wants to view her system’s processor performance over time. Which is the best utility for her to employ?
1. uptime
2. sysstat
3. sar
4. cat /proc/cpuinfo
5. sysctl
Gertie needs to determine a swap space element’s type, name, and priority. Which command should she use?
1. vmstat
2. free
3. fstab
4. swapoff
5. swapon -s
Elliot is administering a Linux system that has multiple swap spaces. One is on a logical volume, but it needs more space to accommodate additional RAM that is to be installed in the near future. What is the best way for Elliot to add swap space?
1. Add a partition and format it with mkswap command.
2. Add a file and format it with mkswap command.
3. Add a partition via the swapon utility.
4. Add a file via the swapon utility.
5. Use LVM tools to increase the logical volume.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 20 Analyzing System Properties and Remediation

Create new playlist

Sign In

Sign Up