21
SYSTEM PERFORMANCE AND MONITORING

Even if “it’s slow!” isn’t the most dreaded phrase a system administrator can hear, it’s pretty far up on the list. The user doesn’t know why the system is slow and probably can’t even quantify or qualify the problem any further than that. It just feels slow. Usually there’s no test case, no set of reproducible steps, and nothing particularly wrong. A slowness complaint can cause hours of work as you dig through the system trying to find problems that might or might not even exist.

One phrase is more dreadful still, especially after you’ve invested those hours of work: “it’s still slow.”

An inexperienced sysadmin accelerates slow systems by buying faster hardware. This exchanges “speed problems” for costly parts and even more expensive sysadmin time. Upgrades often conceal problems rather than fix them, leave the hardware you already own underused, and sometimes don’t solve the problem at all.

You can frequently solve performance problems by tweaking the software that’s causing them. Your WordPress site is slow? Investigate a PHP opcode cache, or caching frequently used data in memcached. FreeBSD is only one layer of your application stack, so be sure to give the other layers proper attention.

FreeBSD includes many tools designed to help you examine system performance and provide the information necessary to learn what’s actually slowing things down. Some of them, such as dtrace(1), are highly complicated and require extensive knowledge of the system, the software, and a book of their own. Once you understand where a problem is, identifying the solution to the problem becomes much simpler. You might actually need faster hardware, but sometimes shifting system load or reconfiguring software might solve the problem at much less expense. In either case, the first step is understanding the problem.

Computer Resources

Performance problems are usually caused by running more tasks than the computer can handle. That seems obvious, but think about it a moment. What does that really mean?

A computer has four basic resources: input/output, network bandwidth, memory, and CPU. If any one of them is filled to capacity, the others can’t be used to their maximum. For example, your CPU might very well be waiting for a disk to deliver data or for a network packet to arrive. If you upgrade your CPU to make your system faster, you’ll be disappointed. Buying a whole new server might fix the problem, but only by expanding the existing bottleneck. The new system probably has more memory, faster disks, a better network card, and faster processors than the old one. You have deferred the problem until the performance reaches some new limit. However, by identifying where your system falls short and addressing that particular need, you can stretch your existing hardware much further. After all, why purchase a whole new system when a few gigabytes of relatively inexpensive memory would fix the problem? (Of course, if your goal is to retire this “slow” system to make it your new desktop, that’s another matter.)

Input/output is a common bottleneck. System busses have a maximum throughput, and while you might not be pushing your disk or your network to their limits, you might be saturating the bus by continually bombarding both.

One common cause of system slowdowns is running multiple large programs simultaneously. Not only does disk I/O become saturated, but the processors might spend the majority of their time waiting to swap data between the on-CPU cache and the memory. For example, I once thoughtlessly scheduled a massive database log rotation that moved and compressed gigabytes of data at the same time as the daily periodic(8) run. Since the job required shutting down the main database and caused application downtime, speed was crucial. Both the database job and the periodic(8) run slowed unbearably. Rescheduling one of them made both jobs go more quickly.

FreeBSD has some features that improve performance. Doing lots of cryptographic operations? Use the aesni(4) kernel module. Database is disk bound? Consider the filesystem block size. ZFS pool slow? Maybe you need an add-on cache. Identifying what you should change requires a hard look at the system, however.
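
For example, if you decide the aesni(4) module would help, you might load it now and at every boot with something like this (a minimal sketch; it only helps on CPUs with AES-NI support, and the loader.conf line takes effect at the next boot):

# kldload aesni
# echo 'aesni_load="YES"' >> /boot/loader.conf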

We’re going to look at several FreeBSD tools for examining system performance. Armed with that information, we’ll consider how to fix performance issues. Each potential bottleneck can be evaluated with the proper tools. FreeBSD changes continually, so later systems might have new tuning options and performance features. Read tuning(7) on your system for current performance tips.

Checking the Network

If you’re concerned about network performance, measure it. Consult netstat -m and netstat -s, and look for errors or places where you’re out of memory or buffers. These are instantaneous snapshots, but for the network, you really need to evaluate congestion and latency over minutes, hours, and even days. The network team probably has a tool like Cacti, Zabbix, or Graphite to observe long-term performance.1 Ask them for information. Combine what these tools provide with your instantaneous snapshots. If the average per-minute throughput on your 10-gig Ethernet is only 5 gigabits per second, but your instantaneous measurements show frequent spikes up to the full 10 gigabits, you probably have really bursty connectivity.

Some network cards can better handle a full network in polling mode. Polling tells the network card to stop sending frames up to the operating system as they arrive and instead let the operating system visit every so often to collect the frames. Check your network card’s man page to see whether it supports polling. Enable and disable polling with ifconfig(8).
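
As a sketch, toggling polling on a hypothetical em0 interface looks like this; note that polling also requires a kernel built with options DEVICE_POLLING:

# ifconfig em0 polling
# ifconfig em0 -polling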

A heavily loaded network might benefit from a different congestion control algorithm. FreeBSD provides several TCP congestion control algorithms. Look for files beginning with cc_ in /boot/kernel; these are congestion control modules. Each has a man page.

View the currently loaded congestion control algorithms with the sysctl net.inet.tcp.cc.available.

# sysctl net.inet.tcp.cc.available
net.inet.tcp.cc.available: newreno

New Reno is the traditional congestion control algorithm. The congestion control kernel modules on this system include CDG, CHD, CUBIC, DCTCP, HD, H-TCP, and Vegas. The H-TCP algorithm is specifically designed for long-distance, high-bandwidth applications. Let’s enable it.

# kldload /boot/kernel/cc_htcp.ko
# sysctl net.inet.tcp.cc.available
net.inet.tcp.cc.available: newreno, htcp

We now have H-TCP available in the kernel. Enable it with the net.inet.tcp.cc.algorithm sysctl.

# sysctl net.inet.tcp.cc.algorithm=htcp
net.inet.tcp.cc.algorithm: newreno -> htcp
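
To keep this setting across reboots, you’d typically load the module from loader.conf and set the sysctl in sysctl.conf. A minimal sketch:

# echo 'cc_htcp_load="YES"' >> /boot/loader.conf
# echo 'net.inet.tcp.cc.algorithm=htcp' >> /etc/sysctl.conf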

Ultimately, you can’t fit 10 pounds of bandwidth in a 5-pound circuit. If your saturated Ethernet is crippling your applications, turn off unnecessary network services or add more bandwidth.

Other system conditions are much more complicated. Start by checking where the problem lies with vmstat(8).

General Bottleneck Analysis with vmstat(8)

FreeBSD includes several programs for examining system performance. Among these are vmstat(8), iostat(8), and systat(1). We’ll discuss vmstat(8) because I find it most helpful; iostat(8) is similar to vmstat(8), and systat(1) provides the same information in an ASCII graphical format.

Use vmstat(8) to see the system’s current virtual memory statistics. While the output takes getting used to, vmstat(8) is very good at showing large amounts of data in a small space. Type vmstat at the command prompt and follow along.

# vmstat
procs  memory      page                    disks        faults     cpu
r b w  avm   fre   flt  re  pi  po    fr   sr ad0 ad1   in    sy    cs us sy id
8 0 0 1.3G   26G   157   0   1   0   172    1   0   0   12   212   149  0  0 100

vmstat(8) divides its display into six sections: process (procs), memory, paging (page), disks, faults, and cpu. We’ll look at all of them quickly and then discuss in detail the parts that matter most for investigating performance issues. This single line represents the average values for the whole time the system has been running. We’ll get more real-time data in the next section.

Processes

vmstat(8) has three columns under the procs heading. Technically, vmstat counts threads rather than processes. Unthreaded applications have one thread per process, but your multithreaded application could have far, far more.

r The number of runnable threads, including those currently running and those waiting for CPU time. One thread per CPU is fine; it means your hardware is fully utilized. More than that means your CPU is a bottleneck. Some programs demand all the processor the host has and more, though; check that you’re not running such a remorseless compute suck.

b The number of threads that are blocked waiting for system input or output—generally, waiting for disk access. These threads will run as soon as they get their data. If this number is high, your disk is the bottleneck.

w The number of threads that are runnable but are entirely swapped out. If you regularly have processes swapped out, the system’s memory is inadequate for the host’s workload.

This host has averaged eight runnable threads since boot, but zero waiting on I/O or memory. If you’re getting complaints that this host is slow, the first place to check is processor utilization. Is someone, say, building FreeBSD from source just to generate interesting output for a book’s performance chapter, while real people are attempting to do their jobs on the same system?

Memory

FreeBSD breaks memory up into uniform-sized chunks called pages. When a program requests memory, it gets assigned a number of pages. The size of a page is hardware- and OS-dependent but appears in the hw.pagesize sysctl. On FreeBSD’s i386 and amd64 platforms, a page is 4KB. The system treats each page as a whole—if FreeBSD must shift memory into swap, for example, it does that on a page-by-page basis. The kernel thread that manages memory is called the pagedaemon. The memory section has two columns.

avm The amount of active virtual memory, meaning the virtual memory pages currently in use. If this value is abnormally high or increasing, your system is actively consuming swap space.

fre The number of memory pages available for use. If this value is abnormally low, you have a memory shortage.

Our sample output is using 1.3GB of RAM and has 26GB free. Memory isn’t an issue.

Paging

The page section shows how hard the virtual memory system is working. The inner workings of the virtual memory system are an arcane science that I won’t describe in detail here.2

flt The number of page faults, where information needed wasn’t in real memory and had to be fetched from swap space or disk.

re The number of pages that have been reclaimed or reused from cache.

pi Short for pages in; this is the number of pages being read into real memory from swap or disk.

po Short for pages out; this is the number of pages being written from real memory out to swap.

fr How many pages are freed per second.

sr How many pages are scanned per second.

Moving memory into swap isn’t bad, but consistently recovering paged-out memory indicates a memory shortage. Having high fr and flt values can indicate lots of short-lived processes—for example, a script that starts many other processes or a cron job scheduled too frequently. Or perhaps someone’s been running make -j16 buildworld. A high sr probably means you don’t have enough memory, as the pagedaemon is constantly trying to free memory. The paging daemon normally runs once a minute or so, but a high sr count means you’re probably trying to do more work than your RAM can hold.

Disks

The disks section shows each of your disks by device name. The number shown is the number of disk operations per second, a valuable clue to determining how well your disks are handling their load. You should divide your disk operations between different disks whenever possible and arrange them on different buses when you can. If one disk is obviously busier than the others, and the system has operations waiting for disk access, consider moving some frequently accessed files from one disk to another. One common cause of high disk load is a coredumping program that can restart itself. For example, a faulty CGI script that dumps core every time someone clicks on a link will greatly increase your disk load.

If you have a lot of disks, you might notice that they don’t all appear on the vmstat display. Designed for an 80-column display, vmstat(8) can’t list every disk on a large system. If, however, you have a wider display and don’t mind exceeding the 80-column limit, use the -n flag to set the number of drives you want to display.

Faults

Faults aren’t inherently bad; they’re just received system traps and interrupts. An abnormally large number of faults is bad, of course—but before you tackle this problem, you need to know what’s normal for your system.

The first line of vmstat output shows the average faults per second since system boot.

in The number of system interrupts (IRQ requests) received

sy The number of system calls

cs The number of context switches in the last second, or a per-second average since the last update. (For example, if you have vmstat update its display every five seconds, this column displays the average number of context switches per second over the last five seconds.)

This host has averaged 12 interrupts, 212 system calls, and 149 context switches per second since boot. How does that compare to what you saw when the system was working normally?

CPU

Finally, the cpu section shows how much time the system spent doing user tasks (us), system tasks (sy), and how much time it spent idle (id). top(1) presents this same information in a friendlier format, but only for the current time, whereas vmstat lets you view system utilization over time.

Using vmstat

So, how do you use all this information? Start by checking the first three columns to see what the system is waiting for. If you’re waiting for CPU access (the r column), then you’re short on CPU horsepower. If you’re waiting for disk access (the b column), then your disks are the bottleneck. If you’re swapping (the w column), you’re short on memory. Use the other columns to explore these three types of shortages in more detail.

Continuous vmstat

You’re probably more interested in what’s happening over time, rather than in a brief snapshot of system performance. Give vmstat(8) the -w flag and a number of seconds to run it as an ongoing display that updates at that interval. FreeBSD shows average values since the last update, updating counters continuously:

# vmstat -w 5
procs  memory       page                    disks     faults         cpu
r b w  avm   fre   flt  re  pi  po    fr   sr ad0 ad1   in    sy    cs us sy id
8 0 0 1.6G   25G   415   0   1   0   432    6   0   0   12   281   157  1  0 99
8 0 0 2.4G   24G 53089   0   7   0 11188  561  11   8   45  8789   994 96  4  0
8 0 0 2.5G   24G 44600   0   3   0 38703  741  10   9   49  8806  1032 96  3  1
8 0 0 2.2G   24G 42841   0  15   0 58044  717  11   9   52 10271  1103 96  4  0
--snip--

The first line still shows the averages since boot. Every five seconds, however, an updated line appears at the end. You can sit there and watch how your system’s performance changes when scheduled jobs kick off or when you start particular programs. Hit CTRL-C when you’re done. In this example, processes are always waiting for CPU time (as shown by the stack of 8s in the r column), and we frequently have something waiting for disk access.

An occasional wait for a system resource doesn’t mean you must upgrade your hardware; if performance is acceptable, don’t worry about it. Otherwise, however, you must look further. The most common culprit is the storage system.

Disk I/O

Disk speed is a common performance bottleneck, especially with spinning disks, but even flash-based storage can get slow. Programs that must repeatedly wait for disk activity to complete run more slowly. This is commonly called blocking on disk, meaning that the disk is preventing program activity. The only real solution for this is to use faster disks, install more disks, or reschedule the load.

While FreeBSD provides several tools to check disk activity, my favorite is gstat(8), so we’ll use that. You can run gstat without arguments for a display of all of your disks and partitions that updates every second or so. If you have many disks, though, this can generate a whole bunch of zeros. I always use the -a flag so that gstat(8) displays only disks with activity. The -p flag, which shows only whole disks, is also useful, but I prefer a per-partition view.

# gstat -a
dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps ms/r    w/s   kBps ms/w   %busy Name
   0    120      0      0    0.0    118    331    0.1   12.1| ada1
   0    120      0      0    0.0    118    331    0.1   12.1| ada1p1
   0     21      0      0    0.0     19    351    0.4    8.2| da1
   0     20      0      0    0.0     18    331    0.1   12.1| gpt/zfs4
   0     21      0      0    0.0     19    351    0.4    8.2| da1p1
   0     21      0      0    0.0     19    351    0.4    8.2| gpt/zfs7

We get a line for each disk device, slice, and partition, and various information for each. gstat(8) shows all sorts of good stuff, such as the number of reads per second (r/s), writes per second (w/s), the kilobytes per second of reading and writing, as well as a friendly-looking %busy column.

Ignore most of these. Some, such as the percent busy column, use sloppy measuring methods; the FreeBSD developers chose disk performance over accuracy of statistical measurements. The columns that do matter are ms/r (milliseconds per read) and ms/w (milliseconds per write). These numbers are accurate. Measure and monitor them. If one disk has really high activity, but another is idle, consider dividing what’s on that disk between multiple disks or using striped storage. Or, if it’s your laptop, consider accepting that this is as fast as your storage system gets.

Once you identify the scarce system resource, you need to figure out what program’s draining that resource. We’ll need other tools for that.

CPU, Memory, and I/O with top(1)

The top(1) tool provides a decent overview of system status, displaying information about CPU, memory, and disk usage. Just type top to get a full-screen display of system performance data. The display updates every two seconds, so you have a close to real-time system view. Even if you reduce the update interval to one second, you can still miss short-lived, resource-sucking processes.
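
You can set the delay between updates yourself; for example, this asks top(1) to refresh every second:

# top -s 1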

The output of top(1) is split into two halves. The upper portion gives basic system information, while the bottom gives per-process data.

last pid: 84111;  load averages:  0.09,  0.21,  0.20               up 7+07:58:00  14:41:09
28 processes:  2 running, 26 sleeping
CPU:  0.0% user,  0.0% nice,  0.9% system,  0.0% interrupt, 99.1% idle
Mem: 80M Active, 642M Inact, 124M Laundry, 222M Wired, 17M Free
Swap: 1024M Total, 83M Used, 941M Free, 8% Inuse

  PID USERNAME       THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
  479 bind             4  20    0 99444K 35956K kqread   6:55   0.00% named
  586 root             1  20    0   154M 33768K select   4:54   0.00% perl
  562 root             1  20    0 22036K 13948K select   1:27   0.00% ntpd
--snip--

Very tightly packed, isn’t it? The top(1) tool crams as much data as possible into a standard 80 × 25 terminal window or X terminal. Let’s take this apart and learn how to read it. We’ll start with the upper part, which can look a little different depending on whether you’re using UFS or ZFS.

UFS and top(1)

The per-host information in the upper part of top(1) varies slightly between ZFS and UFS hosts, but we’ll start with UFS and then explain the differences.

PID Values

Every process on a Unix machine has a unique process ID (PID). Whenever a new process starts, the kernel assigns it a PID one greater than the previous process. The last PID value is the last process ID assigned by the system. In the previous example, our last PID is 84,111. The next process created will be 84,112, then 84,113, and so on. Watch this number to see how fast the system changes. If the system is running through PIDs more quickly than usual, it might indicate a process forking beyond control or something crashing and restarting.

Load Average

The load average is a somewhat vague number that offers a rough impression of the amount of CPU load on the system. The load average is the average number of threads waiting for CPU time. (Other operating systems have different load average calculation methods.) An acceptable load average depends on your system. If the numbers are abnormally high, you need to investigate. Some hosts feel bogged down at a load average of 3, while some modern systems are still snappy with what look like ridiculously high load averages. Again, what’s normal for this host?

You’ll see three load averages. The first (0.09 here) is the load average over the last minute, the second (0.21) is for the last five minutes, and the last (0.20) is for the last 15 minutes. If your 15-minute load average is high, but the 1-minute average is low, you had a major activity spike that has since subsided. On the other hand, if the 15-minute value is low but the 1-minute average is high, something happened within the last 60 seconds and might still be going on now. If all of the load averages are high, the condition has persisted for at least 15 minutes.

Uptime

The last entry on the first line is the uptime, or how long the system has been running. This system has been running for 7 days, 7 hours, and 58 minutes, and the current time is 14:41:09. I’ll leave it up to you to calculate what time I booted this system.

Process Counts

On the second line, you’ll find information about the processes currently running on the system. Running processes are actually doing work—they’re answering user requests, processing mail, or doing whatever your system does. Sleeping processes are waiting for input from one source or another; they’re just fine. You should expect a fairly large number of sleeping processes at any time. Processes in other states are usually waiting for a resource to become available or are hung in some way. Large numbers of nonsleeping, nonrunning processes hint at trouble. The ps(1) command can show the state of all processes.

Process Types

The CPU states line indicates what percentage of available CPU time the system spends handling different types of processes. It shows five different process types: user, nice, system, interrupt, and idle.

The user processes are average everyday programs—perhaps daemons run by root, or commands run by regular users, or whatever. If it shows up in ps -ax, it’s a user process.

The nice processes are user processes whose priority has been deliberately manipulated. We’ll look at this in detail in “Reprioritizing with Niceness” on page 543.

The system value gives the total percentage of CPU time FreeBSD spends running kernel code, whether in kernel processes or on behalf of userland processes making system calls. These include things such as virtual memory handling, networking, writing to disk, debugging with INVARIANTS and WITNESS, and so on.

The interrupt value shows how much time the system spends handling interrupt requests (IRQs).

Last, the idle entry shows how much time the system spends doing nothing. If your CPU regularly has a very low idle time, you might want to think about rescheduling jobs or getting a faster processor.

Memory

The Mem line represents the usage of physical RAM. FreeBSD breaks memory usage into several different categories.

Active memory is the total amount of memory in use by user processes. When a program ends, the memory it had used is placed into inactive memory. If the system runs this program again, it can retrieve the software from memory instead of disk.

Free memory is totally unused. It might be memory that has never been accessed, or it might be memory released by a process. This system has 17MB of free RAM. If you have a server that’s been up for months, and it still has free memory, you might consider putting some of that RAM in a machine that’s hurting for memory.

Memory in the laundry queue is dirty; it must be written back to swap or to its backing file before the system can reuse it.

FreeBSD 11 shuffles memory between the inactive, cache, and free categories as needed to maintain a pool of available memory. Memory in the inactive pool is most easily transferred to the free pool. When cache memory gets low and FreeBSD needs still more free memory, it picks pages from the inactive pool, verifies that it can use them as free memory, and moves them to the free pool. FreeBSD tries to keep the total number of free pages above the sysctl vm.v_free_target.

FreeBSD 12 has no cache and handles low memory situations a little differently. When free memory gets low, the pagedaemon picks pages from the inactive pool. If an inactive page needs to be synced to disk, it’s placed on the laundry queue, and the pagedaemon tries another inactive page. One sign that a host needs more RAM is a pagedaemon that keeps accumulating CPU time from all this work.

On either FreeBSD version, having free memory doesn’t mean that your system has enough memory. If vmstat(8) shows that you’re swapping at all, you’re using more physical memory than you have. You might have a program that releases memory on a regular basis. Also, FreeBSD will move some pages from inactive to free in an effort to maintain a certain level of free memory.

FreeBSD uses wired memory for in-kernel data structures, as well as for system calls that must have a particular piece of memory immediately available. Wired memory is never swapped or paged. All memory used by ZFS is wired.

Swap

The Swap line gives the total swap available on the system and how much is in use. Swapping is using the disk drive as additional memory. We’ll look at swap in more detail later in the chapter.

ZFS and top(1)

The output of top(1) on a ZFS system looks superficially similar, but the handling of memory has important differences.

last pid: 53202;  load averages:  0.26,  0.28,  0.30    up 1+15:41:48  13:50:54
120 processes: 1 running, 119 sleeping
CPU:  0.1% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Mem: 288M Active, 205M Inact, 3299M Wired, 137M Free
ARC: 2312M Total, 458M MFU, 1626M MRU, 420K Anon, 38M Header, 189M Other
     1918M Compressed, 8885M Uncompressed, 4.63:1 Ratio
Swap: 2048M Total, 126M Used, 1922M Free, 6% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
53202 mwlucas       1  20    0 20124K  3388K CPU0    0   0:00   0.08% top
  835 mysql        26  23    0   629M   219M select  1  62:42   0.03% mysqld
53151 www           1  20    0   237M 12924K select  2   0:00   0.03% httpd
  863 nobody        7  20    0 34928K  4960K kqread  0   0:31   0.02% memcached
53058 www           1  20    0   239M 13296K lockf   0   0:00   0.01% httpd
  852 root          1  20    0   166M 11716K kqread  2   0:05   0.01% php-fpm
--snip--

The Mem section lists the same memory categories familiar from the UFS output, such as Active, Inactive, Wired, and Free.

The ARC line represents ZFS’s Adaptive Replacement Cache. The Total field shows the amount of memory the entire ARC uses. Within the 2,312MB used by the cache, 458MB are in the Most Frequently Used (MFU) cache, while 1,626MB are in the Most Recently Used (MRU) cache. You’ll also see much smaller entries for ZFS internal data structures, such as anonymous buffers (Anon), ZFS headers (Header), and the ever-useful Other.

ZFS compresses the ARC, exchanging plentiful CPU time for scarce memory. You can see the amount of space used by compressed and uncompressed cached data.

ZFS is greedy for memory, provided nothing else wants it. ZFS aggressively caches data read from and written to disk. This host has 4,096MB of RAM, and ZFS has claimed 2,312MB of that. You’ll see that this host has only 137MB free. If a program requests memory and the system doesn’t have it available, ZFS will release some of its cache back to the system. If you see a high wired memory level, remind yourself that all memory claimed by ZFS goes into the “wired” bucket.

This is a long-winded way of saying, “Don’t let apparent high ZFS memory usage worry you.” Worry only if the host starts paging and swapping.

More interesting is the list of processes that are using that memory.

Process List

Finally, top(1) lists the processes on the system and their basic characteristics. The table format is designed to present as much information as possible in as little space as possible. Every process has its own line.

PID First, we have the process ID number, or PID. Every running process has its own unique PID. When you use kill(1), specify the process by its PID. (If you don’t know the PID of a process, you can use pkill(1) to kill the process by its name.)

Username Next is the username of the user running the process. If multiple processes consume large amounts of CPU or memory, and they’re all owned by the same user, you know whom to talk to.

Priority and niceness The PRI (priority) and NICE columns are interrelated and indicate how much precedence the system gives each process. We’ll talk about priority and niceness a little later in this chapter.

Size SIZE gives the amount of memory that the process has requested.

Resident memory The RES column shows how much of a program is actually in memory at the moment. A program might request a huge amount of memory but use only a small fraction of that at any time. The kernel is smart enough to give programs what they need rather than what they ask for.

State The STATE column shows what a process is doing at the moment. A process can be in a variety of states—waiting for input, sleeping until something wakes it, actively running, and so on. You can see the name of the event a process is waiting on, such as select, pause, or ttyin. On an SMP system, when a process runs, you’ll see the CPU it’s running on.

Time The TIME column shows the total amount of CPU time the process has consumed.

WCPU The weighted CPU (WCPU) usage shows the percentage of CPU time that the process uses, adjusted for the process’s priority and niceness.

Command Finally, we have the name of the program that’s running.

Looking at top(1)’s output gives you an idea of where the system is spending its time.

Not every process on a host is actively engaged in work. You might have dozens or hundreds of daemons sitting idle. Enter i on a running top(1) display to toggle displaying idle processes, or start top with the -I flag to hide them. To show individual threads, either toggle H or add the -H flag.

By default, top sorts its output by weighted CPU usage. You can also sort output by priority, size, and resident memory. Enter o at a running top display. Enter the name of the column you want to sort by. This will help identify self-important programs or those using too much memory.
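
You can also pick the sort order when you start top(1). For example, this sorts the display by resident memory from the command line (res is one of the field names top(1) accepts):

# top -o res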

top(1) and I/O

In addition to the standard CPU display, top(1) has an I/O mode that displays which processes are using the disk most actively. While top(1) is running, hit m to enter the I/O mode. The upper portion of the display still shows memory, swap, and CPU status, but the lower portion changes considerably.

  PID USERNAME     VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
 3064 root           89      0     89      0      0     89 100.00% tcsh
  767 root            0      0      0      0      0      0   0.00% nfsd
 1082 mwlucas         2      1      0      0      0      0   0.00% sshd
 1092 root            0      0      0      0      0      0   0.00% tcsh
  904 root            0      0      0      0      0      0   0.00% sendmail
--snip--

The PID is the process ID, of course, and the USERNAME column shows who is running the process.

VCSW stands for voluntary context switches; this is the number of times this process has surrendered the system to other processes. IVCSW means involuntary context switches and shows how often the kernel has told the process, “You’re done now. It’s time to let someone else run for a while.”

Similarly, READ and WRITE show how many times the system has read from disk and written to disk. The FAULT column shows how often this process has had to pull memory pages from disk, which makes for another sort of disk read. These last three columns are aggregated in the TOTAL column.

The PERCENT column shows what percent of disk activity this process is using. Unlike gstat(8), top(1) displays each process’s utilization as a percentage of the actual disk activity, rather than the possible disk activity. If you have only one process accessing the disk, top(1) displays that process as using 100 percent of disk activity, even if it’s sending only a trickle of data. While gstat(8) tells you how busy the disk is, top(1) tells you what’s generating that disk activity and where to place the blame. Here, we see that process ID 3064 is generating all of our disk activity. It’s a tcsh(1) process, also known as “some user’s shell.” Let’s track down the miscreant.
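
If you already know you’re chasing disk activity, you can start top(1) directly in I/O mode rather than toggling it interactively:

# top -m io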

Following Processes

On any Unix-like system, every userland process has a parent-child relationship with other processes. When FreeBSD boots, it creates a single process by starting init(8) and assigning it PID 1. This process starts other processes, such as the /etc/rc startup script and the getty(8) program that handles your login request. These processes are children of process ID 1. When you log in, getty(8) starts login(8), which fires up a new shell for you, making your shell a child of the login(8) process. Commands you run are either children of your shell process or part of your shell. You can view these parent-child relationships with ps(1) using the -ajx flags (among others).

# ps -ajx
USER      PID  PPID  PGID  SID JOBC STAT TT         TIME COMMAND
root        0     0     0    0    0 DLs   -      6:26.23 [kernel]
root        1     0     1    1    0 ILs   -      0:00.09 /sbin/init --
root        2     0     0    0    0 DL    -      0:00.00 [crypto]
--snip--
root      845     1   845  845    0 Is    -      0:00.00 /usr/sbin/sshd
root      849     1   849  849    0 Ss    -      0:01.05 /usr/sbin/cron -s
root     8632   845  8632 8632    0 Is    -      0:00.09 sshd: mwlucas [priv] (sshd)
mwlucas  8634  8632  8632 8632    0 S     -      0:00.53 sshd: mwlucas@pts/0,pts/1 (sshd)
mwlucas  8687     1  8687 8687    0 Ss    -      0:25.90 tmux: server (/tmp/tmux-1001/default) (tmux)
--snip--

At the far left, we have the username of the process owner and then the PID and parent PID (PPID) of the process. This is the most useful thing we see here, but we’ll briefly cover the other fields.

The PGID is the process group ID number, which is normally inherited from its parent process. A program can start a new process group, and that new process group will have a PGID equal to the process ID. Process groups are used for signal processing and job control. A session ID, or SID, is a grouping of PGIDs, usually started by a single user or daemon. Processes may not migrate from one SID to another. JOBC gives the job control count, indicating whether the process is running under job control (that is, in the background).

STAT shows the process state—exactly what the process is doing at the moment you run ps(1). Process state is very useful as it tells you whether a process is idle, what it’s waiting for, and so on. I highly recommend reading the section on process state from ps(1).

TT lists the process’s controlling terminal. This column shows only the end of the terminal name, such as v0 for ttyv0 or p0 for pts/0. Processes without a controlling terminal show a dash (-).

The TIME column shows how much processor time the process has used, both in userland and in the kernel.

Finally, we see the COMMAND name, as it was called by the parent process. Processes in square brackets are actually kernel threads, not real processes. FreeBSD runs a whole bunch of kernel threads.

So, how can this help us track a questionable process? In our top(1) I/O example, we saw that process 3064 was generating almost all of our disk activity. Run ps -ajx to look for this process:

USER      PID  PPID  PGID  SID JOBC STAT TT         TIME COMMAND
--snip--
root     3035  3034  3035  2969    1 S+    p0    0:00.03 _su -m (tcsh)
bert     2981  2980  2981  2981    0 Is    p1    0:00.03 -tcsh (tcsh)
root     2989  2981  2989  2981    1 I     p1    0:00.01 su -m
root     2990  2989  2990  2981    1 D     p1    0:00.05 _su -m (tcsh)
root     3064  2990  3064  2981    1 DV+   p1    0:00.15 _su -m (tcsh)
mwlucas  2996  2995  2996  2996    0 Is    p2    0:00.02 -tcsh (tcsh)
--snip--

Our process of interest is owned by root and is a tcsh(1) instance, just as top’s I/O mode said. The command is running under su(1), however. Check this process’s parent process ID with the PPID column, and you’ll see that process 3064 is a child of process 2990, which is a child of process 2989, both of which are owned by root. Process 2989 is a child of 2981, however, which is a shell run by a real user. You might also note that these processes are all parts of session 2981, showing that they’re probably all run in the same login session. The TT column shows p1, which means that the user is logged in on pts/1, the second pseudo-terminal on this machine. Investigating that SID would illuminate just what Bert thought he was doing.

Now that you know how process parent-child operations work, you can cheat. Add the -d flag, as in ps -ajxd, to present processes arranged in a tree beneath their parents. You’ll want a wide terminal.

It’s normal for a system to experience brief periods of total utilization. If nobody else is using the system and nobody’s complaining about performance, why not let this user run his job? If this process is causing problems for other users, however, we can either deprioritize it, use our root privileges to kill the job, or show up at the user’s cubicle with a baseball bat.

Paging and Swapping

Using swap space isn’t bad in and of itself. Swap space is much slower than chip memory, but it does work, and many programs don’t need to have everything in RAM in order to run. The old rule of thumb says that a typical program spends 80 percent of its time running 20 percent of its code. Much of the rest of its code covers startup and shutdown, error handling, and so on. You can safely let those bits go out of RAM with minimal performance impact.

Swap space retains whatever data gets written to it. Once a process uses swap, that swap remains in use until the process either exits or calls the memory back from swap.

Swap usage occurs through paging and swapping. Paging is all right; swapping is not so good, but it’s better than crashing.

Paging

Paging occurs when FreeBSD moves a portion of a running program into swap space. Paging can actually improve performance on a heavily loaded system because unused bits can be stored on disk until they’re needed—if ever. FreeBSD can then use the real memory for actual running code. Does it really matter whether your system pushes your database’s startup code out to swap once the database is up and running?

Swapping

If the computer doesn’t have enough physical memory to store a process that isn’t being run at that particular microsecond, the system can move the entire process to swap. When the scheduler starts that process again, FreeBSD fetches the entire process from swap and runs it, probably consigning some other process to swap.

The problem with swapping is that disk I/O activity goes through the roof and performance drops dramatically. Since requests take longer to handle, there are more requests on the system at any one time. Logging in to check the problem only makes the situation worse because your login is just one more process. Some systems can handle certain amounts of swapping, while on others, the situation quickly degenerates into a death spiral.

When your CPU is overloaded, the system is slow. When your disks are a bottleneck, the system is slow. Memory shortages can actually crash your computer. If you’re swapping, you must buy more memory or resign yourself to appalling performance. If you’re trapped into this hardware and can’t buy more memory, you might get a really fast SSD and use it for swap.

The output of vmstat(8) shows the number of processes swapped out at any one time.

Performance Tuning

FreeBSD caches recently accessed data in memory because a surprising amount of information is read from the disk time and time again. Information cached in physical memory can be accessed very quickly. If the system needs more memory, it dumps the oldest cached chunks in favor of new data. UFS and ZFS use different methods to decide what to cache, but the principle generally applies.

When I booted my desktop this morning, I started Firefox so I could check my RSS feeds. The disk worked for a moment or two to read in the program. I then shut the browser off so I could focus on my work, but FreeBSD left Firefox in the cache. If I restart Firefox, FreeBSD will pull it from memory instead of troubling the disk, which dramatically reduces its startup time. Had I started a process that demanded a whole bunch of memory, though, FreeBSD would have dumped the web browser from the cache to support the new program.

If your system is operating well, you’ll have at least a few megabytes of free memory. The sysctls vm.v_free_target and hw.pagesize tell you how much free memory FreeBSD thinks it needs on your system. If you consistently have more free memory than these two sysctls multiplied, your system isn’t being used to its full potential. For example, on my mail server I have:

# sysctl vm.v_free_target
vm.v_free_target: 5350
# sysctl hw.pagesize
hw.pagesize: 4096

My system wants to have at least 5,350 × 4,096 = 21,913,600 bytes, or about 22MB, of free memory. I could lose a gigabyte of RAM from my desktop without flinching, if it wasn’t for the fact that I suffer deep-seated emotional trauma about insufficient RAM.3
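
If you don’t feel like doing the multiplication by hand, a quick sh(1) one-liner computes it from the two sysctls:

# echo $(( $(sysctl -n vm.v_free_target) * $(sysctl -n hw.pagesize) ))
21913600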

Memory Usage

If a host has a lot of memory in cache or buffer, or the ARC has eaten all its RAM, it doesn’t have a memory shortage. You might make good use of more memory, but it isn’t strictly necessary. If you have low free memory, but a lot of active and non-ZFS wired memory, your system is devouring RAM. Adding memory would let you take advantage of the buffer cache.

If the pagedaemon keeps running, incrementing the sr field in your vmstat output, the kernel is working hard to provide memory. The host might well have a memory shortage. Once the host starts to use swap, though, this memory shortage is no longer hypothetical. It might not be bad, but it’s not theoretical.

Swap Space Usage

Swap space helps briefly cover RAM shortages. For example, if you’re untarring a huge file you might easily consume all of your physical memory and start using virtual memory. It’s not worth buying more RAM for such occasional tasks when swap suffices. If a memory-starved server runs a daemon that doesn’t ever get called, that daemon will eventually get mostly or entirely swapped out in favor of processes that are performing work.

Only worry about swap space use when the system constantly pages data in and out of swap.

In short, swap space is like wine. A glass or two now and then won’t hurt you and might even be a good choice. Hitting the bottle constantly is a problem. If you have to swap constantly, consider a really fast but durable SSD.

CPU Usage

A processor can do only so many things a second. If you run more tasks than your CPU can handle, requests will start to back up, you’ll develop a processor backlog, and the system will slow down. That’s CPU usage in a nutshell. If performance is unacceptable and top(1) shows your CPU hovering around 100 percent all the time, CPU utilization is probably your problem. While new hardware is certainly an option, you do have other choices. For example, investigate the processes running on your system to see whether they’re all necessary. Did some junior sysadmin install a SETI@Home client to hunt for aliens with your spare CPU cycles? How about a Bitcoin miner? Is anything running that was important at one time, but not any longer? Find and shut down those unnecessary processes, and make sure that they won’t start the next time the system boots.

If you have very specific needs, such as dedicating certain processors to specific tasks, consider cpuset(1). It’s overkill for most users, but a high-performance application might make good use of dedicated processors.
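
For example, to restrict an already running process to CPUs 2 and 3, you could use something like this (the PID 1234 is just a placeholder for your process):

# cpuset -l 2,3 -p 1234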

Once that’s done, evaluate your system performance again. If you still have problems, try rescheduling or reprioritizing.

Rescheduling

Rescheduling is easier than reprioritizing; it’s a relatively simple way to balance system processes so that they don’t monopolize system resources. As discussed in Chapter 20, you and your users can schedule programs to run at specific times with cron(8). If you have users who are running massive jobs at particular times, you might consider using cron(8) to run them in off hours. Frequently, jobs such as the monthly billing database search can run between 6 PM and 6 AM and nobody will care—Finance just wants the data on hand at 8 AM on the first day of the month so they can close out last month’s accounting. Similarly, you can schedule your make buildworld && make buildkernel at 1 AM.
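
A user crontab entry for that billing job might look something like this sketch; the script path is hypothetical, and the fields mean “1 AM on the first day of every month”:

0 1 1 * * /usr/local/scripts/billing-report.sh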

Reprioritizing with Niceness

If rescheduling won’t work, you’re left with reprioritizing, which can be a little trickier. When reprioritizing, you tell FreeBSD to change the importance of a given process. For example, you can have a program run during busy hours, but only when nothing else wants to run. You’ve just told that program to be nice and step aside for other programs.

The nicer a process is, the less CPU time it demands. The default niceness is 0, but niceness runs from 20 (very nice) to -20 (not nice at all). This might seem backward; you could argue that a higher number should mean a higher priority. That would lead to a language problem, however; calling this factor “selfishness” or “crankiness” instead of “niceness” didn’t seem like a good idea at the time.4

The top(1) tool displays a PRI column for process priority. FreeBSD calculates a process’s priority from a variety of factors, including niceness, and runs high-priority processes first whenever possible. Niceness affects priority, but you can’t directly edit priority.

If you know that your system is running at or near capacity, you can choose to run a command with nice(1) to assign the process a niceness. Specify niceness with nice -n and the nice value in front of the command. For example, to start a very polite make buildworld at niceness 15, you’d run:

# nice -n 15 make buildworld

Only root can assign a negative niceness to a program, as in nice -n -5. For example, if you want to abuse your superuser privileges to make a compile finish as quickly as possible, use a negative niceness:

# nice -n -20 make

Usually, you don’t have the luxury of telling a command to be nice when you start it but instead have to change its niceness when you learn that it’s absorbing all of your system capacity. You can use renice(8) to reprioritize running processes by their process IDs or owners. To change the niceness of a process, run renice with the new niceness and the PID as arguments.

In my career, I’ve run several logging hosts. In addition to general syslog services, they usually also run several instances of flow-capture, Nagios, and other critical network awareness systems. I’ll often use a web interface to all of this and allow other people to access my logs. If I find that intermittent load on the web server is interfering with my network monitoring or my syslogd(8) server, I must take action. Renicing the web server makes clients run more slowly, but that’s better than slowing down monitoring. Use pgrep(1) to find the web server’s PID:

# pgrep httpd
993
# renice 10 993
993: old priority 0, new priority 10

Boom! FreeBSD now serves web requests after other processes. This greatly annoys the users of that service, but since it’s my server and I’m already annoyed, that’s all right.

To renice every process owned by a user, use the -u flag. For example, to make my processes more important than anyone else’s, I could do this:

# renice -5 -u mwlucas
1001: old priority 0, new priority -5

The 1001 is my user ID on this system. Again, presumably I have a very good reason for doing this, beyond my need for personal power.5 Similarly, if that user who gobbled up all my processor time insists on being difficult, I could make his processes very, very nice, which would probably solve other users’ complaints. If you have a big background database job, renicing the user running that job can let the foreground work proceed normally.

Niceness only affects CPU usage. It has no impact on disk or network activity.

Now that you can look at system problems, let’s learn how to hear what the system is trying to tell you.

Status Mail

FreeBSD runs maintenance jobs every day, week, and month, via periodic(8). These jobs perform basic system checks and notify the administrators of changes, items requiring attention, and potential security issues. The output of each scheduled job is mailed daily to the root account on the local system. The simplest way to find out what your system is doing is to read this mail; many very busy sysadmins just like you have collaborated to make these messages useful. While you might get a lot of these messages, with a little experience, you’ll learn how to skim the reports looking for critical or unusual changes only.

The configuration of the daily, weekly, and monthly reports is controlled in periodic.conf, as discussed in Chapter 20.

You probably don’t want to log in as root on all of your servers every day just to read email, so forward root’s mail from every server to a centralized mailbox. Make this change in /etc/mail/aliases, as discussed in Chapter 20.
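
A minimal sketch of that alias, pointing at a hypothetical central mailbox; if you’re using the base system sendmail, rebuild the alias database afterward with newaliases(1):

root: sysadmins@example.com

# newaliases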

The only place where I recommend disabling these jobs is on embedded systems, which should be managed and monitored through some other means, such as your network monitoring system. On such a system, disable the periodic(8) checks in /etc/crontab.
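
Commenting out the stock periodic(8) entries in /etc/crontab looks something like this; these are the default lines, though your crontab might differ slightly:

#1      3       *       *       *       root    periodic daily
#15     4       *       *       6       root    periodic weekly
#30     5       1       *       *       root    periodic monthly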

While these daily reports are useful, they don’t tell the whole story. Logs give a much more complete picture.

Logging with syslogd

The FreeBSD logging system is terribly useful. Any Unix-like operating system allows you to log almost anything at almost any level of detail. While you’ll find default system logging hooks for the most common system resources, you can choose a logging configuration that meets your needs. Almost all programs integrate with the logging daemon, syslogd(8).

The syslog protocol works through messages. Programs send individual messages, which the syslog daemon syslogd(8) catches and processes. syslogd(8) handles each message according to its facility and priority level, both of which client programs assign to messages. You must understand both facilities and levels to manage system logs.

Facilities

A facility is a tag indicating the source of a log entry. This is an arbitrary label, just a text string used to sort one program from another. In most cases, each program that needs a unique log uses a unique facility. Many programs or protocols have facilities dedicated to them—for example, FTP is such a common protocol that syslogd(8) has a special facility just for it. syslogd also supports a variety of generic facilities that you can assign to any program.

Here are the standard facilities and the types of information they’re used for.

auth Public information about user authorization, such as when people logged in or used su(1).

authpriv Private information about user authorization, accessible only to root.

console Messages normally printed to the system console.

cron Messages from the system process scheduler.

daemon A catch-all for all system daemons without other explicit handlers.

ftp Messages from FTP and TFTP servers.

kern Messages from the kernel.

lpr Messages from the printing system.

mail Mail system messages.

mark This facility puts an entry into the log every 20 minutes. This is useful when combined with another log.

news Messages from the Usenet News daemons.

ntp Network Time Protocol messages.

security Messages from security programs, such as pfctl(8).

syslog Messages from the log system about the log system itself. Don’t log when you log, however, as that just makes you dizzy.

user The catch-all message facility. If a userland program doesn’t specify a logging facility, it uses this.

uucp Messages from the Unix-to-Unix Copy Protocol. This is a piece of pre-internet Unix history that you’ll probably never encounter.

local0 through local7 These are provided for the sysadmin. Many programs have an option to set a logging facility; choose one of these if at all possible. For example, you might tell your customer service system to log to local0.

While most programs have sensible defaults, it’s your job as the sysadmin to manage which programs log to which facility.

Levels

A log message’s level represents its relative importance. While programs send all of their logging data to syslogd, most systems record only the important stuff that syslogd receives and discard the rest. Of course, one person’s trivia is another’s vital data, and that’s where levels come in.

The syslog protocol offers eight levels. Use these levels to tell syslogd what to record and what to discard. The levels are, in order from most to least important:

emerg System panic. Messages flash on every terminal. The computer is basically hosed. You don’t even have to reboot—the system is doing it for you.

alert This is bad, but not an emergency. The system can continue to function, but this error should be attended to immediately.

crit Critical errors include things such as bad blocks on a hard drive or serious software issues. You can continue to run as is, if you’re brave.

err These are errors that require attention at some point, but they won’t destroy the system.

warning These are miscellaneous warnings that probably won’t stop the program that issued them from working just as it always has.

notice This includes general information that probably doesn’t require action on your part, such as daemon startup and shutdown.

info This includes program information, such as individual transactions in a mail server.

debug This level is usually only of use to programmers and occasionally to sysadmins who are trying to figure out why a program behaves as it does. Debugging logs can contain whatever information the programmer considered necessary to debug the code, which might include information that violates user privacy.

none This means, “Don’t log anything from this facility.” It’s most commonly used to exclude information from wildcard entries, as we’ll see shortly.

By combining facility and level, you can categorize messages quite narrowly and treat each according to your needs.

Processing Messages with syslogd(8)

The syslogd(8) daemon catches messages from local programs and, if you so configure it, from the network, and compares them with entries in /etc/syslog.conf or files in /etc/syslog.d/. Files in /etc/syslog.d/ are for your own entries and add-on programs, while /etc/syslog.conf is for integrated system programs. Syslogd only reads /etc/syslog.d/ files ending in .conf. Both locations use the same format, but I’ll refer to /etc/syslog.conf for clarity. That file has two columns; the first describes the log message, either by facility and level or by program name. The second tells syslogd(8) what to do when a log message matches the description. For example, look at this entry from the default syslog.conf:

mail.info                                       /var/log/maillog

This tells syslogd(8) that when it receives a message from the mail facility with a level of info or higher, the message should be appended to /var/log/maillog.

The logger won’t log to a nonexistent file. Use touch(1) to create the log file before restarting syslogd(8).
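
For example, if you add an entry pointing at a brand-new file, you might do something like this before the new entry takes effect (the filename is just an example):

# touch /var/log/local0.log
# service syslogd restart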

Wildcards

You can also use a wildcard in place of a level or facility. For example, this line logs every message from the mail facility, regardless of level:

mail.*                                       /var/log/maillog

To log everything from everywhere, uncomment the all.log entry and create the file /var/log/all.log:

*.*                                         /var/log/all.log

This works, but I find it too informative to be of any real use. You’ll find yourself using complex grep(1) statements daisy-chained together to find even the simplest information. Also, this would include all sorts of private data.

Excluding Information

Use the none level to exclude information from a log. For example, here, we exclude authpriv information from our all-inclusive log. The semicolon allows you to combine entries on a single line:

*.*;authpriv.none        /var/log/most.log

Comparison

You can also use the comparison operators < (less than), = (equal), and > (greater than) in syslog.conf rules. While syslogd defaults to recording all messages at the specified level or above, you might want to include only a range of levels. For example, you could log everything of info level and above to the main log file while logging the rest to the debug file:

mail.info                /var/log/maillog
mail.=debug              /var/log/maillog.debug

The mail.info entry matches all log messages sent to the mail facility at info level and above. The second line matches only the messages that have a level of precisely debug. You can’t use a simple mail.debug because the debugging log will then duplicate the content of the previous log. This way, you don’t have to sort through debugging information for basic mail logs, and you don’t have to sort through mail transmission information to get your debugging output.
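As a sketch, the same pattern works for any facility. If you wanted to split FTP logs the same way, keeping errors apart from routine transfer chatter, you might use entries like these:

ftp.err                  /var/log/ftpd.error
ftp.=info                /var/log/ftpd.info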

Local Facilities

Many programs offer to log via syslog. Most of these can be set to a facility of your choice. The various local facilities are reserved for these programs. For example, by default, dhcpd(8) (see Chapter 20) logs to the facility local7. Here, we catch these messages and send them to their own file:

local7.*                /var/log/dhcpd

If you run out of local facilities, you can use other facilities that the system isn’t using. For example, I once used the uucp facility on a busy log server on a network that had no uucp services.
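As a sketch, suppose you tell a hypothetical add-on daemon to log to local3. You’d catch its messages with an entry like this:

local3.*                /var/log/myapp.log

You can then verify the plumbing without touching the daemon at all by sending a test message with logger(1):

# logger -p local3.info "test message"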

Logging by Program Name

If you’re out of facilities, you can use the program’s name as a matching term. An entry for a name requires two lines: the first line contains the program name with a leading exclamation mark and the second line sets up logging. For example, FreeBSD uses this to log ppp(8) information:

!ppp
*.*                    /var/log/ppp.log

The first line specifies the program name and the second one uses wildcards to tell syslogd(8) to append absolutely everything to a file.

The !programname syntax affects all lines after it, so you must put it last in syslog.conf. You can safely use it in an /etc/syslog.d file without worrying about affecting other entries.
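For example, you might drop a hypothetical file named /etc/syslog.d/myapp.conf into place containing nothing but a program match and its destination:

!myapp
*.*                    /var/log/myapp.log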

Logging to User Sessions

When you log to a user, any messages that arrive appear on that user’s screen. To log to a user session, list usernames separated by commas as the destination. To write a message to all users’ terminals, use an asterisk (*). For example, the default syslog.conf includes this line:

*.emerg                *

This says that any message of emergency level will appear on all users’ terminals. Since these messages usually say “goodbye” in one way or another, that’s appropriate.
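To bother only specific people, list their usernames instead. This sketch sends anything of alert level or higher to the terminals of root and mwlucas:

*.alert                root,mwlucas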

Sending Log Messages to Programs

To direct log messages to a program, use a pipe symbol (|):

mail.*                |/usr/local/bin/mailstats.pl

Logging to a Logging Host

My networks habitually have a single logging host that handles not only the FreeBSD boxes but also Cisco routers and switches, other Unix boxes, and any syslog-speaking appliances. This greatly reduces system maintenance and saves disk space. Each log message includes the hostname, so you can easily sort them out later.

Use the at symbol (@) to send messages to another host. For example, the following line dumps everything your local syslog receives to the logging host on my network:

*.*                  @loghost.blackhelicopters.org

The syslog.conf on the destination host determines the final destination for those messages.

On the logging host, you can separate logs by the host where the log message originated. Use the plus (+) symbol and the hostname to indicate that the rules that follow apply to this host:

+dhcpserver
local7.*            /var/log/dhcpd
+ns1
local7.*            /var/log/named

Put your generic rules at the top of syslog.conf. Per-host rules should go near the bottom or in separate syslog.d files.

Logging Overlap

The logging daemon doesn’t log on a first-match or last-match basis; instead, it logs according to every matching rule. This means you can easily have one log message in several different logs. Consider the following snippet of log configuration.

*.notice;authpriv.none        /var/log/messages
local7.*                      /var/log/dhcpd

Almost every message of level notice or more is logged to /var/log/messages. Anything with a facility of authpriv is deliberately excluded from this log, though. We have our DHCP server logging to /var/log/dhcpd. This means that any DHCP messages of notice level or above will be logged to both /var/log/messages and /var/log/dhcpd. I don’t like this; I want my DHCP messages only in /var/log/dhcpd. I can follow the authpriv example to deliberately exclude DHCP messages from /var/log/messages by using the none level:

*.notice;authpriv.none;local7.none        /var/log/messages

My /var/log/messages syslog configuration frequently grows quite long as I incrementally exclude every local facility from it, but that’s all right.
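The end result looks something like this sketch, with one none entry for each local facility you’ve shuffled off to its own file:

*.notice;authpriv.none;local3.none;local7.none        /var/log/messages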

syslogd Customization

FreeBSD runs syslogd by default, but you’ll need to adjust its flags before it will work as a logging host. You can customize how it works through the use of command line flags. You can specify flags either on the command line or in rc.conf as syslogd_flags.

Allowed Log Senders

You can specify exactly which hosts syslogd(8) accepts log messages from. This can be useful so you don’t wind up accepting logs from random people on the internet. While sending you lots of logs could be used to fill your hard drive as a preparation for an attack, it’s more likely to be the result of a misconfiguration. Your log server should be protected by a firewall in any case. Use the -a flag to specify either the IP addresses or the network of hosts that can send you log messages, as these two (mutually exclusive) examples show:

syslogd_flags="-a 192.168.1.9"
syslogd_flags="-a 192.168.1.0/24"

While syslogd(8) would also accept DNS hostnames and domain names for this restriction, DNS is a completely unsuitable access control mechanism.

You can entirely disable accepting messages from remote hosts by specifying the -s flag, FreeBSD’s default. If you use -ss instead, syslogd(8) also disables sending log messages to remote hosts. Using -ss removes syslogd(8) from the list of network-aware processes that show up in sockstat(1) and netstat(1). While this half-open UDP socket is harmless, some people feel better if syslogd(8) doesn’t appear attached to the network at all.
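If a machine should never touch the network with its logs at all, a single rc.conf entry does the job:

syslogd_flags="-ss"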

Attach to a Single Address

syslogd(8) defaults to attaching to UDP port 514 on every IP address the system has. Your jail host needs syslogd, but daemons on a machine running jails should bind to a single address rather than grabbing every IP on the system. Use the -b flag to force syslogd(8) to attach to a single IP:

syslogd_flags="-b 192.168.1.1"

Additional Log Sockets

syslogd(8) can accept log messages via Unix domain sockets as well as over the network. The standard location for this is /var/run/log. Chrooted processes on your system can’t access this location, however. If you want those chrooted processes to log via syslogd(8), you must either configure them to log over the network or provide an additional logging socket for them. Use the -l flag for this and specify the full path to the additional logging socket:

syslogd_flags="-l /var/named/var/run/log"

The named(8) and ntpd(8) programs come with FreeBSD and are commonly chrooted. The /etc/rc.d/syslogd script is smart enough to add the appropriate syslogd sockets if you chroot these programs through rc.conf.

Verbose Logging

Logging with verbose mode (-v) prints the numeric facility and level of each message written in the local log. Doubly verbose logging (-vv) prints the name of the facility and level instead of the number:

syslogd_flags="-vv"

These are the flags I use most commonly. Read syslogd(8) for the complete list of options.

Log File Management

Log files grow, and you must decide how large they can grow before you trim them. The standard way to do this is through log rotation. When logs are rotated, the oldest log is deleted, the current log file is closed up and given a new name, and a new log file is created for new data. FreeBSD includes a basic log file processor, newsyslog(8), which also compresses files, restarts daemons, and in general handles all the routine tasks of log file shuffling. cron(1) runs newsyslog(8) once per hour.
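You can also run newsyslog(8) by hand. The -n flag makes it print what it would do without actually rotating anything, and -v makes it explain its decision for each log, which is handy when testing a new entry:

# newsyslog -nv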

When newsyslog(8) runs, it reads /etc/newsyslog.conf and the files in /etc/newsyslog.conf.d/. The /etc/newsyslog.conf file is for core system functions, while files in /etc/newsyslog.conf.d/ are for add-on software. The newsyslog program attempts to parse any files in /etc/newsyslog.conf.d/ as newsyslog configurations. Both use the same format, so we’ll refer to newsyslog.conf for clarity. Each line in newsyslog.conf gives the condition for rotating one log file. If the conditions for rotating the log are met, the log is rotated and other actions are taken as appropriate. /etc/newsyslog.conf uses one line per log file; each line has seven fields, like this:

/var/log/ppp.log        root:network    640  3     100  *     JC

Let’s examine each field in turn.

Log File Path

The first entry on each line (/var/log/ppp.log in the example) is the full path to the log file to be processed.

Owner and Group

The second field (root:network in our example) lists the rotated file’s owner and group, separated by a colon. This field is optional and isn’t present in many of the standard entries.

newsyslog(8) can change the owner and group of old log files. By default, log files are owned by the root user and the wheel group. While it’s not common to change the owner, you might need this ability on multiuser machines.

You can also choose to change only the owner or only the group. In these cases, you use a colon with a name on only one side of it. For example, :www changes the group to www, while mwlucas: gives me ownership of the file.

Permissions

The third field (640 in our example) gives the permissions mode in standard Unix three-digit notation.

Count

This field (3 in our example) specifies the oldest rotated log file that newsyslog(8) should keep. newsyslog(8) numbers archived logs from newest to oldest, starting with the newest as log 0. For example, with the default count of 5 for /var/log/messages, you’ll find the following message logs:

messages
messages.0.bz2
messages.1.bz2
messages.2.bz2
messages.3.bz2
messages.4.bz2
messages.5.bz2

Those of you who can count will recognize that this makes six archives, not five, plus the current log file, for a week of logs. As a rule, it’s better to have too many logs than too few; however, if you’re tight on disk space, deleting an extra log or two might buy you time.

Size

The fifth field (100 in our example) is the file size in kilobytes. When newsyslog(8) runs, it compares the size listed here with the size of the file. If the file is larger than the size given here, newsyslog(8) rotates the file. If you don’t want the file size to affect when the file is rotated, put an asterisk here.

Time

So far this seems easy, right? The sixth field, rotation time, changes that. The time field has four different legitimate types of value: an asterisk, a number, and two different date formats.

If you rotate based on log size rather than age, put an asterisk here.

If you put a plain naked number in this field, newsyslog(8) rotates the log after that many hours have passed. For example, if you want the log to rotate every 24 hours but don’t care about the exact time when that happens, put 24 here.

The date formats are a little more complicated.

ISO 8601 Time Format

Any entry beginning with an @ symbol is in the restricted ISO 8601 time format. This is a standard used by newsyslog(8) on most Unix-like systems; it was the time format used in MIT’s primordial newsyslog(8). Restricted ISO 8601 is a bit obtuse, but every Unix-like operating system supports it.

A full date in the restricted ISO 8601 format is 14 digits with a T in the middle. The first four digits are the year, the next two the month, the next two the day of the month. The T is inserted in the middle as a sort of decimal point, separating whole days from fractions of a day. The next two digits are hours, the next two minutes, the last two seconds. For example, the date of March 2, 2008, 9:15 and 8 seconds PM is expressed in restricted ISO 8601 as 20080302T211508.

While complete dates in restricted ISO 8601 are fairly straightforward, confusion arises when you don’t list the entire date. You can choose to specify only fields near the T, leaving fields further away as blank. Blank fields are wildcards. For example, 1T matches the 1st day of every month. 4T00 matches midnight of the 4th day of every month. T23 matches the 23rd hour, or 11 PM, of every day. With a newsyslog.conf time of @T23, the log rotates every day at 11 PM.

As with cron(1), you must specify time units in detail. For example, @7T, the seventh day of the month, rotates the log once an hour, every hour, on the seventh day of the month. After all, it matches all day long! A time of @7T01 would rotate the log at 1 AM on the 7th day of the month, which is probably more desirable. You don’t need more detail than an hour, however, as newsyslog(8) runs only once an hour.
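Put into a complete entry, a sketch for a hypothetical log rotated at 1 AM on the first of every month, keeping a year of bzip2-compressed archives regardless of size, might look like this:

/var/log/myapp.log        644  12    *     @1T01    J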

FreeBSD-Specific Time

The restricted ISO 8601 time system doesn’t allow you to easily designate weekly jobs, and it’s impossible to specify the last day of the month. That’s why FreeBSD includes a time format that lets you easily perform these common tasks. Any entry with a leading dollar sign ($) is written in the FreeBSD-specific month week day format.

This format uses three identifiers: M (day of month), W (day of week), and D (hour of day). Each identifier is followed by a number indicating a particular time. Hours range from 0 to 23, while days run from 0 (Sunday) to 6 (Saturday). Days of the month start at 1 and go up, with L representing the last day of the month. For example, to rotate a log on the fifth of each month at noon I could use $M5D12. To start the month-end log accounting at 10 PM on the last day of the month, use $MLD22.
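As a sketch, a weekly rotation that restricted ISO 8601 can’t express, such as every Saturday at 11 PM for a hypothetical web application log, would look like this:

/var/log/webapp.log        644  8     *     $W6D23    J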

Flags

The flags field dictates any special actions to be taken when the log is rotated. This most commonly tells newsyslog(8) how to compress the log file, but you can also signal processes when their log is rotated out from under them.

Log File Format and Compression

Logs can be either text or binary files.

Binary files can be written to only in a very specific manner. newsyslog(8) starts each new log with a “logfile turned over” message, but adding this text to a binary file would damage it. The B flag tells newsyslog(8) that this is a binary file and that it doesn’t need this header.

Other log files are written in plain old ASCII text, and newsyslog(8) can and should add a timestamped message to the top of the file indicating when the log was rotated. If you’re using UFS, compressing old log files can save considerable space. The J flag tells newsyslog(8) to compress archives with bzip2(1); the Z flag specifies gzip compression; the X flag, xz(1); and the Y flag, the new hotness in compression, zstd(1).

If you’re using ZFS, though, text log files get compressed at the dataset layer along with every other compressible file. You can compress the log files in the traditional manner anyway, but there’s no advantage to doing so. Plus, you’ll need to manually decompress the files before you can view them. Let ZFS handle compression for you.
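If you’re curious whether your log dataset is already being compressed, zfs(8) will tell you. The dataset name here is an assumption; substitute your own:

# zfs get compression,compressratio zroot/var/log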

Special Log File Handling

When it creates and rotates log files, newsyslog(8) can perform a few special tasks. Here are the most common; you can read about the others in newsyslog.conf(5).

Perhaps you have many similar log files that you want to treat identically. The G flag tells newsyslog that the log file name at the beginning of the line is actually a shell glob, and that all log files that match the expression are to be rotated in this manner. To learn about shell glob patterns, read glob(3). Bring aspirin.

You might want newsyslog to create a file if it doesn’t exist. Use the C flag for this. The syslogd program won’t log to a nonexistent file.

The N flag explicitly tells newsyslog not to send a signal when rotating this log.

Finally, use a hyphen (-) as a placeholder when you don’t need any of these flags. It keeps the flags column intact so that later fields, such as a pidfile path, still line up correctly.

Pidfile

The next field is a pidfile path (not shown in our example, but look at /etc/newsyslog.conf for a couple of samples). A pidfile records a program’s process ID so that other programs can easily view it. If you list the full path to a pidfile, newsyslog(8) sends a kill -HUP to that program when it rotates the log. This signals the process to close its logfiles and restart itself. Not all processes have pidfiles, and not all programs need this sort of special care when rotating their logs.

Signal

Most programs perform logfile rotation on a SIGHUP, but some programs need a specific signal when their logs are rotated. You can list the exact signal necessary in the last field, after the pidfile.

Sample newsyslog.conf Entry

Let’s slap all this together into a worst-case, you’ve-got-to-be-kidding example. A database log file needs rotation at 11 PM on the last day of the month. The database documentation says that you must send the server an interrupt signal (SIGINT, or signal number 2) on rotation. You want the archived logs to be owned by the user dbadmin and viewable only by that user. You need six months of logs. What’s more, the logs are binary files. Your newsyslog.conf line would look like this:

/var/log/database    dbadmin:   400   6   *   $MLD23   B   /var/run/db.pid   2

This is a deliberately vile example; in most cases, you just slap in the filename and the rotation condition, and you’re done.

FreeBSD and SNMP

Emailed reports are nice but general, and logs are difficult to analyze for long-term trends. The industry standard for network, server, and service management is Simple Network Management Protocol (SNMP). Many different vendors support SNMP as a protocol for gathering information from many different devices across the network. FreeBSD includes an SNMP agent, bsnmpd(8), that not only provides standard SNMP functions but also gives visibility to FreeBSD-specific features.

FreeBSD’s bsnmpd (short for Begemot SNMPD) is a minimalist SNMP agent specifically designed to be extensible. All actual functionality is provided via external modules. FreeBSD includes the bsnmpd modules for standard network SNMP functions and modules for specific FreeBSD features, such as PF and netgraph(4). Rather than trying to be all things to all people, bsnmpd(8) offers a foundation where everyone can build an SNMP implementation that does only what they need, no more and no less.

SNMP 101

SNMP works on a classic client-server model. The SNMP client, usually some kind of management workstation or monitoring server, sends a request across the network to an SNMP server. The SNMP server, also called an agent, gathers information from the local system and returns it to the client. FreeBSD’s SNMP agent is bsnmpd(8).

An SNMP client can also send the SNMP server a request to make changes. If the system is properly (or improperly, depending on your point of view) configured, you can issue commands via SNMP. This “write” configuration is most commonly used in routers, switches, and other embedded network devices. Most Unix-like operating systems have a command line management system and don’t usually accept instruction via SNMP. Writing system configuration or issuing commands via SNMP requires careful setup and raises all sorts of security issues; it’s an excellent topic for an entire book. No sysadmin I know is comfortable managing their system via SNMP. With all of this in mind, we’re going to focus specifically on read-only SNMP.

In addition to having an SNMP server answer requests from an SNMP client, the agent can transmit SNMP traps to a trap receiver elsewhere on the network. An SNMP agent generates these traps in response to particular events on the server. SNMP traps are much like syslogd(8) messages, except that they follow the very specific format required by SNMP. FreeBSD doesn’t include an SNMP trap receiver at this time; if you need one, check out snmptrapd(8) from net-snmp (net-mgmt/net-snmp).

SNMP MIBs

SNMP manages information via a management information base (MIB), a tree-like structure containing hierarchical information in ASN.1 format. We’ve seen an example of an MIB tree before: the sysctl(8) interface discussed in Chapter 6.

Each SNMP server has a list of information it can extract from the local computer. The server arranges these bits of information into a hierarchical tree. Each SNMP MIB tree has very general main categories: network, physical, programs, and so on, with more specific subdivisions in each. Think of the tree as a well-organized filing cabinet, where individual drawers hold specific information and files within drawers hold particular facts. Similarly, the uppermost MIB contains a list of MIBs beneath it.

MIBs can be referred to by name or by number. For example, here’s an MIB pulled off a sample system:

interfaces.ifTable.ifEntry.ifDescr.1 = STRING: "em0"

The first term in this MIB, interfaces, shows us that we’re looking at this machine’s network interfaces. If this machine had no interfaces, this first category wouldn’t even exist. The ifTable is the interface table, or a list of all the interfaces on the system. ifEntry shows one particular interface, and ifDescr means that we’re looking at a description of this interface. This MIB can be summarized as, “Interface number 1 on this machine is called em0.”

MIBs can be expressed as numbers, and most SNMP tools do their work natively in numerical MIBs. Most people prefer words, but your poor brain must be capable of working with either. An MIB browser can translate between the numerical and word forms of an SNMP MIB for you, or you could install net-mgmt/net-snmp and use snmptranslate(1), but for now, just trust me. The preceding example can be translated to:

.1.3.6.1.2.1.2.2.1.2.1

Expressed in words, this MIB has 5 terms separated by dots. Expressed in numbers, the MIB has 11 parts. That doesn’t look quite right if they’re supposed to be the same thing. What gives?

The numerical MIB is longer because it includes the default .1.3.6.1.2.1, which means .iso.org.dod.internet.mgmt.mib-2. This is the standard subset of MIBs used on the internet. The vast majority of SNMP MIBs (but not all) have this leading string in front of them, so nobody bothers writing it down any more.

If you’re in one of those difficult moods, you can even mix words and numbers:

.1.org.6.1.mgmt.1.interfaces.ifTable.1.2.1

At this point, international treaties permit your coworkers to drive you from the building with pitchforks and flaming torches. Pick one method of expressing MIBs and stick to it.
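If you’ve installed net-mgmt/net-snmp, snmptranslate(1) does the conversion for you. As a sketch, the first command below prints the numeric OID for the interface description column, and the second translates those numbers back into a name:

# snmptranslate -On IF-MIB::ifDescr
# snmptranslate .1.3.6.1.2.1.2.2.1.2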

MIB Definitions and MIB Browsers

MIBs are defined according to a very strict syntax and are documented in MIB files. Every SNMP agent has its own MIB files; bsnmpd’s are in /usr/share/snmp. These files are very formal plaintext. While you can read and interpret them with nothing more than your brain, I highly recommend copying them to a workstation and installing an MIB browser so that you can comprehend them more easily.

MIB browsers interpret MIB files and present them in their full tree-like glory, complete with definitions of each part of the tree and descriptions of each individual MIB. Generally speaking, an MIB browser lets you enter a particular MIB and displays both the numerical and word definitions of that MIB, along with querying an SNMP agent for the status of that MIB.

If you have FreeBSD (or a lesser Unix) on your workstation, use mbrowse (net-mgmt/mbrowse) for MIB browsing. If you don’t want to use a graphical interface for SNMP work, check out net-snmp (net-mgmt/net-snmp) for a full assortment of command line SNMP client tools.

SNMP Security

Many security experts state that SNMP really stands for “Security: Not My Problem!” This is rather unkind but very true. SNMP needs to be used only behind firewalls on trusted networks. If you must use SNMP on the naked internet, use packet filtering to keep the public from querying your SNMP service. SNMP agents run on UDP port 161.

The more common SNMP versions, 1 and 2c, provide no encryption. This means that anyone with a packet sniffer can capture your SNMP community name, so be absolutely certain you’re using SNMP only on a private network. Making unencrypted SNMP queries over an untrusted network is a great way to have strangers poking at your system management. SNMP version 3 uses encryption to protect data on the wire.

SNMP provides basic security through communities. If you go looking around, you’ll find all sorts of explanations for why a community isn’t the same thing as a password, but a community is a password. Most SNMP agents have two communities by default: public (read-only access) and private (read-write access). Yes, there’s a default that provides read-write access. Your first task whenever you provision an SNMP agent on any host, on any OS, is to disable those default community names and replace them with ones that haven’t been widely documented for decades.

FreeBSD’s bsnmpd(8) defaults to SNMPv2c but can do SNMPv3. SNMPv3 is a more complicated protocol, so we’re not going to cover it here. If you understand the SNMPv3 protocol and the basics of configuring FreeBSD’s bsnmpd, you won’t have any trouble enabling SNMPv3 in bsnmpd.

Configuring bsnmpd

Before you can use SNMP to monitor your system, you must configure the SNMP daemon. Configure bsnmpd(8) in /etc/snmpd.config. In addition to including the default communities of public and private, the default configuration doesn’t enable any of the FreeBSD-specific features that make bsnmpd(8) desirable.

bsnmpd Variables

bsnmpd uses variables to assign values to configuration statements. Most high-visibility variables are set at the top of the configuration file, as you’ll see here:

location := "Room 200"
contact := "[email protected]"
system := 1     # FreeBSD
traphost := localhost
trapport := 162

These top variables define values for MIBs that should be set on every SNMP agent. The location describes the physical location of the machine. Every system needs a legitimate email contact. bsnmpd(8) runs on operating systems other than FreeBSD, so you have the option of setting a particular operating system here. Lastly, if you have a trap host, you can set the server name and port here.

Further down the file, you can set the SNMP community names:

# Change this!
read := "public"
# Uncomment begemotSnmpdCommunityString.0.2 below that sets the community
# string to enable write access.
write := "geheim"
trap := "mytrap"

The read string defines the read-only community of this SNMP agent. The default configuration file advises you to change it. Take that advice. The write string is the read-write community name, which is disabled by default further down in the configuration file. You can also set the community name for SNMP traps sent by this agent.

With only this configuration, bsnmpd(8) will start, run, and provide basic SNMP data for your network management system. Just set bsnmpd_enable="YES" in /etc/rc.conf to start bsnmpd at boot. You won’t get any special FreeBSD functionality, however. Let’s go on and see how to manage this.
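Before moving on, though, it’s worth a quick sanity check that the basic setup answers queries. Start the daemon, then walk the system tree from any machine with the net-snmp client tools installed. The hostname and the public community here are placeholders; use your own host and whatever read community you configured:

# service bsnmpd start
# snmpwalk -v2c -c public server.example.com system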

Detailed bsnmpd Configuration

bsnmpd(8) uses the variables you set at the top of the configuration file to assign values to different MIBs later in the configuration. For example, at the top of the file you set the variable read to public. Later in the configuration file, you’ll find this statement:

begemotSnmpdCommunityString.0.1 = $(read)

This sets the MIB begemotSnmpdCommunityString.0.1 equal to the value of the read variable.

Why not just set these values directly? bsnmpd(8) is specifically designed to be extensible and configurable. Setting a few variables at the top of the file is much easier than directly editing the rules further down the file.

Let’s go back to the begemotSnmpdCommunityString MIB we just set. Why are we setting this? Search for the string in your MIB browser, and you’ll see that this is the MIB that defines an SNMP community name. You probably could have guessed that from the assignment of the read variable, but it’s nice to confirm that.

Similarly, you’ll find an entry like this:

begemotSnmpdPortStatus.0.0.0.0.161 = 1

Checking the MIB browser shows that this dictates the IP address and the UDP port that bsnmpd(8) binds to (in this case, all available addresses, on port 161). All MIB configuration is done in this manner.

Loading bsnmpd Modules

Most interesting bsnmpd(8) features are configured through modules. Enable modules in the configuration file by giving the begemotSnmpdModulePath MIB the module’s name and the full path to the shared library that implements support for that feature. For example, in the default configuration, you’ll see a commented-out entry for the PF bsnmpd(8) module:

begemotSnmpdModulePath."pf"    = "/usr/lib/snmp_pf.so"

This enables support for PF MIBs. Your network management software will be able to see directly into PF when you enable this, letting you track everything from dropped packets to the size of the state table.

As of this writing, FreeBSD’s bsnmpd(8) ships with the following modules included but disabled. Some are FreeBSD-specific, while others support industry standards. Enable these by uncommenting their configuration file entries and restarting bsnmpd.

lm75 Provides data from the lm75(4) temperature sensor via SNMP.

Netgraph Provides visibility into all Netgraph-based network features, documented in snmp_netgraph(3).

PF Provides visibility into the PF packet filter.

Hostres Implements the Host Resources SNMP MIB, snmp_hostres(3).

bridge Provides visibility into bridging functions, documented in snmp_bridge(3).

wlan Accesses information on wireless networking.

Restart bsnmpd(8) after enabling any of these in the configuration file. If the program won’t start, check /var/log/messages for errors.

With bsnmpd(8), syslogd(8), status emails, and a wide variety of performance analysis tools, you can make your FreeBSD system the best-monitored device on the network. Now that you can see everything your system offers, grab a flashlight as we explore a few of FreeBSD’s darker corners.
