11
THE UNIX FILE SYSTEM

FreeBSD’s filesystem, the Unix File System (UFS), is a direct descendant of the filesystem shipped with 4.4BSD. One of the original UFS authors still develops the FreeBSD filesystem and has added many nifty features in recent years. FreeBSD is not the only operating system that still uses the 4.4BSD filesystem or a descendant thereof. A Unix vendor that doesn’t specifically tout its “improved and advanced” filesystem is probably running a UFS derivative.

UFS’s place as the primordial filesystem has given it leave to extend tendrils throughout FreeBSD. Many UFS concepts underlie FreeBSD’s support for other filesystems, from ZFS to optical disks. Even if you have no intention of ever using UFS, you must understand the basics of UFS to understand how FreeBSD manages filesystems.

Like the rest of Unix, UFS is designed to handle the most common situations effectively while reliably supporting unusual configurations. FreeBSD ships with UFS configured to be as widely useful as possible on relatively modern hardware, but you can choose to optimize a particular filesystem for trillions of small files or a half-dozen 1TB files if you must.

What we call UFS today is actually UFS version 2, or UFS2. Primordial UFS can’t handle modern disk sizes.

UFS is best suited for smaller systems, or applications that can’t handle the overhead of ZFS. Many people prefer UFS for virtual machines. I discuss choosing a filesystem in Chapter 2.

UFS Components

UFS is built of two layers, one called the Unix File System and the other the Fast File System (FFS). UFS handles items like filenames, attaching files to directories, permissions, and all of those petty details users care about. FFS does the real work in getting files written to disk and arranging them for quick access. The two work together to provide data storage.

The Fast File System

FFS is built of superblocks, blocks, fragments, and inodes.

A superblock records the filesystem’s characteristics. It contains a magic number that identifies the filesystem as UFS, as well as filesystem geometry information the kernel uses to optimize writing and reading files. A UFS filesystem keeps many backup copies of the superblock, in case the primary gets damaged.

Blocks are segments of disk that contain data. FreeBSD defaults to 32KB blocks. FFS maps blocks onto specific sectors on the underlying disk or GEOM provider. Every stored file gets broken up into 32KB chunks, and each chunk is stored in its own block.

Not all files are even multiples of 32KB, so FFS stores the leftovers in fragments. The standard is one-eighth of the block size, or 4KB. For example, a 39KB file would fill one block and two fragments. One of those fragments has only 3KB in it, so fragments do waste disk space—but they waste far less space than using full blocks everywhere.
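The block-and-fragment arithmetic above can be sketched in a few lines of shell. This is a simplified illustration, assuming the default 32KB block and a fragment of one-eighth that size; ffs_layout is a hypothetical helper name, not a FreeBSD command, and real FFS allocation has more rules than this.

```shell
#!/bin/sh
# Simplified sketch: how a file of a given size (in KB) splits into full
# blocks and fragments. ffs_layout is a hypothetical helper, not part of FreeBSD.
ffs_layout() {
    size_kb=$1
    block_kb=${2:-32}                # default block size from the text
    frag_kb=$((block_kb / 8))        # a fragment is one-eighth of a block
    blocks=$((size_kb / block_kb))   # number of completely filled blocks
    rest_kb=$((size_kb % block_kb))  # leftover data that goes into fragments
    frags=$(( (rest_kb + frag_kb - 1) / frag_kb ))  # round up to whole fragments
    slack_kb=$((frags * frag_kb - rest_kb))         # unused space in the last fragment
    echo "blocks=$blocks fragments=$frags slack_kb=$slack_kb"
}

ffs_layout 39    # the 39KB file from the example above
```

For the 39KB file this reports one block, two fragments, and 1KB of slack, which matches the 3KB sitting in a 4KB fragment.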

How UFS Uses FFS

UFS allocates certain FFS blocks as inodes, or index nodes, to map blocks and fragments to files. Each inode records one file’s size, permissions, and the list of blocks and fragments that hold that file’s contents. Collectively, the data in an inode is known as metadata, or data about data.

Each filesystem has a certain number of inodes, proportional to the filesystem size. A modern disk probably has hundreds of thousands of inodes on each partition, enough to support hundreds of thousands of files. If you have a truly large number of very tiny files, however, you might need to rebuild your filesystem to support additional inodes. Use df -i to see how many inodes remain free on your filesystem.

Theoretically, it was possible to run UFS on a storage layer other than FFS. That’s how many log-based or extent-based filesystems work. Over decades of development, though, UFS features like journaling and soft updates have so greatly entangled FreeBSD’s UFS and FFS that separating the two is no longer realistic or even vaguely plausible.

Vnodes

Inodes and blocks worked perfectly if the only filesystem you used was UFS and all your hard drives were permanently attached. These days, we routinely swap disks between different machines and even different operating systems. You probably need to read optical media and flash disks on your desktop, and servers might even need to accept hard drives formatted for a different operating system.

FreeBSD uses a storage abstraction layer, the virtual node or vnode, to mediate between filesystems and the kernel. You’ll never directly manipulate a vnode, but the FreeBSD documentation frequently refers to them. Vnodes are a translation layer between the kernel and whatever filesystem you’ve mounted. If you’re an object-oriented programmer, think of a vnode like a base class that all storage classes inherit. When you write a file to a UFS filesystem, the kernel addresses the data to a vnode that, in turn, is mapped to a UFS inode and FFS blocks. When you write to a FAT32 filesystem, the kernel addresses data to a vnode that’s mapped to a specific part of the FAT32 filesystem. You’ll encounter inodes only when dealing with UFS filesystems, but vnodes apply to any filesystem.

Mounting and Unmounting Filesystems

The mount(8) program’s main function is attaching filesystems to a host’s filesystem tree. While FreeBSD mounts every filesystem listed in /etc/fstab at boot time, you must understand how mount(8) works. If you’ve never played with mounting before, boot your FreeBSD test machine into single-user mode (see Chapter 4) and follow along.

In single-user mode, FreeBSD has mounted the root partition read-only. On a traditional Unix-like system, the root partition contains just enough of the system to perform basic setup, get core services running, and find the rest of the filesystems. Other filesystems aren’t mounted, so their content is inaccessible. The current FreeBSD installer puts everything in the root partition, so you’d get the basic operating system, but any special filesystems, network mounts, and so on would be empty. You might need to mount other filesystems to perform your system maintenance.

Mounting Standard Filesystems

To manually mount a filesystem listed in /etc/fstab, such as /var, /usr, or /media, give mount(8) the name of the filesystem you want to mount.

# mount /media

This mounts the partition exactly as listed in /etc/fstab, with all the options specified in that file. If you want to mount all the partitions listed in /etc/fstab, except those labeled noauto, use mount’s -a flag.

# mount -a

When you mount all filesystems, filesystems that are already mounted don’t get remounted.

Special Mounts

You might need to mount a filesystem at an unusual location or mount something temporarily. I most commonly mount disks manually when installing a new disk. Use the device node and the desired mount point. If my /var/db partition is /dev/gpt/db and I want to mount it on /mnt, I would run:

# mount /dev/gpt/db /mnt

Unmounting a Partition

When you want to disconnect a filesystem from the system, use umount(8) to tell the system to unmount the partition. (Note that the command is umount, not unmount.)

# umount /usr

You cannot unmount filesystems that are in use by any program. If you cannot unmount a partition, you’re probably accessing it somehow. Even a command prompt in the mounted directory prevents you from unmounting the underlying partition. Running fstat | grep /usr (or whatever the partition is) can expose the blocking program.
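Grepping for a path can also match unrelated lines. As a hedged alternative, fstat(1) has a -f flag that restricts its report to files open on the filesystem containing the path you name. The wrapper name who_blocks below is hypothetical; only the fstat invocation is the real command.

```shell
#!/bin/sh
# who_blocks is a hypothetical wrapper name. fstat -f limits the listing to
# files open on the filesystem that contains the given path.
who_blocks() {
    fstat -f "$1"
}

# Usage, on the filesystem that refuses to unmount:
#   who_blocks /usr
```

Anything listed is holding the filesystem open. Stop that program, or move your shell out of the directory, and try umount(8) again.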

UFS Mount Options

FreeBSD supports several mount options that change filesystem behavior. When you manually mount a partition, you can specify any mount option with -o.

# mount -o ro /dev/gpt/home /home

You can also specify mount options in /etc/fstab (see Chapter 10). Here, I use the ro option on the /home filesystem, just as in the preceding command line.

/dev/gpt/home /home ufs ro 2 2

The mount(8) man page lists all of the UFS mount options, but here are the most commonly used ones.

Read-Only Mounts

If you want to look at the contents of a disk but disallow changing them, mount the partition read-only. You cannot alter the data on the disk or write any new data. In most cases, this is the safest and the most useless way to mount a disk.

Many system administrators want to mount the root partition, and perhaps even /usr, as read-only to minimize potential system damage from an intruder or malicious software. This maximizes system stability but vastly complicates maintenance. If you use an automatic deployment system, such as Ansible or Puppet, and habitually redeploy your servers from scratch rather than upgrading them, read-only mounts might be a good fit for you.

Read-only mounts are especially valuable on a damaged computer. While FreeBSD won’t let you perform a standard read-write mount on a damaged or dirty filesystem, it will perform a read-only mount if the filesystem isn’t too badly fubar. This gives you a chance to recover data from a dying disk.

To mount a filesystem read-only, use either the rdonly or ro option. Both work identically.

Synchronous Mounts

Synchronous (or sync) mounts are the old-fashioned way of mounting filesystems. When you write to a synchronously mounted disk, the kernel waits to see whether the write is actually completed before informing the program. If the write didn’t complete successfully, the program can choose to act accordingly.

Synchronous mounts provide the greatest data integrity in the case of a crash, but they’re also slow. Admittedly, “slow” is relative today, when even a cheap disk outperforms what was the high end several years ago. Consider using synchronous mounting when you wish to be truly pedantic on data integrity, but in almost all cases, it’s overkill.

To mount a partition synchronously, use the option sync.

Asynchronous Mounts

While asynchronous mounts are pretty much supplanted by soft updates (see “Soft Updates” later in this chapter), you’ll still hear about them. For faster data access at higher risk, mount your partitions asynchronously. When a disk is asynchronously mounted, the kernel writes data to the disk and tells the writing program that the write succeeded without waiting for the disk to confirm that the data was actually written.

Asynchronous mounting is fine on disposable filesystems, such as memory file systems that disappear at shutdown, but don’t use it with important data. The performance difference between asynchronous mounts and noasync with soft updates is minuscule. (I’ll cover noasync in the next section.)

To mount a partition asynchronously, use the option async.

Combining Sync and Async

FreeBSD’s default UFS mount option combines sync and async mounts as noasync. With noasync, data that affects inodes is written to the disk synchronously, while actual data is handled asynchronously. Combined with soft updates (see later in this chapter), a noasync mount creates a very robust filesystem.

As noasync mounts are the default, you don’t need to specify it when mounting, but when someone else does, don’t let it confuse you.

Disable Atime

Every file in UFS includes an access-time stamp, called the atime, which records when the file was last accessed. If you have a large number of files and don’t need this data, you can mount the disk noatime so that UFS doesn’t update this timestamp. This is most useful for flash media or disks that suffer from heavy load, such as Usenet news spool drives. Some software uses the atime, though, so don’t disable it blindly.

Disable Execution

Your policy might say that certain filesystems shouldn’t have executable programs. The noexec mount option prevents the system from executing any programs on the filesystem. Mounting /home noexec can help prevent users from running their own programs, but for it to be effective, also mount /tmp, /var/tmp, and anywhere else users can write their own files noexec as well.

A noexec mount doesn’t prevent a user from running a shell script or an interpreted script in Perl or Python or whatever. While the script might be on a noexec filesystem, the interpreter usually isn’t.

Another common use for a noexec mount is when you have a filesystem that contains binaries for a different operating system or a different hardware architecture and you don’t want anyone to execute them.
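Because noexec helps only when every user-writable filesystem carries it, a quick audit of /etc/fstab can be handy. The following is a minimal sketch, assuming the standard six-column fstab layout; lacks_option is a hypothetical helper name, and it only reads text on standard input without touching any mounts.

```shell
#!/bin/sh
# lacks_option OPTION < fstab
# Hypothetical helper: prints the mount point of every UFS entry whose
# options column does not include OPTION.
lacks_option() {
    awk -v opt="$1" '
        /^[[:space:]]*#/ { next }    # skip comment lines
        NF < 4 { next }              # skip blank or incomplete lines
        $3 == "ufs" {
            found = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
            if (!found) print $2
        }'
}

# Usage:
#   lacks_option noexec < /etc/fstab
```

The same helper works for nosuid or any other option name, since it only compares strings in the fourth column.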

Disable Suid

Setuid programs allow users to run programs as if they’re another user. For example, programs such as login(1) must perform actions as root but must be run by regular users. Setuid programs obviously must be written carefully so that intruders can’t exploit them to get unauthorized access to your system. Many system administrators habitually disable all unneeded setuid programs.

The nosuid option disables setuid access from all programs on a filesystem. As with noexec, script wrappers can easily evade nosuid restrictions.

Disable Clustering

FFS optimizes reads and writes on the physical media by clustering. Rather than scattering a file all over the hard drive, it writes out the whole thing in large chunks. Similarly, it makes sense to read files in larger chunks. You can disable this feature with the mount options noclusterr (for read clustering) and noclusterw (for write clustering).

Disable Symlinks

The nosymfollow option disables symlinks, or aliases to files. Symlinks are mainly used to create aliases to files that reside on other partitions. To create an alias to another file on the same partition, use a regular link instead. See ln(1) for a discussion of links.

Aliases to directories are always symlinks; you cannot use a hard link for those.

UFS Resiliency

UFS dates from the age when a power loss meant data loss. After decades of use and debugging, UFS almost never loses data, especially when compared with other open source filesystems. UFS achieves this resiliency by careful integrity checking, especially after an unexpected shutdown like a power failure.

The point of resiliency isn’t to verify the data on disk—UFS is pretty good at that. It’s to speed integrity verification and filesystem recovery after that unexpected shutdown. The size of modern disks means that verification can take a long time without additional resiliency. An integrity check of a 100MB filesystem is much faster than the same integrity check of a multiterabyte filesystem! Adding resiliency improves recovery times.

UFS offers several ways to improve the resilience of a UFS filesystem, such as soft updates and journaling. Before creating a filesystem, choose one that fits your needs.

Soft Updates

Soft updates is a technology used to organize and arrange disk writes so that filesystem metadata remains consistent at all times, giving nearly the performance of an async mount with the reliability of a sync mount. That doesn’t mean that all data will be safely written to disk—a power failure at the wrong moment can still lose data. The file being written to disk at the exact millisecond the power dies can’t get to the disk no matter what the operating system does. But what’s actually on the disk will be internally consistent. Soft updates lets UFS quickly recover from failure.

You can enable soft updates when creating the filesystem with newfs(8) and switch them on or off later with tunefs(8).

As filesystems grow, soft updates show their limits. Multiterabyte filesystems still need quite a while to recover from an unplanned shutdown. The original soft updates journaling paper (http://www.mckusick.com/softdep/suj.pdf) mentions that a 92 percent full 14-drive array with a deliberately damaged filesystem needed 10 hours for integrity checking. You’ll need a journal well before then.

Soft Updates Journaling

A journaling filesystem records any changes outside the actual filesystem. Changes get quickly dumped to storage and then inserted into the filesystem at a more leisurely pace. If the system dies unexpectedly, the filesystem automatically recovers any changes from the journal. This vastly reduces the requirement for rebuilding filesystem integrity at startup. When you install FreeBSD, it defaults to creating UFS partitions with soft updates journaling.

Rather than recording all transactions, the soft updates journal records all metadata updates so that the filesystem can always be restored to an internally consistent state. Benchmarks show that journaling adds only a tiny amount of load to soft updates. It does add I/O overhead, as the system must dump all changes to the journal and then replay them into the filesystem. In exchange, it vastly reduces recovery time. That 14-drive array that needed 10 hours for integrity checking? It needed less than one minute to recover from the same damage using the journal.

Soft updates with journaling is very powerful. Why wouldn’t you always use journaling? Soft updates journaling disables UFS snapshots. If you need UFS snapshots, you can’t journal. If you need snapshots, though, you’re probably better off using ZFS anyway. FreeBSD’s version of dump(8) uses UFS snapshots to back up live filesystems. Only us old Unix hands use dump any more, and that’s mostly because we already know it, but if your organization mandates using dump(8), you need another resiliency option.

GEOM Journaling

FreeBSD can also journal at the GEOM level with gjournal(8). Like any other filesystem journal, gjournal records filesystem transactions. At boot, FreeBSD checks the journal file for any changes not yet written to the filesystem and makes those changes, ensuring a consistent filesystem. Gjournal predates soft updates journaling.

While soft updates journals only metadata, gjournal journals all filesystem transactions. You’re less likely to lose data in a system failure, but everything gets written twice, which impacts performance. If you’re using gjournal, though, don’t use any type of soft updates. You should also mount the filesystem async. You can use snapshots on a gjournaled filesystem.

Gjournal uses 1GB of disk per filesystem. You can’t just turn it on and off—you must have space for the journal. You can use a separate partition for the journal or include the gigabyte in the partition if you leave space for it. If you decide to add gjournal to an existing partition, you need to find the space somewhere.

Should you use gjournal or soft updates journaling? I recommend using soft updates journaling if at all possible. If that isn’t an option, use plain soft updates. Use GEOM journaling if you need UFS snapshots, including dump(8) on snapshots. Personally, I no longer use gjournal.

Creating and Tuning UFS Filesystems

In the last chapter, we partitioned and labeled your disks. Now let’s put a filesystem on those partitions. Create UFS filesystems with newfs(8), using a device node as the last argument. Here, I create a filesystem on the device /dev/gpt/var:

# newfs /dev/gpt/var
/dev/gpt/var: 51200.0MB (104857600 sectors) block size 32768, fragment size 4096
        using 82 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872,
--snip--

The first line repeats the device node and prints the partition’s size, along with the block and fragment sizes. You’ll get filesystem geometry information, a relic of the days when disk geometry bore some relationship to the hardware. Finally, newfs(8) prints a list of superblock backups. The larger your filesystem, the more backup superblocks you get.
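The numbers in that newfs(8) output can be cross-checked with a little shell arithmetic. This sketch assumes 512-byte sectors, which is what makes the reported sector count and size agree; newfs_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: recompute the size and total inode count from the
# figures newfs(8) printed above.
# Arguments: sectors, bytes per sector (assumed 512), cylinder groups,
# inodes per cylinder group.
newfs_summary() {
    sectors=$1; sector_bytes=$2; groups=$3; inodes_per_group=$4
    echo "size_mb=$((sectors * sector_bytes / 1024 / 1024))"
    echo "total_inodes=$((groups * inodes_per_group))"
}

newfs_summary 104857600 512 82 80256
```

This prints size_mb=51200 and total_inodes=6580992, so this 50GB filesystem has roughly 6.5 million inodes available.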

If you want to use soft updates journaling, add the -j flag. To use soft updates without journaling, add the -U flag. After you’ve created the filesystem, you can enable and disable soft updates journaling, and plain soft updates, with tunefs(8).

UFS Labeling

Device nodes can change, but labels remain constant. Best practice is to label GPT partitions, but you can’t label MBR partitions. UFS filesystems on an MBR can use a UFS label with the -L flag.

# newfs -L var /dev/ada3s1d

The labels appear in /dev/ufs. Use them in /etc/fstab and other configuration files to avoid disk renaming mayhem. You can’t apply UFS labels to non-UFS filesystems.

If you’re using UFS on GPT partitions, choose either GPT or UFS labels, not both. Thanks to withering, you’ll see only one label at a time, so using both will probably confuse you.

Block and Fragment Size

The time UFS needs to read or write a file is roughly proportional to the number of blocks and fragments involved. Generally, FreeBSD can read a 10-block file in half the time it needs to read a 20-block file. The FreeBSD developers chose the default block and fragment sizes to accommodate the widest variety of files.

If you have a special-purpose filesystem that overwhelmingly contains either large or small files, you might consider changing the block size when creating the filesystem. While you can change the block size of an existing filesystem, it’s a terrible idea. Block sizes must be a power of 2. The assumption that a fragment is one-eighth the size of a block is hardcoded in many places, so let newfs(8) compute the fragment size from the block size.

Suppose I have a filesystem dedicated to large files, and I want to increase the block size. The default block size is 32KB, so the next larger block size would be 64KB. Specify the new block size with -b.

# newfs -b 64K -L home /dev/da0s1d

If you’re going to have many small files, you might consider using a smaller block size. One thing to watch out for is a fragment size smaller than the underlying disk’s physical sector size. FreeBSD defaults to 4KB fragments. If your disk has 4KB sectors, don’t use a smaller fragment size. If you’re absolutely certain that your disk has 512-byte physical sectors, you can consider creating a filesystem with a 16KB (or even 8KB) block size and the corresponding 2KB or 1KB fragment size.
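To see which fragment size goes with a candidate block size, the one-eighth rule and the power-of-2 requirement can be sketched as follows. The function name frag_for_block is hypothetical, and sizes are in bytes; this only shows the expected result and does not replace letting newfs(8) do the computation.

```shell
#!/bin/sh
# Hypothetical helper: given a block size in bytes, reject anything that is
# not a power of 2 and print the matching fragment size (one-eighth).
frag_for_block() {
    block=$1
    n=$block
    # Halve repeatedly; a power of 2 ends at exactly 1.
    while [ "$n" -gt 1 ] && [ $((n % 2)) -eq 0 ]; do
        n=$((n / 2))
    done
    if [ "$block" -lt 1 ] || [ "$n" -ne 1 ]; then
        echo "block size must be a power of 2" >&2
        return 1
    fi
    echo $((block / 8))
}

frag_for_block 65536    # 64KB blocks
frag_for_block 16384    # 16KB blocks
frag_for_block 8192     # 8KB blocks
```

The three calls print 8192, 2048, and 1024, matching the 64KB, 16KB, and 8KB cases discussed above. Compare the result against your disk’s physical sector size before committing to a smaller block size.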

In my sysadmin career, I have needed a custom block size only twice. Don’t use one until you experience a performance issue.

Using GEOM Journaling

Before using gjournal(8), decide where you’re putting the 1GB journal. If possible, I’d recommend including that gigabyte in the filesystem partition. That means if you want a 50GB filesystem, put it in a 51GB partition. Otherwise, use a separate partition.

Load the geom_journal kernel module with gjournal load or in /boot/loader.conf before performing any gjournal operations.

To create a gjournal provider while including the journal in the partition, use the gjournal label command.

# gjournal label da3p5

If you want to have a separate provider be the journal, add that provider as a second argument.

# gjournal label da3p5 da3p7

These commands run silently if successful. They create a new device node with the same name as your journaled device, but with .journal added to the end. Running gjournal label da3p5 creates /dev/da3p5.journal. From this point on, do all work on the journaled device node.

Create your new UFS filesystem on the journaled device. Use the -J flag to tell UFS it’s running on top of gjournal. Do not enable any sort of soft updates, including soft updates journaling. It seems to work for a time . . . then it doesn’t.

Mount your gjournal filesystems async. The normal warnings that apply to async mounts don’t apply to gjournal, however. The gjournal GEOM module handles the verification and integrity checking normally managed by the filesystem.

/dev/da3p5.journal /var/log ufs rw,async 2 2
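Pulled together, the gjournal steps above might look like the following sketch. It reuses da3p5 and /var/log from the examples. The run helper and the DRY_RUN variable are hypothetical additions for safety: by default the script only prints the commands, and it executes them only when DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch of the gjournal workflow described above. "run" and DRY_RUN are
# hypothetical safety scaffolding: commands are printed, and executed only
# when DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

gjournal_setup() {
    dev=$1    # provider name, such as da3p5
    mnt=$2    # mount point, such as /var/log
    run gjournal load                              # load the geom_journal module
    run gjournal label "$dev"                      # journal shares the partition
    run newfs -J "/dev/$dev.journal"               # -J only; no -U or -j here
    run mount -o async "/dev/$dev.journal" "$mnt"  # async is appropriate on gjournal
}

gjournal_setup da3p5 /var/log
```

Review the printed commands first. Remember that newfs(8) destroys whatever is on the device, so run with DRY_RUN=0 only on a partition you mean to wipe.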

The documentation says that you can convert an existing partition to use gjournal, provided that you have a separate partition for the journal and that the last sector of the existing filesystem is empty. In practice, I find that the last sector of the existing filesystem is always full, but if you want to try, read gjournal(8) for the details.

Tuning UFS

You can view and change the settings on each UFS filesystem by using tunefs(8). This lets you enable and disable features; plus, you can adjust how UFS writes files, manages free space, and uses filesystem labels.

View Current Settings

View a filesystem’s current settings with the -p flag and the partition’s current mount point or underlying provider.

# tunefs -p /dev/gpt/var
tunefs: POSIX.1e ACLs: (-a)                                disabled
tunefs: NFSv4 ACLs: (-N)                                   disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 enabled
tunefs: soft update journaling: (-j)                       enabled
tunefs: gjournal: (-J)                                     disabled
tunefs: trim: (-t)                                         disabled
tunefs: maximum blocks per file in a cylinder group: (-e)  4096
tunefs: average file size: (-f)                            16384
tunefs: average number of files in a directory: (-s)       64
tunefs: minimum percentage of free space: (-m)             8%
tunefs: space to hold for metadata blocks: (-k)            6408
tunefs: optimization preference: (-o)                      time
tunefs: volume label: (-L)

Many of the available settings relate to specific security functionality we don’t cover. Topics like MAC restrictions and all the different types of ACL fill entire books. But we can see that this filesystem uses soft updates and soft updates journaling, though it doesn’t use gjournal. We get the minimum amount of free space. At the end, we have the nonexistent UFS label. We also get a bunch of information on filesystem geometry and block size.

Use tunefs(8) to change any of these settings on an unmounted filesystem. Conveniently, tunefs(8) shows the command line flag to address each. I normally boot into single-user mode before changing a filesystem’s settings.

You might notice that you can adjust all sorts of filesystem internals, such as block arrangements and filesystem geometry. Don’t. In over two decades of FreeBSD use, I have never seen anyone improve their situation by twiddling these knobs. I have repeatedly seen people twiddle these knobs and ruin their day.

But let’s look at the settings you might actually need to enable and disable.

Soft Updates and Journaling

Use the -j flag to enable or disable soft updates journaling on a filesystem. Enabling journaling automatically enables soft updates as well.

# tunefs -j enable /dev/gpt/var
Using inode 5 in cg 0 for 33554432 byte journal
tunefs: soft updates journaling set

To disable soft updates journaling, use the disable keyword.

# tunefs -j disable /dev/gpt/var
Clearing journal flags from inode 5
tunefs: soft updates journaling cleared but soft updates still set.
tunefs: remove .sujournal to reclaim space

A soft updates journal on a nonjournaled filesystem can only confuse matters. Mount the filesystem and remove the .sujournal file in the filesystem’s root directory. Note that turning off journaling leaves soft updates still in place. Use -n enable and -n disable to turn soft updates (without journaling) on and off.

Minimum Free Space

UFS holds back 8 percent of each partition so that it has space to rearrange files for better performance. I discuss this further in “UFS Space Reservations.” If you want to change this percentage, use the -m flag. Here, I tell the filesystem to reserve only 5 percent of the disk.

# tunefs -m 5 /dev/gpt/var
tunefs: minimum percentage of free space changes from 8% to 5%
tunefs: should optimize for space with minfree < 8%

You should now have more usable disk space. Also, UFS will run more slowly because it always packs the filesystem as tightly as possible.
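To get a feel for how much space that change buys, here is a rough calculation. It assumes the 51200MB filesystem from the earlier newfs(8) example and applies the percentage to the whole size, which is an approximation; reserved_mb is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: approximate space held back, in MB, for a filesystem
# of a given size (MB) and minfree percentage.
reserved_mb() {
    echo $(( $1 * $2 / 100 ))
}

size_mb=51200    # assumed: the 50GB /dev/gpt/var filesystem created earlier
before=$(reserved_mb "$size_mb" 8)
after=$(reserved_mb "$size_mb" 5)
echo "held back at 8%: ${before}MB"
echo "held back at 5%: ${after}MB"
echo "gained: $((before - after))MB"
```

On this filesystem the change frees about 1.5GB, which you can weigh against the slower, space-optimized allocation.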

SSD TRIM

Solid-state disks use wear-leveling to extend their lifespan. Wear-leveling works best if the filesystem notifies the SSD when each block is no longer in use. The TRIM protocol handles this notification. Enable TRIM support on your SSD-backed filesystem with the -t flag.

# tunefs -t enable /dev/gpt/var
tunefs: issue TRIM to the disk set

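For the best results, enable TRIM for every partition on a solid-state drive. Enable TRIM at filesystem creation with newfs -t. (The -E flag does something different: it erases the contents of the device before creating the filesystem.)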

Labeling UFS Filesystems

You can apply a UFS label to an existing filesystem with the -L flag.

# tunefs -L scratch /dev/ada3s1e

Don’t mix UFS and GPT labels—you’ll only confuse yourself.

Expanding UFS Filesystems

Your virtual machine runs out of space? Make the disk bigger, and expand the last partition to cover that space, as discussed in Chapter 10. But what about the filesystem on that partition? That’s where growfs(8) comes in.

The growfs(8) command expands an existing UFS filesystem to fill the partition it’s in. Give growfs one argument, the filesystem’s device node. Use labels if you like.

# growfs /dev/gpt/var
It's strongly recommended to make a backup before growing the file system.
OK to grow filesystem on /dev/gpt/var from 50.0GB to 100GB? [Yes/No] yes
super-block backups (for fsck_ffs -b #) at:
 19233792, 20516032, 21798272, 23080512, 24362752,
--snip--

When growfs(8) requests confirmation, you must enter the full word yes. Any other answer, including the plain y that many other programs accept, cancels the operation. Confirm the operation and growfs(8) will add blocks, superblocks, and inodes as needed to fill the partition.

If you don’t want the filesystem to fill the entire partition, you can specify a size with -s. Here, I expand the filesystem on this same partition to 80GB.

# growfs -s 80g /dev/gpt/var

I strongly encourage you to make filesystems the same size as the underlying partitions, unless you’re looking to make your coworkers slap you.

UFS Snapshots

You can take an image of a UFS filesystem at a moment in time; this is called a snapshot. You can snapshot a filesystem, erase and change some files, and then copy the unchanged files from the snapshot. Tools like dump(8) use snapshots to ensure consistent backups. UFS snapshots are not as powerful or flexible as ZFS snapshots, but they’re a solid, reliable tool within their limits.

UFS snapshots require soft updates but are incompatible with soft updates journaling. Each filesystem can have up to 20 snapshots.

Snapshots let you get at the older version of an edited or removed file. Access the contents of a snapshot by mounting the file as a memory device. I’ll discuss memory devices in Chapter 13.

Taking and Destroying Snapshots

Create snapshots with mksnap_ffs(8). This program assumes you want to make a snapshot of the filesystem your current working directory is in. Give the snapshot location as an argument. Snapshots traditionally go in the .snap directory at the filesystem root. If you’re using a tool that automatically creates and removes snapshots, like dump(8), check there for your snapshot files. If you don’t like that location, though, you can put them anywhere on the filesystem you’re taking the snapshot of. Here, I took a snapshot of the /home filesystem:

# cd /home
# mksnap_ffs .snap/beforeupgrade

Snapshots use disk space. You can’t take a snapshot of a full filesystem.

A snapshot is just a file. Remove the file and you destroy the snapshot.
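Because a snapshot is just a file, getting at its contents means attaching that file to a memory device and mounting it read-only. The following is a minimal sketch using mdconfig(8); the function names are hypothetical, and memory devices get their proper treatment in Chapter 13.

```shell
#!/bin/sh
# Hypothetical helpers for browsing a UFS snapshot.
# mount_snapshot SNAPFILE MOUNTPOINT: attach the snapshot file to a read-only
# memory device, mount it read-only, and print the md device name.
mount_snapshot() {
    md=$(mdconfig -a -t vnode -o readonly -f "$1") || return 1
    mount -o ro "/dev/$md" "$2" || return 1
    echo "$md"
}

# unmount_snapshot MOUNTPOINT MDNAME: unmount, then detach the memory device.
unmount_snapshot() {
    umount "$1" && mdconfig -d -u "$2"
}

# Usage:
#   md=$(mount_snapshot /home/.snap/beforeupgrade /mnt)
#   ...copy the old files out of /mnt...
#   unmount_snapshot /mnt "$md"
```

Detach the memory device when you’re done. A snapshot that is still attached can’t be cleanly removed.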

Finding Snapshots

Snapshots are files, and you can put them anywhere on the filesystem. This means it’s easy to lose them. Use find(1) with the -flags snapshot option to find all snapshots on a filesystem.

# find /usr -flags snapshot
/usr/.snap/beforeupgrade
/usr/.snap/afterupgrade
/usr/local/testsnap

There’s my stray snapshot!

Snapshot Disk Usage

A snapshot records the differences between the current filesystem and the filesystem as it existed when the snapshot was taken. Every filesystem change after taking a snapshot increases the size of the snapshot. If you remove a file, the snapshot retains a copy of that file so you can recover it later.

This means deleting data from a filesystem with snapshots doesn’t actually free up space. If you have a snapshot of your /home partition and you delete a file, the deleted file gets added to the snapshot.

Make sure that filesystems with snapshots always have plenty of free space. If you try to take a snapshot and mksnap_ffs(8) complains that it can’t because there’s no space, you might already have 20 snapshots of that filesystem.
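Before blaming disk space, it can help to count the snapshots a filesystem already holds. This sketch builds on the find(1) command shown earlier and adds -xdev so the search stays on one filesystem; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: count snapshot files on one filesystem and compare
# the count against the 20-snapshot limit mentioned above.
snapshot_count() {
    # -xdev keeps find(1) from crossing into other mounted filesystems.
    find "$1" -xdev -flags snapshot | wc -l | tr -d ' '
}

snapshot_room() {
    n=$(snapshot_count "$1")
    echo "$n of 20 snapshots in use on $1"
}

# Usage:
#   snapshot_room /usr
```

If the count is already at 20, remove a snapshot you no longer need before taking another.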

UFS Recovery and Repair

Everything from faulty hardware to improper systems administration3 can damage your filesystems. All of UFS’s resilience technologies are designed to quickly restore data integrity, but nothing can completely guarantee integrity.

Let’s discuss how FreeBSD keeps each UFS filesystem tidy.

System Shutdown: The Syncer

When you shut down a FreeBSD system, the kernel synchronizes all its data to the hard drive, marks the disks clean, and shuts down. This is done by a kernel process called the syncer. During a system shutdown, the syncer reports on its progress in synchronizing the hard drive.

You’ll see odd things from the syncer during shutdown. The syncer walks the list of vnodes that need synchronizing to disk, allowing it to support all filesystems, not just UFS. Thanks to soft updates, writing one vnode to disk can generate another dirty vnode that needs updating. You can see the number of buffers being written to disk rapidly drop from a high value to a low value and perhaps bounce between zero and a low number once or twice as the system really, truly synchronizes the hard drive.

If the syncer doesn’t get a chance to finish, or if the syncer doesn’t run at all thanks to your ham-fisted fumbling, you get a dirty filesystem.

Dirty Filesystems

No, disks don’t get muddy with use (although dust on a platter will quickly damage it, and adding water won’t help). A dirty UFS partition is in a kind of limbo; the operating system has asked for information to be written to the disk, but the data is not yet completely on the physical media. Part of the data blocks might have been written, the inode might have been edited but the data not written out, or any combination of the two. Live filesystems are almost always dirty.

If a host with dirty filesystems fails (say, due to a panic or Bert tripping over the power cable), the filesystem is still dirty when the system boots again. The kernel refuses to mount a dirty filesystem.

Cleaning the filesystem restores data integrity but doesn’t necessarily mean that all your data is on the disk. If a file was half-written to disk when the system died, the file is lost. Nothing can restore the missing half of the file, and the half on disk is essentially useless.

Journaled filesystems should automatically recover when FreeBSD tries to mount them. If the filesystem can’t recover, or if you don’t have a journal, you’ll need to use the legendary fsck(8).

File System Checking: fsck(8)

The fsck(8) program examines a UFS filesystem and tries to verify that every file is attached to the proper inodes and in the correct directory. It’s like verifying a database’s referential integrity. If the filesystem suffered only minor damage, fsck(8) can automatically restore integrity and put the filesystem back in service.

Repairing a damaged filesystem takes time and memory. A fsck(8) run requires about 700MB of RAM to analyze a 1TB filesystem. Most computer systems have fairly proportional memory and storage systems: very few hosts have 512MB RAM and petabytes of disk. But you should know it’s possible to create a UFS filesystem so large that the system doesn’t have enough memory to repair it.
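To put rough numbers on that, the sketch below scales the 700MB-per-terabyte figure linearly. Linear scaling is an assumption made for illustration, not a documented rule, and fsck_ram_mb is an invented name; real memory use depends on how the filesystem was built and how full it is.

```sh
# Rough estimate only: assumes fsck(8) memory use grows linearly,
# at about 700MB of RAM per 1TB of filesystem.
fsck_ram_mb() {
    echo $(( $1 * 700 ))    # $1 is the filesystem size in TB
}

fsck_ram_mb 1       # a 1TB filesystem
fsck_ram_mb 1000    # a 1PB filesystem
```

By this estimate, repairing a 1PB filesystem would want something like 700GB of RAM.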

Manual fsck Runs

Occasionally this automated fsck-on-reboot fails to work. When you check the console, you’ll be looking at a single-user mode prompt and a request to run fsck(8) manually.

Start by preening the filesystem with fsck -p. This automatically corrects a bunch of less severe errors without asking for your approval. Preening causes data loss only rarely. This is frequently successful, but if it doesn’t work, it will ask you to run a “full fsck.”

If you enter fsck at the command prompt, fsck(8) verifies every block and inode on the disk. It finds any blocks that have become disassociated from their inodes and guesses how they fit together and how they should be attached. However, fsck(8) might not be able to identify which directory these files belong in.

Then, fsck(8) asks whether you want to perform these reattachments. If you answer y, it adds the lost file to a lost+found directory in the root of the partition, with a number as a filename. For example, the lost+found directory on your /usr partition is /usr/lost+found. If there are only a few files, you can identify them manually; if you have many files and are looking for particular ones, tools such as file(1) and grep(1) can help you identify them by content.

If you answer n, those nuggets of unknown data remain detached from the filesystem. The filesystem remains dirty until you fix them by some other means.

Trusting fsck(8)

If fsck(8) can’t figure out where a file goes . . . can you? If not, you really have no choice but to trust fsck(8) to recover your system or restore from backup.

A full fsck(8) run inspects every block, inode, and superblock, and identifies every inconsistency. It asks you to type y or n to approve or reject every single correction. Any change you reject you must fix yourself, through some other means. You might spend hours at the console typing y, y, y.

So I’ll ask again: if fsck(8) can’t fix a problem, can you?

If you can’t, consider fsck -y. The -y flag tells fsck(8) to reassemble these files as best it can, without prompting you. It assumes you answer all its questions “yes,” even the really dangerous ones. Using -y automatically triggers -R, which tells fsck(8) to retry cleaning each filesystem until it succeeds or it’s had 10 consecutive failures. It’s cure or kill. You do have backups, right?

You can set your system to try fsck -y automatically on boot. I don’t recommend this, however, because if there’s the faintest chance my filesystem will wind up in digital nirvana, I want to know about it. I want to type the offending command myself and feel the trepidation of hearing my disks churn. Besides, it’s always unpleasant to discover that your system is trashed without having the faintest clue how it got that way. If you’re braver than I, set fsck_y_enable="YES" in rc.conf.

Avoiding fsck -y

What options do you have if you don’t want to use fsck -y? Well, fsdb(8) and clri(8) allow you to debug the filesystem and redirect files to their proper locations. You can restore files to their correct directories and names. This is difficult,4 however, and is recommended only for Secret Ninja Filesystem Masters.

Background fsck

Background fsck gives UFS some of the benefits of a journaled filesystem without actually requiring journaling. You must be using soft updates without journaling to use background fsck. (Soft updates with journaling is far, far preferable to background fsck.) When FreeBSD sees that a background fsck is in process after a reboot, it mounts the dirty disk read-write. While the server is running, fsck(8) runs in the background, identifying loose bits of files and tidying them up behind the scenes.

A background fsck actually has two major stages. When FreeBSD finds dirty disks during the initial boot process, it runs a preliminary fsck(8) assessment of the disks. The fsck(8) program decides whether the damage can be repaired while the system is running or whether a full single-user mode fsck run is required. Most frequently, fsck thinks it can proceed and lets the system boot. After the system reaches multiuser mode, the background fsck runs at a low priority, checking the partitions one by one. The results of the fsck process appear in /var/log/messages.

You can expect performance of any applications requiring disk activity to be lousy during a background fsck. The fsck(8) program occupies a large portion of the disk’s possible activity. While your system might be slow, it will at least be up.

You must check /var/log/messages for errors after a background fsck. The preliminary fsck assessment can make an error, and perhaps a full single-user mode fsck on a partition really is required. If you find such a message, schedule downtime within a few hours to correct the problem. While inconvenient, having the system down for a scheduled period is better than the unscheduled downtime caused by a power outage and the resulting single-user mode fsck -y.
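If you’d rather have the check run in the foreground at boot than in the background, rc.conf has a knob for that. The fragment below is a sketch with illustrative values; check /etc/defaults/rc.conf on your release for the current defaults before copying anything.

```sh
# /etc/rc.conf fragment (illustrative values, not recommendations)
background_fsck="NO"          # check dirty filesystems in the foreground at boot
#background_fsck_delay="60"   # if background fsck stays on: seconds to wait before it starts
```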

Forcing Read-Write Mounts on Dirty Disks

If you really want to force FreeBSD to mount a dirty disk read-write without using a background fsck, you can. You won’t like the results. At all. But, as it’s described in mount(8), some reader will think it’s a good idea unless they know why. Use the -w (read-write) and -f (force) flags to mount(8).

Mounting a dirty partition read-write corrupts data. Note the absence of words like might and could from that sentence. Also note I didn’t use recoverable. Mounting a dirty filesystem may panic your computer. It might destroy all remaining data on the partition or even shred the underlying filesystem. Forcing a read-write mount of a dirty filesystem is seriously bad juju. Don’t do it.

Background fsck, fsck -y, Foreground fsck, Oy Vey!

All these different fsck(8) problems and situations can occur, but when does FreeBSD use each command? FreeBSD uses the following conditions to decide when and how to run fsck(8) on a filesystem:

  • If the filesystem is clean, it is mounted without fsck(8).
  • If a journaled filesystem is dirty at boot, FreeBSD recovers the data from the journal and continues the boot. A journaled filesystem rarely needs fsck(8).
  • If a filesystem without soft updates is dirty at boot, FreeBSD runs fsck(8) on it. If the filesystem damage is severe, FreeBSD stops checking and requests your intervention. You can either run fsck -y or manually approve each correction.
  • If a filesystem with soft updates is dirty at boot, FreeBSD performs a very basic fsck(8) check. If the damage is mild, FreeBSD can use a background fsck(8) in multiuser mode.
  • If the damage is severe, or you don’t want background fsck(8), FreeBSD interrupts the boot and requests a manual fsck(8).
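
The rules above can be sketched as a small decision function. This only restates the list in shell; it isn’t how FreeBSD’s boot scripts are written, and the function name and its argument values are invented for the illustration.

```sh
# Hypothetical restatement of the boot-time rules listed above.
# Arguments: state      (clean|dirty)
#            mode       (journal|softupdates|none)
#            damage     (mild|severe)
#            background (yes|no)  whether background fsck is wanted
fsck_plan() {
    state=$1 mode=$2 damage=$3 background=$4
    if [ "$state" = "clean" ]; then
        echo "mount without fsck"
    elif [ "$mode" = "journal" ]; then
        echo "recover from journal"
    elif [ "$damage" = "severe" ]; then
        echo "stop boot, manual fsck"
    elif [ "$mode" = "softupdates" ] && [ "$background" = "yes" ]; then
        echo "background fsck in multiuser mode"
    elif [ "$mode" = "softupdates" ]; then
        echo "stop boot, manual fsck"
    else
        echo "foreground fsck at boot"
    fi
}
```

For example, fsck_plan dirty softupdates mild yes prints "background fsck in multiuser mode".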

Consider the recovery path when configuring your UFS filesystems.

UFS Space Reservations

A UFS filesystem is never quite as large as you think it should be. UFS holds back 8 percent of the filesystem space for on-the-fly optimization. Only root can write over that limit. That’s why a filesystem can seem to use more than 100 percent of the available space. Why 8 percent? That number’s the result of many years of experience and real-world testing. That 8 percent holdback isn’t a big deal on average filesystems, but as the filesystem grows, it can be considerable. On a 1PB disk array, UFS holds 80TB in reserve.
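
The arithmetic is simple enough to sketch in shell. The function below is an illustration with an invented name; it uses the default 8 percent and decimal units (1PB = 1,000TB).

```sh
# Space held in reserve, given a filesystem size and a reserve percentage.
# Integer math only; the result is in the same unit as the size.
ufs_reserve() {
    size=$1
    percent=${2:-8}     # default UFS reserve, per the text above
    echo $(( size * percent / 100 ))
}

ufs_reserve 1000     # a 1PB array, expressed in TB
```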

UFS behaves differently depending on how full a filesystem gets. On an empty filesystem, it optimizes for speed. Once the filesystem hits 92 percent full (85 percent of the total size, including the 8 percent reserve), it switches to optimize space utilization. Most people do the same thing—once you mostly fill up the laundry hamper, you can jam more dirty clothes in, but it takes a little more time and effort. UFS fragments files to use space more effectively. Fragments reduce disk performance. As free space shrinks, UFS works harder and harder to improve space utilization. A full UFS filesystem runs at about one-third the normal speed.
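
One way to read the 85 percent figure is as 92 percent of whatever is left once the reserve is set aside. A quick check with awk(1) follows, assuming that reading is right; the function name is invented for this sketch.

```sh
# Percentage of the *total* filesystem size at which UFS would switch
# to space optimization, assuming the 92 percent mark is measured
# against the non-reserved space.
space_opt_threshold() {
    reserve=${1:-8}
    awk -v r="$reserve" 'BEGIN { printf "%.1f\n", (100 - r) * 92 / 100 }'
}

space_opt_threshold    # default 8 percent reserve
```

With the default reserve this works out to 84.6, which rounds to the 85 percent quoted above.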

You might want to use tunefs(8) to reduce the amount of disk space FreeBSD holds in reserve. It won’t help as much as you think. Reducing the reserve to 5 percent or less tells UFS to always use space optimization and pack the filesystem as tightly as possible.

Increasing the reserved space percentage doesn’t improve performance. If you increase the reserved space percentage so that your filesystem appears full, regular users won’t be able to write files.5

The reserved space can confuse tools such as NFS. Some other operating systems that can mount UFS over NFS see that a filesystem is 100 percent full and tell the user they can’t write files, despite local clients being able to write files. Remember this when troubleshooting.

The best thing to do is to keep your partition from filling up.

How Full Is a Partition?

To get an overview of how much space each UFS partition has left, use df(1). This lists the partitions on your system, the amount of space each uses, and where it’s mounted. (Don’t use df(1) with ZFS; we’ll discuss why in the next chapter.)

The -h and -H flags tell df(1) to produce human-readable output rather than using blocks. The small -h uses base 2 to create a 1,024-byte kilobyte, while the large -H uses base 10 for a 1,000-byte kilobyte. Typically, network administrators and disk manufacturers use base 10, while system administrators use base 2. Either works so long as you know which you’ve chosen. I’m a network administrator, so you get to suffer through my prejudices in these examples, despite what my tech editor thinks.

# df -H
Filesystem      Size    Used   Avail Capacity  Mounted on
/dev/gpt/root   1.0G    171M    785M      18%  /
devfs           1.0k    1.0k      0B     100%  /dev
/dev/gpt/var    1.0G     64M    892M       7%  /var
/dev/gpt/tmp    1.0G    8.5M    948M       1%  /tmp
/dev/gpt/usr     14G   13.8G    203M      98%  /usr

The first line shows us column headers for the provider name, the size of the partition, the amount of space used, the amount of space available, the percent of space used, and the mount point. We can see that the partition labeled /dev/gpt/root is 1GB in size but has only 171MB on it, leaving 785MB free. It’s 18 percent full and mounted on /.
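
If you’d rather not eyeball the Capacity column, awk(1) can pick out the filesystems that are filling up. This is a minimal sketch: the function name and the default 90 percent threshold are arbitrary choices, and it assumes the six-column df(1) layout shown above, device names under /dev, and mount points without spaces.

```sh
# Read df(1) output on stdin; print the mount point and capacity of
# every /dev-backed filesystem at or above the given percentage.
nearly_full() {
    awk -v limit="${1:-90}" '
        NR > 1 && $1 ~ /^\/dev\// && $5 + 0 >= limit { print $6, $5 }
    '
}
```

Run against the output above, it would flag only /usr:

# df -H | nearly_full 95
/usr 98%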

If your systems are like mine, disk usage somehow keeps growing for no apparent reason. Look at the /usr partition here. It’s 98 percent full. You can identify individual large files with ls -l, but recursively doing this on every directory in the system is impractical.

The du(1) program displays disk usage in a single directory. Its initial output is intimidating and can scare off inexperienced users. Here, we use du(1) to find out what’s taking up all the space in my home directory:

# cd $HOME
# du
1       ./bin/RCS
21459   ./bin/wp/shbin10
53202   ./bin/wp
53336   ./bin
5       ./.kde/share/applnk/staroffice_52
6       ./.kde/share/applnk
--snip--

This goes on and on, displaying every subdirectory and giving its size in blocks. The total of each subdirectory is given—for example, the contents of $HOME/bin totals 53,336 blocks, or roughly 53MB if the blocks are 1KB each. I could sit and let du(1) list every directory and subdirectory, but then I’d have to dig through much more information than I really want to. And blocks aren’t that convenient a measurement, especially not when they’re printed left-justified.

Let’s clean this up. First, du(1) supports an -h flag much like df. Also, I don’t need to see the recursive contents of each subdirectory. We can control how deep du(1) reports with its -d flag. This flag takes one argument, the number of directory levels you want to explicitly list. For example, -d0 lists no subdirectories at all and gives a single total for everything in a directory.

# du -h -d0 $HOME
 14G    /home/mwlucas

I have 14 gigs of data in my home directory? Let’s look a layer deeper and identify the biggest subdirectory.

# du -h -d1
 38K    ./bin
 56M    ./mibs
--snip--
 13G    ./startrekgifs
--snip--

Apparently I must look elsewhere for storage space, as the data in my home directory is too important to delete. Maybe I should just grow the virtual disk under this host.

If you’re not too attached to the -h flag, you can use sort(1) to find the largest directory with a command like du -kxd 1 | sort -n.

Adding New UFS Storage

No matter how much planning you do, eventually your hard drives will fill up. You’ll need to add disks. Before you can use a new hard drive, you must partition the drive, create filesystems, mount those filesystems, and move data to them.

Give the design of your new disk partitioning and filesystems as much thought as you did the initial install. It’s much easier to partition disks correctly at install than to go back and repartition disks with data on them.

Partitioning the Disk

While you can partition the disk any way you like, I recommend that new disks use the same partitioning scheme as the rest of the host. Having one disk partitioned with MBR and one with GPT is annoying. I’ll use GPT for this example.

Decide how you want to divide the disk. This is a 1TB disk. 100GB will go to an expanded /tmp. I’ll dedicate 500GB to my new database partition. The remaining space gets partitioned off but labeled emergency. I won’t put a filesystem in that space; it’s there in case I need to do a full memory dump or have to put some files somewhere. I’m putting it right next to the database partition so I can grow the database partition if needed. I could leave the emergency space unpartitioned, but I want it to have a GPT label so that my fellow sysadmins realize this free space isn’t accidental.

Start by destroying any partitioning scheme on the disk and creating a GPT scheme.

# gpart destroy -F da3
da3 destroyed
# gpart create -s gpt da3
da3 created

Now create your 100GB /tmp and 500GB data partitions, and dump the rest into the emergency partition.

# gpart add -t freebsd-ufs -l tmp -s 100g da3
da3p1 added
# gpart add -t freebsd-ufs -l postgres -s 500g da3
da3p2 added
# gpart add -t freebsd-ufs -l emergency da3
da3p3 added

Check your work with gpart show.

# gpart show -lp da3
=>        40  1953525088    da3  GPT  (932G)
          40   209715200  da3p1  tmp  (100G)
   209715240  1048576000  da3p2  postgres  (500G)
  1258291240   695233888  da3p3  emergency  (332G)
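
You can sanity-check the sizes gpart(8) reports. Assuming 512-byte sectors, which matches the output above but might not match your disk, a size given in base-2 gigabytes converts as follows. The function name is invented for the sketch.

```sh
# Convert a size in GB (base 2, as in "-s 100g") to 512-byte sectors.
# 1GB is 1024 * 1024 * 1024 bytes, or 2,097,152 sectors.
gb_to_sectors() {
    echo $(( $1 * (1024 * 1024 * 1024 / 512) ))
}

gb_to_sectors 100    # the tmp partition
gb_to_sectors 500    # the postgres partition
```

The results, 209715200 and 1048576000, are the sizes shown for da3p1 and da3p2.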

Create filesystems on each partition.

# newfs -j /dev/gpt/tmp
# newfs -j /dev/gpt/postgres

As /tmp gets emptied at every boot, I would prefer not to use soft updates journaling on /tmp. Instead, I’d mount /tmp async and run newfs /dev/gpt/tmp at boot. Many times, newfs(8) is faster than rm(1).

Configuring /etc/fstab

Now tell /etc/fstab about your filesystems. We discussed the format of /etc/fstab in Chapter 10.

/dev/gpt/postgres  /usr/local/etc/postgres ufs  rw  0  2
/dev/gpt/tmp       /tmp                    ufs  rw  0  2

FreeBSD will recognize the filesystems at boot, or you can mount these new partitions at the command line. Don’t reboot or mount the partitions just yet, though. First you’ll want to move files to those filesystems.

Installing Existing Files onto New Disks

Chances are that you intend your new disk to replace or subdivide an existing partition. You’ll need to mount your new partition on a temporary mount point, move files to the new disk, then remount the partition at the desired location. While /tmp doesn’t have any files, if we’re installing a new database filesystem, we presumably have database files to put there.

Before moving files, shut down any process using them. You cannot successfully copy files that are being changed as you copy them. If you’re moving your database files, shut down your database. If you’re moving your mail spool, shut down all of your mail programs. This is a big part of why I recommend doing all new disk installations in single-user mode.

Now mount your new partition on a temporary mount point. That’s exactly what /mnt is for.

# mount /dev/gpt/postgres /mnt

Now you must move the files from their current location to the new disk without changing their permissions. This is fairly simple with tar(1). You can simply tar up your existing data to a tape or a file and untar it in the new location, but that’s kind of clumsy. Pipe one tar into another to avoid the middle step.

# tar cfC - /old/directory . | tar xpfC - /tempmount

If you don’t speak Unix at parties, this looks fairly stunning. Let’s dismantle it. First, we go to the old directory and tar up everything. Then, pipe the output to a second command, which extracts the backup in the new directory. When this command finishes, your files are installed on their new disk. For example, to move /usr/local/etc/postgres onto a new partition temporarily mounted at /mnt, you would do the following:

# tar cfC - /usr/local/etc/postgres . | tar xpfC - /mnt
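
The same pipeline can be wrapped in a function that also compares the two trees before anything gets deleted. This is a sketch under a few assumptions: it uses the dash-prefixed spelling of the same tar(1) options, copy_tree is an invented name, and it skips .snap on the expectation that a freshly created UFS filesystem has its own .snap directory. Note that diff -r compares file contents, not ownership or permissions.

```sh
# Copy everything under $1 into $2, preserving permissions,
# then compare the two trees. Returns nonzero if either step fails.
copy_tree() {
    src=$1 dst=$2
    tar -cf - -C "$src" . | tar -xpf - -C "$dst" || return 1
    # Ignore .snap, which normally exists only on the new filesystem.
    diff -r -x .snap "$src" "$dst"
}
```

For the database example, that would look like:

# copy_tree /usr/local/etc/postgres /mnt && echo "copy verified"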

Check the temporary mount point to be sure that your files are actually there. Once you’re confident that the files are properly moved, remove the files from the old directory and mount the disk in the new location. For example, after duplicating your files from /usr/local/etc/postgres, you’d run:

# rm -rf /usr/local/etc/postgres
# mkdir /usr/local/etc/postgres
# umount /mnt
# mount /usr/local/etc/postgres

You can now resume normal operation. I recommend rebooting to verify that everything comes back exactly as you intended.

Stackable Mounts

Maybe you don’t care about your old data; you want to split an existing filesystem only to get more space and you intend to recover your data from backup. That’s fine. All FreeBSD filesystems are stackable. This is an advanced idea that’s not terribly useful in day-to-day system administration, but it can bite you when you try to split one partition into two.

Suppose, for example, that you have data in /usr/src. See how much space is used on your disk, and then mount a new empty partition on /usr/src. If you look in the directory afterward, you’ll see that it’s empty.

Here’s the problem: the old filesystem still has all its original data on it. The new filesystem is mounted “above” the old filesystem, so you see only the new filesystem. The old filesystem has no more free space than before you moved the data. If you unmount the new filesystem and check the directory again, you’ll see the data miraculously restored! The new filesystem obscured the lower filesystem.

Although you can’t see the data, data on the old filesystem still takes up space. If you’re adding a filesystem to gain space, and you mount a new filesystem over part of the old, you won’t free any space on your original filesystem. The moral is: even if you’re restoring your data from backup, make sure that you remove that data from your original disk to recover disk space.
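
A simple guard against hiding data this way is to check that a mount point is empty before mounting over it. Here is a minimal sketch, with an invented function name:

```sh
# Succeed only if the directory exists and contains nothing,
# not even dotfiles. Anything left here would be hidden, and
# would keep using space, once a filesystem is mounted on top.
mountpoint_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}
```

You might use it like this, assuming /usr/src has an /etc/fstab entry:

# mountpoint_is_empty /usr/src && mount /usr/src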

Now that you can talk UFS, let’s explore ZFS.
