12
THE Z FILE SYSTEM

Most filesystems are, in computing terms, ancient. We discard 5-year-old hardware because it’s painfully slow, but we format the replacement’s hard drive with a 40-year-old filesystem. While we’ve improved those filesystems and made them more robust, they still use the same basic architecture. And every time a filesystem breaks, we curse and scramble to fix it while desperately wishing for something better.

ZFS is something better.

It’s not that ZFS uses revolutionary technology. All the individual pieces of ZFS are well understood. There’s no mystery to hashes or data trees or indexing. But ZFS combines all of these well-understood principles into a single cohesive, well-engineered whole. It’s designed with the future in mind. Today’s hashing algorithm won’t suffice 15 years from now, but ZFS is designed so that new algorithms and techniques can be added to newer versions without losing backward compatibility.

This chapter won’t cover all there is to know about ZFS. ZFS is almost an operating system on its own, or perhaps a special-purpose database. Entire books have been written about using and managing ZFS. You’ll learn enough about how ZFS works to use it on a server, though, and understand its most important features.

While ZFS expects to be installed directly on a disk partition, you can use other GEOM providers as ZFS storage. The most common example is when you do an install with encrypted disks. FreeBSD puts a geli(8) geom on the disk and installs ZFS atop that geom. This chapter calls any storage provider a “disk,” even though it could be a file or an encrypted provider or anything else.

If you’ve never worked with ZFS before, install a ZFS-based FreeBSD system on a virtual machine and follow along. The installer automatically handles prerequisites, like setting zfs_load=YES in loader.conf and zfs_enable=YES in rc.conf; all you need to concern yourself with is the filesystem.
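If you ever enable ZFS by hand on a system the installer didn’t set up, one way to add those two entries is sysrc(8); the file paths below are the standard FreeBSD locations.

# sysrc -f /boot/loader.conf zfs_load="YES"
# sysrc zfs_enable="YES"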

ZFS blends a whole bunch of well-understood technologies into a combination volume manager and filesystem. It expects to handle everything from the permissions on a file down to tracking which blocks on which storage provider get which information. As the sysadmin, you tell ZFS which hardware you have and how you want it configured, and ZFS takes it from there.

ZFS has three main components: datasets, pools, and virtual devices.

Datasets

A dataset is defined as a named chunk of ZFS data. The most common dataset resembles a partitioned filesystem, but ZFS supports other types of datasets for other uses. A snapshot (see “Snapshots” on page 271) is a dataset. ZFS also includes block devices for virtualization and iSCSI targets, clones, and more; all of those are datasets. This book focuses on filesystem datasets. Traditional filesystems like UFS have a variety of small programs to manage filesystems, but you manage all ZFS datasets with zfs(8).

View your existing datasets with zfs list. The output looks a lot like mount(8).

# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
zroot                   4.71G   894G    88K  none
zroot/ROOT              2.40G   894G    88K  none
zroot/ROOT/2018-11-17      8K   894G  1.51G  /
zroot/ROOT/default      2.40G   894G  1.57G  /
zroot/usr               1.95G   894G    88K  /usr
zroot/usr/home           520K   894G   520K  /usr/home
--snip--

Each line starts with the dataset name, beginning with the storage pool—or zpool—that the dataset is on. The first entry is called zroot. This entry represents the pool’s root dataset. The rest of the dataset tree dangles off this dataset.

The next two columns show the amount of space used and available. The pool zroot has used 4.71GB and has 894GB available. While the available space is certainly correct, the 4.71GB is more complicated than it looks. The amount of space a dataset shows under USED includes everything on that dataset and on all of its children. A root dataset’s children include all the other datasets in that zpool.

The REFER column is special to ZFS. This column shows the amount of data accessible on this specific dataset, which isn’t necessarily the same as the amount of space used. Some ZFS features, such as snapshots, share data between themselves. This dataset has used 4.71GB of data but refers to only 88KB. Without its children, this dataset has only 88KB of data on it.

At the end, we have the dataset’s mount point. This root dataset doesn’t have a mount point; it’s not mounted.

Look at the next dataset, zroot/ROOT. This is a dataset created for the root directory and associated files. That seems sensible, but if you look at the REFER column, you’ll see it also has only 88KB of data inside it, and there’s no mount point. Shouldn’t the root directory exist?

The next two lines explain why . . . sort of. The dataset zroot/ROOT/2018-11-17 has a mountpoint of /, so it’s a real root directory. The next dataset, zroot/ROOT/default, also has a mountpoint of /. No, ZFS doesn’t let you mount multiple datasets at the same mount point. A ZFS dataset records a whole bunch of its settings within the dataset. The mount point is one of those settings.

Consider these four datasets for a moment. The zroot/ROOT dataset is a child of the zroot dataset. The zroot/ROOT/2018-11-17 and zroot/ROOT/default datasets are children of zroot/ROOT. Each dataset has its children’s space usage billed against it.

Why do this? When you boot a FreeBSD ZFS host, you can easily choose between multiple root directories. Each bootable root directory is called a boot environment. Suppose you apply a patch and reboot the system, but the new system won’t boot. By booting into an alternate boot environment, you can easily access the defective root directory and try to figure out the problem.

The next dataset, zroot/usr, is a completely different child of zroot. It has its own child, zroot/usr/home. The space used in zroot/usr/home gets charged against zroot/usr, and both get charged against zroot, but their allocation doesn’t affect zroot/ROOT.

Dataset Properties

Beyond some accounting tricks, datasets so far look a lot like partitions. But a partition is a logical subdivision of a disk, filling very specific LBAs on a storage device. Partitions have no awareness of the data on the partition. Changing a partition means destroying the filesystem on it.

ZFS tightly integrates the filesystem and the lower storage layers. It can dynamically divide storage space between the various filesystems as needed. Where partitions control the number of available blocks to constrain disk usage, datasets can use quotas for the same effect. Without those quotas, though, if a pool has space, you can use it.

The amount of space a dataset can use is a ZFS property. ZFS supports dozens of properties, from the quota property that controls how large a dataset can grow to the mounted property that shows whether a dataset is mounted.

Viewing and Changing Dataset Properties

Use zfs set to change properties.

# zfs set quota=2G zroot/usr/home

View a property with zfs get. You can either specify a particular property or use all to view all properties. You can list multiple properties by separating them with commas. If you specify a dataset name, you see only that dataset.

# zfs get mounted zroot/ROOT
NAME        PROPERTY  VALUE  SOURCE
zroot/ROOT  mounted   no     -

Here, we have the dataset’s name, the property, the property value, and something called source. (We’ll talk about that last one in “Property Inheritance” on page 261.)
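For instance, to check the quota we just set alongside the mount point, in a single command:

# zfs get quota,mountpoint zroot/usr/home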

My real question is, which dataset is mounted as the root directory? I could check the two datasets with a mount point of /, but when I get dozens of boot environments, that will drive me nuts. Check a property for a dataset and all of its children by adding the -r flag.

# zfs get -r mounted zroot/ROOT
NAME                   PROPERTY  VALUE  SOURCE
zroot/ROOT             mounted   no     -
zroot/ROOT/2018-11-17  mounted   no     -
zroot/ROOT/default     mounted   yes    -

Of the three datasets, only zroot/ROOT/default is mounted. That’s our active boot environment.

Property Inheritance

Many properties are inheritable. You set them on the parent dataset and they percolate down through the children. Inheritance doesn’t make sense for properties like mount points, but it’s right for certain more advanced features. While we’ll look at what the compression property does in “Compression” on page 273, we’ll use it as an example of inheritance here.

# zfs get compression
NAME                     PROPERTY     VALUE     SOURCE
zroot                    compression  lz4       local
zroot/ROOT               compression  lz4       inherited from zroot
zroot/ROOT/2018-11-17    compression  lz4       inherited from zroot
zroot/ROOT/default       compression  lz4       inherited from zroot
zroot/tmp                compression  lz4       inherited from zroot
--snip--

The root dataset, zroot, has the compression property set to lz4. The source is local, meaning that this property is set on this dataset. Now look at zroot/ROOT. The compression property is also lz4, but the source is inherited from zroot. This dataset inherited this property setting from its parent.
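You can override an inherited property further down the tree: set it directly on the child and the SOURCE changes to local, or hand control back to the parent with zfs inherit. A quick sketch, using zroot/tmp as the test subject:

# zfs set compression=off zroot/tmp
# zfs get compression zroot/tmp
# zfs inherit compression zroot/tmp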

Managing Datasets

ZFS uses datasets much as traditional filesystems use partitions. Manage datasets with zfs(8). You’ll want to create, remove, and rename datasets.

Creating Datasets

Create datasets with zfs create. Create a filesystem dataset by specifying the pool and the dataset name. Here, I create a new dataset for my packages. (Note that this breaks boot environments, as we’ll see later this chapter.)

# zfs create zroot/usr/local

Each dataset must have a parent dataset. A default FreeBSD install has a zroot/usr dataset, so I can create a zroot/usr/local. I’d like to have a dataset for /var/db/pkg, but while FreeBSD comes with a zroot/var dataset, there’s no zroot/var/db. I’d need to create zroot/var/db and then zroot/var/db/pkg.
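Mechanically that’s nothing more than two zfs create commands, parent first, though as “Unmounted Parent Datasets” below shows, I’d really want that parent created with canmount=off:

# zfs create zroot/var/db
# zfs create zroot/var/db/pkg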

Note that datasets stack, just like UFS filesystems. If I have files in my /usr/local directory and I create a dataset over that directory, ZFS will mount the dataset over the directory, and I’ll lose access to those files. To give an existing directory its own dataset, you must shuffle the files around.

Destroying and Renaming Datasets

That new zroot/usr/local dataset I created? It hid the contents of my /usr/local directory. Get rid of it with zfs destroy and try again.

# zfs destroy zroot/usr/local

The contents of /usr/local reappear. Or, I could rename that dataset instead, using zfs rename.

# zfs rename zroot/usr/local zroot/usr/new-local

I like boot environments, though, so I’m going to leave /usr/local untouched. Sometimes you really need a /usr/local dataset, though . . .

Unmounted Parent Datasets

As a Postgres user, I want a separate dataset for my Postgres data. FreeBSD’s Postgres 9.6 package uses /var/db/postgres/data96. I can’t create that dataset without having a dataset for /var/db, and I can’t have that without breaking boot environment support for packages. What to do?

The solution is to create a dataset for /var/db, but not to use it, by setting the canmount dataset property. This property controls whether or not a dataset can be mounted. FreeBSD uses an unmounted dataset for /var for exactly this reason. New datasets automatically set canmount to on, so you normally don’t have to worry about it. Use the -o flag to set a property at dataset creation.

# zfs create -o canmount=off zroot/var/db

The dataset for /var/db exists, but it can’t be mounted. Check the contents of your /var/db directory to verify everything’s still there. You can now create a dataset for /var/db/postgres and even /var/db/postgres/data96.

# zfs create zroot/var/db/postgres
# zfs create zroot/var/db/postgres/data96
# chown -R postgres:postgres /var/db/postgres

You have a dataset for your database, and you still have the files in /var/db itself as part of the root dataset. Now initialize your new Postgres database and go!

As you explore ZFS, you’ll find many situations where you might want to set properties at dataset creation or use unmounted parent datasets.

Moving Files to a New Dataset

If you need to create a new dataset for an existing directory, you’ll need to copy the files over. I recommend you create a new dataset with a slightly different name, copy the files to that dataset, rename the directory, and then rename the dataset. Here, I want a dataset for /usr/local/pgsql, so I create it with a different name.

# zfs create zroot/usr/local/pgsql-new

Copy the files with tar(1), exactly as you would for a new UFS partition (see Chapter 11).

# tar cfC - /usr/local/pgsql . | tar xpfC - /usr/local/pgsql-new

Once it finishes, move the old directory out of the way and rename the dataset.

# mv /usr/local/pgsql /usr/local/pgsql-old
# zfs rename zroot/usr/local/pgsql-new zroot/usr/local/pgsql

My Postgres data now lives on its own dataset.

ZFS Pools

ZFS organizes its underlying storage in pools, rather than by disk. A ZFS storage pool, or zpool, is an abstraction of the underlying storage devices, letting you separate the physical medium and the user-visible filesystem on top of it.

View and manage a host’s ZFS pools with zpool(8). Here, I use zpool list to see the pools from one of my hosts.

# zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zroot     928G  4.72G   923G         -     0%     0%  1.00x  ONLINE  -
jail      928G  2.70G   925G         -     0%     0%  1.00x  ONLINE  -
scratch   928G  5.94G   922G         -     0%     0%  1.00x  ONLINE  -

This host has three pools: zroot, jail, and scratch. Each has its own line.

The SIZE column shows us the total capacity of the pool. All of these pools can hold 928GB. The ALLOC column displays how much of each pool is in use, while FREE shows how much space remains. These disks are pretty much empty, which makes sense as I installed this host only about three hours ago.

The EXPANDSZ column shows whether the underlying storage providers have any free space. When a pool has virtual device redundancy (which we’ll discuss in the next section), you can replace individual storage devices in the pool and make the pool larger. It’s like swapping out the 5TB drives in your RAID array with 10TB drives to make it bigger.

The FRAG column shows how much fragmentation this pool has. You’ve heard over and over that fragmentation slows performance. ZFS minimizes the impact of fragmentation, though.

The CAP column shows what percentage of the available space is used.

The DEDUP column shows whether this pool uses deduplication. While many people trumpet deduplication as a ZFS feature, it’s not as useful as you might hope.

The HEALTH column displays whether the pool is working well or the underlying disks have a problem.

Pool Details

You can get more detail on pools, or on a single pool, by running zpool status. If you omit the pool name, you’ll see this information for all of your pools. Here, I check the status of my jail pool.

# zpool status jail
  pool: jail
 state: ONLINE
  scan: none requested
config:

        NAME               STATE     READ WRITE CKSUM
        jail               ONLINE       0     0     0
          mirror-0         ONLINE       0     0     0
            gpt/da2-jail   ONLINE       0     0     0
            gpt/ada2-jail  ONLINE       0     0     0

errors: No known data errors

We start with the pool name. The state is much like the HEALTH column; it displays any problems with the pool. The scan field shows information on scrubs (see “Pool Integrity and Repair” on page 273).

We then have the pool configuration. The configuration shows the layout of the virtual devices in the pool. We’ll dive into that when we create our pools.

Pool Properties

Much like datasets, zpools have properties that control and display the pool’s settings. Some properties are inherently informational, such as the free property that expresses how much free space the pool has. You can change others.

Viewing Pool Properties

View pool properties with zpool get. Use the keyword all to view every property, and add a pool name to show only that pool.

# zpool get all zroot
NAME   PROPERTY       VALUE                          SOURCE
zroot  size           928G                           -
zroot  capacity       0%                             -
zroot  health         ONLINE                         -
zroot  guid           7955546176707282768            default
--snip--

Some of this information gets pulled into commands like zpool status and zpool list. You can also query for individual properties across all pools by using the property name.

# zpool get readonly
NAME     PROPERTY  VALUE   SOURCE
zroot    readonly  off     -
jail     readonly  off     -
scratch  readonly  off     -

Unlike dataset properties, most pool properties are set when you create or import the pool.
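For the handful of properties you can change on a live pool, zpool set works much like zfs set. For example, the autoexpand property (which lets a pool grow when you swap in bigger disks) and the free-form comment property can both be set after creation:

# zpool set autoexpand=on zroot
# zpool set comment="main system pool" zroot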

Virtual Devices

A virtual device (VDEV) is a group of storage devices. You might think of a VDEV as a RAID container: a big RAID-5 presents itself to the operating system as a huge device, even though the sysadmin knows it’s really a bunch of smaller disks. The virtual device is where ZFS’s magic happens. You can arrange pools for different levels of redundancy or abandon redundancy and maximize space.

ZFS’s automated error correction takes place at the VDEV level. Everything in ZFS, from znodes (index nodes) to data blocks, is checksummed to verify integrity. If your pool has sufficient redundancy, ZFS will notice that data is damaged and restore it from a good copy. If your pool lacks redundancy, ZFS will notify you that the data is damaged and you can restore from backup.

A zpool consists of one or more identical VDEVs. The pool stripes data across all the VDEVs, with no redundancy. The loss of a VDEV means the loss of the pool. If you have a pool with a whole bunch of disks, make sure to use redundant VDEVs.

VDEV Types and Redundancy

ZFS supports several different types of VDEV, each differentiated by the degree and style of redundancy it offers. The common mirrored disk, where each disk copies what’s on another disk, is one type of VDEV. A plain disk with no redundancy, called a stripe, is another. And ZFS includes three different varieties of sophisticated parity-based redundancy, called RAID-Z.

Using multiple VDEVs in a pool creates systems similar to advanced RAID arrays. A RAID-Z2 array looks an awful lot like RAID-6, but a ZFS pool with two RAID-Z2 VDEVs resembles RAID-60. Mirrored VDEVs work like RAID-1, but multiple mirrors in a pool behave like RAID-10. In both of these cases, ZFS stripes the data across the VDEV with no redundancy. The individual VDEVs provide the redundancy.

Choose your VDEV type carefully.

Striped VDEVs

A VDEV composed of a single disk is called a stripe and has no redundancy. Losing the disk means losing your data. While a pool can contain multiple striped VDEVs, each disk is its own VDEV. Much like RAID-0, losing one disk means losing the whole pool.

Mirror VDEVs

A mirror VDEV stores a complete copy of all the VDEV’s data on every disk. You can lose all but one of the drives in the VDEV and still access your data. A mirror can contain any number of disks.

ZFS can read data from all of the mirrored disks simultaneously, so reading data is fast. When you write data, though, ZFS must write that data to all of the disks simultaneously. The write isn’t complete until the slowest disk finishes. Write performance suffers.

RAID-Z

RAID-Z spreads data and parity information across all of the disks, much like conventional RAID. If a disk in a RAID-Z dies or starts giving corrupt data, RAID-Z uses the parity information to recalculate the missing data. A RAID-Z VDEV must contain at least three disks and can withstand the loss of any single disk. RAID-Z is sometimes called RAID-Z1.

You can’t add or remove disks in a RAID-Z. If you create a five-disk RAID-Z, it will remain a five-disk RAID-Z forever. Don’t go thinking you can add an additional disk to a RAID-Z for more storage. You can’t.

If you’re using disks over 2TB, there’s a nontrivial chance of a second drive failing as you repair the first drive. For large disks, you should probably consider RAID-Z2.

RAID-Z2

RAID-Z2 stripes parity and data across every disk in the VDEV, much like RAID-Z1, but doubles the amount of parity information. This means a RAID-Z2 can withstand the loss of up to two disks. You can’t add or remove disks from a RAID-Z2. It is slightly slower than RAID-Z.

A RAID-Z2 must have four or more disks.

RAID-Z3

Triple parity is for the most important data or those sysadmins with a whole bunch of disks and no time to fanny about. You can lose up to three disks in your RAID-Z3 without losing data. As with any other RAID-Z, you can’t add or remove disks from a RAID-Z3.

A RAID-Z3 must have five or more disks.

Log and Cache VDEVs

Pools can improve performance with special-purpose VDEVs. Only adjust or implement these if performance problems demand them; don’t add them proactively. Most people don’t need them, so I won’t go into details, but you should know they exist in case you get unlucky.

The ZFS Intent Log (ZIL) is ZFS’s filesystem journal. Pending writes get dumped to the ZIL and then arranged more properly in the primary pool. Every pool dedicates a chunk of disk space to its ZIL, but you can put the ZIL on a separate device instead, called a Separate Log (SLOG). You need faster writes? Install a really fast drive and dedicate it to the SLOG. The pool will dump all its initial writes to the fast device and then migrate those writes to the slower media as time permits. A dedicated fast SLOG will also smooth out bursty I/O.

The Level 2 Adaptive Replacement Cache (L2ARC) is like the SLOG but for reads. ZFS keeps the most recently accessed and the most frequently accessed data in memory. By adding a really fast device as an L2ARC, you expand the amount of data ZFS can provide from cache instead of calling from slow disk. An L2ARC is slower than memory but faster than the slow disk.
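If you ever do need one of these, bolting it onto an existing pool is a single zpool add command. The GPT labels below are made up for illustration; point the command at whatever fast device you’ve prepared.

# zpool add db log gpt/fastlog0
# zpool add db cache gpt/fastcache0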

RAID-Z and Pools

You can add VDEVs to a pool. You can’t add disks to a RAID-Z VDEV. Think about your storage needs and your hardware before creating your pools.

Suppose you have a server that can hold 20 hard drives, but you have only 12 drives. You create a single RAID-Z2 VDEV out of those 12 drives, thinking that you’ll add more drives to the pool later if you need them. You haven’t even finished installing the server, and already you’ve failed.

You can add multiple identical VDEVs to a pool. If you create a pool with a 12-disk VDEV, and the host can hold only another 8 disks, there’s no way to create a second identical VDEV. A 12-disk RAID-Z2 isn’t identical to an 8-disk RAID-Z2. You can force ZFS to accept the different VDEVs, but performance will suffer. Adding a VDEV to a pool is irreversible.

Plan ahead. Look at your physical gear. Decide how you will expand your storage. This 20-drive server would be fine with two 10-disk RAID-Z2 VDEVs, or one 12-disk pool and a separate 8-disk pool. Don’t sabotage yourself.

Once you know what sort of VDEV you want to use, you can create a pool.

Managing Pools

Now that you understand the different VDEV types and have indulged in planning your storage, let’s create some different types of zpools. Start by setting your disk block size.

ZFS and Disk Block Size

Chapter 10 covered how modern disks have two different sector sizes, 512 bytes and 4KB. While a filesystem can safely assume a disk has 4KB sectors, if your filesystem assumes the disk has 512-byte sectors and the disk really has 4KB sectors, your performance will plunge. ZFS, of course, assumes that disks have 512-byte sectors. If your disk really has 512-byte sectors, you’re good. If you’re not sure what size the physical sectors are, though, err on the side of caution and tell ZFS to use 4KB sectors. Control ZFS’s disk sector assumptions with the ashift property. An ashift of 9 tells ZFS to use 512-byte sectors, while an ashift of 12 indicates 4KB sectors. Control ashift with the sysctl vfs.zfs.min_auto_ashift.

# sysctl vfs.zfs.min_auto_ashift=12

Make this permanent by setting it in /etc/sysctl.conf.
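The sysctl.conf entry is exactly what you’d expect:

vfs.zfs.min_auto_ashift=12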

You must set ashift before creating a pool. Setting it after pool creation has no effect.

If you’re not sure what size sectors your disks have, use an ashift of 12. That’s what the FreeBSD installer does. You’ll lose a small amount of performance, but using an ashift of 9 on 4KB disks will drain system performance.

Now create your pools.

Creating and Viewing Pools

Create a pool with the zpool create command.

# zpool create poolname vdevtype disks...

If the command succeeds, you get no output back.

Here, I create a pool named db, using a mirror VDEV and two GPT-labeled partitions:

# zpool create db mirror gpt/zfs3 gpt/zfs4

The structure we assign gets reflected in the pool status.

# zpool status db
--snip--
config:

        NAME          STATE     READ WRITE CKSUM
        db            ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            gpt/zfs3  ONLINE       0     0     0
            gpt/zfs4  ONLINE       0     0     0
--snip--

The pool db contains a single VDEV, named mirror-0. It includes two partitions with GPT labels, /dev/gpt/zfs3 and /dev/gpt/zfs4. Both of those partitions are online.

If you don’t include a VDEV name, zpool(8) creates a striped pool with no redundancy. Here, I create a striped pool called scratch:

# zpool create scratch gpt/zfs3 gpt/zfs4

The pool status shows each VDEV, named after the underlying disk.

--snip--
        NAME        STATE     READ WRITE CKSUM
        scratch     ONLINE       0     0     0
          gpt/zfs3  ONLINE       0     0     0
          gpt/zfs4  ONLINE       0     0     0
--snip--

Creating any type of RAID-Z looks much like creating a mirror. Just use the correct VDEV type.

# zpool create db raidz gpt/zfs3 gpt/zfs4 gpt/zfs5

The pool status closely resembles that of a mirror, but with more disks in the VDEV.

Multi-VDEV Pools

When you’re creating a pool, the keywords mirror, raidz, raidz2, and raidz3 all tell zpool(8) to create a new VDEV. Any disks listed after one of those keywords go into that new VDEV. To create a pool with multiple VDEVs, you’d do something like this:

# zpool create poolname vdevtype disks... vdevtype disks...

Here, I create a pool containing two RAID-Z VDEVs, each with three disks:

# zpool create db raidz gpt/zfs3 gpt/zfs4 gpt/zfs5 raidz gpt/zfs6 gpt/zfs7 gpt/zfs8

A zpool status on this new pool will look a little different.

--snip--
        NAME          STATE     READ WRITE CKSUM
        db            ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            gpt/zfs3  ONLINE       0     0     0
            gpt/zfs4  ONLINE       0     0     0
            gpt/zfs5  ONLINE       0     0     0
          raidz1-1    ONLINE       0     0     0
            gpt/zfs6  ONLINE       0     0     0
            gpt/zfs7  ONLINE       0     0     0
            gpt/zfs8  ONLINE       0     0     0
--snip--

This pool contains a VDEV called raidz1-0 with three disks in it. There’s a second VDEV, named raidz1-1, with three disks in it. It’s very clear that these are identical VDEVs. Data gets striped across both VDEVs.

Destroying Pools

To destroy a pool, use zpool destroy and the pool name.

# zpool destroy db

Note that zpool doesn’t ask whether you’re really sure before destroying the pool. Being sure you want to destroy the pool is your problem, not zpool(8)’s.

Errors and -f

If you enter a command that doesn’t make sense, zpool(8) will complain.

# zpool create db raidz gpt/zfs3 gpt/zfs4 gpt/zfs5 raidz gpt/zfs6 gpt/zfs7
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: both 3-way and 2-way raidz vdevs are present

The first thing you see when reading the error message is “use '-f' to override the following errors.” Many sysadmins read this as “-f makes this problem go away.” What ZFS is really saying, though, is “Your command line is a horrible mistake. Add -f to do something unfixable, harmful to system stability, and that you’ll regret as long as this system lives.”

Most zfs(8) and zpool(8) error messages are meaningful, but you have to read them carefully. If you don’t understand the message, fall back on the troubleshooting instructions in Chapter 1. Often, reexamining what you typed will expose the problem.

In this example, I asked zpool(8) to create a pool with a RAID-Z VDEV containing three disks and a second RAID-Z VDEV containing only two disks. I screwed up this command line. Adding -f and proceeding to install my database to the new malformed db pool would only ensure that I have to recreate this pool and reinstall the database at a later date. (If you find yourself in this situation, investigate zfs send and zfs recv.)

Copy-On-Write

In both ordinary filesystems and ZFS, files exist as blocks on the disk. When you edit a file in a traditional filesystem, the filesystem picks up the block, modifies it, and sets it back down in the same place on the disk. A system problem halfway through that write can cause a shorn write: a file that’s 50 percent the old version, 50 percent the new version, and probably 100 percent unusable.

ZFS never overwrites the existing blocks in a file. When a file changes, ZFS identifies the blocks that must change and writes them to a new chunk of disk space. The old version is left intact. This is called copy-on-write (COW). With copy-on-write, a shorn write might lose the newest changes to the file, but the previous version of the file will remain intact.

Never corrupting files is a great benefit to copy-on-write, but COW opens up other possibilities. The metadata blocks are also copy-on-write, all the way up to the uberblocks that form the root of the ZFS pool’s data tree. ZFS creates snapshots by tracking the blocks that contain old versions of a file. While that sounds simple, the details are what will lead you astray.

Snapshots

A snapshot is a copy of a dataset as it existed at a specific instant. Snapshots are read-only and never change. You can access the contents of a snapshot to access older versions of files or even deleted files. While snapshots are read-only, you can roll the dataset back to the snapshot. Take a snapshot before upgrading a system, and if the upgrade goes horribly wrong, you can fall back to the snapshot. ZFS uses snapshots to provide many features, such as boot environments (see “Boot Environments” on page 276). Best of all, depending on your data, snapshots can take up only tiny amounts of space.

Every dataset has a bunch of metadata, all built as a tree from a top-level block. When you create a snapshot, ZFS duplicates that top-level block. One of those metadata blocks goes with the dataset, while the other goes with the snapshot. The dataset and the snapshot share the data blocks within the dataset.

Deleting, modifying, or overwriting a file on the live dataset means allocating new blocks for the new data and disconnecting blocks containing the old data. Snapshots need some of those old data blocks, however. Before discarding an old block, ZFS checks to see whether a snapshot still needs it. If a snapshot needs a block, but the dataset no longer does, ZFS keeps the block.

So, a snapshot is merely a list of which blocks the dataset used at the time the snapshot was taken. Creating a snapshot tells ZFS to preserve those blocks, even if the dataset no longer needs those blocks.

Creating Snapshots

Use the zfs snapshot command to create snapshots. Specify the dataset by its full path, then add @ and a snapshot name. I habitually name my snapshots after the date and time I create the snapshot, for reasons that will become clear by the end of this chapter.

I’m about to do maintenance on user home directories, removing old stuff to free up space. I’m pretty sure that someone will whinge about me removing their files, so I want to create a snapshot before cleaning up.

# zfs snapshot zroot/usr/home@2018-07-21-13:09:00

I don’t get any feedback. Did anything happen? View all your snapshots with the -t snapshot argument to zfs list.

# zfs list -t snapshot
NAME                                    USED  AVAIL  REFER  MOUNTPOINT
zroot/usr/home@2018-07-21-13:09:00        0      -  4.68G  -

The snapshot exists. The USED column shows that it uses zero disk space: it’s identical to the dataset it came from. As snapshots are read-only, the available space shown by AVAIL is just not relevant. The REFER column shows that this snapshot pulls in 4.68GB of disk space. If you check, you’ll see that’s the size of zroot/usr/home. Finally, the MOUNTPOINT column shows that this snapshot isn’t mounted.

This is an active system, and other people are logged into it. I wait a moment and check my snapshots again.

# zfs list -t snapshot
NAME                                    USED  AVAIL  REFER  MOUNTPOINT
zroot/usr/home@2018-07-21-13:09:00      96K      -  4.68G  -

The snapshot now uses 96KB. A user changed something on the dataset, and the snapshot gets charged with the space needed to maintain the difference.

Now I go on my rampage, get rid of the files I think are garbage, and check the snapshot list again.

# zfs list -t snapshot
NAME                                    USED  AVAIL  REFER  MOUNTPOINT
zroot/usr/home@2018-07-21-13:09:00     1.62G      -  4.68G  -

This snapshot now uses 1.62GB of space. Those are files that I’ve deleted but that are still available in the snapshot. I’ll keep this snapshot for a little while to give the users a chance to complain.
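If the whole cleanup turns out to be a mistake, I don’t have to copy files back by hand; zfs rollback returns the dataset to the state it was in when I took the snapshot. Rolling back throws away everything written since that snapshot, so be sure before you run it.

# zfs rollback zroot/usr/home@2018-07-21-13:09:00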

Accessing Snapshots

Every ZFS dataset has a hidden .zfs directory in its root. It won’t show up in ls(1); you have to know it exists. That directory has a snapshot directory, which contains a directory named after each snapshot. The contents of the snapshot are in that directory.

For our snapshot zroot/usr/home@2018-07-21-13:09:00, we’d go to /usr/home/.zfs/snapshot/2018-07-21-13:09:00. While the .zfs directory doesn’t show up in ls(1), once you’re in it, ls(1) works normally. That directory contains every file as it existed when I created the snapshot, even if I’ve deleted or changed that file since creating that snapshot.

Recovering a file from the snapshot requires only copying the file from the snapshot to a read-write location.
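Suppose a user mourns a file I deleted from their home directory. The path here is hypothetical, but the recovery is an ordinary cp(1) from the snapshot directory:

# cp /usr/home/.zfs/snapshot/2018-07-21-13:09:00/mwl/great-novel.txt /usr/home/mwl/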

Destroying Snapshots

A snapshot is a dataset, just like a filesystem-style dataset. Remove it with zfs destroy.

# zfs destroy zroot/usr/home@2018-07-21-13:09:00

The space used by the snapshot is now available for more junk files.

Compression

Snapshots aren’t the only way ZFS can save space. ZFS uses on-the-fly compression, transparently inspecting the contents of each file and squeezing its size if possible. With ZFS, your programs don’t need to compress their log files: the filesystem will do it for you in real time. While FreeBSD enables compression by default at install time, you’ll use it more effectively if you understand how it works.

Compression changes system performance, but probably not in the way you think it would. You’ll need CPU time to compress and decompress data as it goes to and from the disk. Compressed data means smaller disk reads and writes, however. You essentially exchange processor time for disk I/O. Every server I manage, whether bare metal or virtual, has far, far more processor capacity than disk I/O, so that’s a trade I’ll gleefully make. The end result is that using ZFS compression most often increases performance.

Compression works differently on different datasets. Binary files are already pretty tightly compressed; compressing /usr/bin doesn’t save much space. Compressing /var/log, though, often reduces file sizes by a factor of six or seven. Check the compressratio property to see how effectively compression shrinks your data. My hosts write to logs far more often than they write binaries. I’ll gleefully accept a sixfold reduction in disk writes for the most common task.
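Checking that ratio is a one-line zfs get. A default FreeBSD install gives logs their own dataset, so on such a system you’d run something like:

# zfs get compressratio zroot/var/log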

ZFS supports many compression algorithms, but the default is lz4. The lz4 algorithm is special in that it quickly recognizes incompressible files. When you write a binary to disk, lz4 looks at it and says, “Nope, I can’t help you,” and immediately quits trying. This eliminates pointless CPU load. It effectively compresses files that can be compressed, however.

Pool Integrity and Repair

Every piece of data in a ZFS pool has a checksum stored in its metadata to verify integrity. Every time you access a piece of data, ZFS recomputes the checksum of every block in that data. When ZFS discovers corrupt data in a pool with redundancy, it transparently corrects that data and proceeds. If ZFS discovers corrupt data in a pool without redundancy, it gives a warning and refuses to serve the data. If your pool has identified any data errors, they’ll show up in zpool status.

Integrity Verification

In addition to the on-the-fly verification, ZFS can explicitly walk the entire filesystem tree and verify every chunk of data in the pool. This is called a scrub. Unlike UFS’s fsck(8), scrubs happen while the pool is online and in use. If you’ve previously run a scrub, that will also show up in the pool status.

  scan: scrub repaired 0 in 8h3m with 0 errors on Fri Jul 21 14:17:29 2017

To scrub a pool, run zpool scrub and give the pool name.

# zpool scrub zroot

You can watch the progress of the scrub with zpool status.

Scrubbing a pool reduces its performance. If your system is already pushing its limits, scrub pools only during off hours. You can cancel a scrub with the -s option.

# zpool scrub -s zroot

Run another scrub once the load drops.

Repairing Pools

Disks fail. That’s what they’re for. The point of redundancy is that you can replace failing or flat-out busted disks with working disks and restore redundancy.

Mirror and RAID-Z virtual devices are specifically designed to reconstruct the data lost when a disk fails. They’re much like RAID in that regard. If one disk in a ZFS mirror dies, you replace the dead disk, and ZFS copies the surviving mirror onto the new disk. If a disk in a RAID-Z VDEV fails, you replace the busted drive, and ZFS rebuilds the data on that disk from parity data.

In ZFS, this reconstruction is called resilvering. Like other ZFS integrity operations, resilvering takes place only on live filesystems. Resilvering isn’t quite like rebuilding a RAID disk from parity, as ZFS leverages its knowledge of the filesystem to optimize repopulating the replacement device. Resilvering begins automatically when you replace a failed device. ZFS resilvers at a low priority so that it doesn’t interfere with normal operations.

Pool Status

The zpool status command shows the health of the underlying storage hardware in the STATE field. We’ve seen a couple examples of healthy pools, so let’s take a look at an unhealthy pool.

# zpool status db
  pool: db
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        db                          DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            gpt/zfs1                ONLINE       0     0     0
            14398195156659397932    UNAVAIL      0     0     0   was /dev/gpt/zfs3

errors: No known data errors

The pool state is DEGRADED. If you look further down the output, you’ll see more DEGRADED entries and an UNAVAIL. What exactly does that mean?

Errors in a pool percolate upward. The pool state is a summary of the health of the pool as a whole. The whole pool shows up as DEGRADED because the pool’s virtual device mirror-0 is DEGRADED. This error comes from an underlying disk being in the UNAVAIL state. We get the ZFS GUID for this disk, and the label used to create the pool.

ZFS pools show an error when an underlying device has an error. When a pool has a state other than ONLINE, dig through the VDEV and disk listings until you find the real problem.

Pools, VDEVs, and disks can have six states:

ONLINE The device is functioning normally.

DEGRADED The pool or VDEV has at least one provider missing, offline, or generating errors more quickly than ZFS tolerates. Redundancy is handling the error, but you need to address this right now.

FAULTED A faulted disk is corrupt or generating errors more quickly than ZFS can tolerate. A VDEV faults when it loses so many providers that it can no longer reconstruct your data. A two-disk mirror with two bad disks faults.

UNAVAIL ZFS can’t open the disk. Maybe it’s been removed, shut off, or that iffy cable finally failed. It’s not there, so ZFS can’t use it.

OFFLINE This device has been deliberately turned off.

REMOVED Some hardware detects when a drive is physically removed while the system is running, letting ZFS set the REMOVED flag. When you plug the drive back in, ZFS tries to reactivate the disk.

Our missing disk is in the UNAVAIL state. For whatever reason, ZFS can’t access /dev/gpt/zfs3, but the disk mirror is still serving data because it has a working disk. Here’s where you get to run around to figure out where that disk went. How you manage ZFS depends on what you discover.

Reattaching and Detaching Drives

Unavailable drives might not be dead. They might be disconnected. If you wiggle a drive tray and suddenly get a green light, the disk is fine but the connection is faulty. You should address that hardware problem, yes, but in the meantime, you can reactivate the drive. You can also reactivate deliberately removed drives. Use the zpool online command with the pool name and the GUID of the missing disk as arguments. If the disk in my example pool were merely disconnected, I could reactivate it like so:

# zpool online db 14398195156659397932

ZFS resilvers the drive and resumes normal function.

If you want to remove a drive, you can tell ZFS to take it offline with zpool offline. Give the pool and disk names as arguments.

# zpool offline db gpt/zfs6

Bringing disks offline, physically moving them, bringing them back online, and allowing the pools to resilver will let you migrate large storage arrays from one SAS cage to another without downtime.
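The other half of this section’s title: to permanently remove a disk from a mirror VDEV, rather than just take it offline, use zpool detach with the pool and disk names. The surviving disks keep serving data, just with less redundancy.

# zpool detach db gpt/zfs6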

Replacing Drives

If the drive isn’t merely loose but flat-out busted, you’ll need to replace it with a new drive. ZFS lets you replace drives in several ways, but the most common is using zpool replace. Use the pool name, the failed provider, and the new provider as arguments. Here, I replace the db pool’s /dev/gpt/zfs3 disk with /dev/gpt/zfs6:

# zpool replace db gpt/zfs3 gpt/zfs6

The pool will resilver itself and resume normal operation.

In a large storage array, you can also use successive zpool replace operations to empty a disk shelf. Only do this if your organization’s operation requirements don’t allow you to offline and online disks.

Boot Environments

ZFS helps us cope with one of the most dangerous things sysadmins do. No, not our eating habits. No, not a lack of exercise. I’m talking about system upgrades. When an upgrade goes well, everybody’s happy. When the upgrade goes poorly, it can ruin your day, your weekend, or your job. Nobody likes restoring from backup when the mission-critical software chokes on the new version of a shared library.

Through the magic of boot environments, ZFS takes advantage of snapshots to let you fall back from a system upgrade with only a reboot. A boot environment is a clone of the root dataset. It includes the kernel, the base system userland, the add-on packages, and the core system databases. Before running an upgrade, create a boot environment. If the upgrade goes well, you’re good. If the upgrade goes badly, though, you can reboot into the boot environment. This restores service while you investigate how the upgrade failed and what you can do to fix those problems.

Boot environments do not work when a host requires a separate boot pool. The installer handles boot pools for you. They appear when combining UEFI and GELI, or when using ZFS on an MBR-partitioned disk.

Using boot environments requires a boot environment manager. I recommend beadm(8), available as a package.

# pkg install beadm

You’re now ready to use boot environments.

Viewing Boot Environments

Each boot environment is a dataset under zroot/ROOT. A system where you’ve just installed beadm should have only one boot environment. Use beadm list to view them all.

# beadm list
BE        Active  Mountpoint  Space  Created
default   NR      /           2.4G   2018-05-04 13:13

This host has one boot environment, named default, after the dataset zroot/ROOT/default.

The Active column shows whether this boot environment is in use. An N means that the environment is now in use. An R means that this environment will be active after a reboot. They appear together when the default environment is running.

The Mountpoint column shows the location of this boot environment’s mount point. Most boot environments aren’t mounted unless they’re in use, but you can use beadm(8) to mount an unused boot environment.

The Space column shows the amount of disk space this boot environment uses. It’s built on a snapshot, so the dataset probably has more data than this amount in it.

The Created column shows the date this boot environment was created. In this case, it’s the date the machine was installed.

Before changing the system, create a new boot environment.

Creating and Accessing Boot Environments

Each boot environment needs a name. I recommend names based on the current operating system version and patch level or the date. Names like “beforeupgrade” and “dangitall,” while meaningful in the moment, will only confuse you later.

Use beadm create to make your new boot environment. Here, I check the current FreeBSD version, and use that to create the boot environment name:

# freebsd-version
11.0-RELEASE-p11
# beadm create 11.0-p11
Created successfully

I now have two identical boot environments.

# beadm list
BE         Active Mountpoint  Space Created
default    NR     /           12.3G 2015-04-28 11:53
11.0-p11   -      -          236.0K 2018-07-21 14:57

You might notice that the new boot environment already takes up 236KB. This is a live system. Between when I created the boot environment and when I listed those environments, the filesystem or its metadata changed.

The Active column shows that we’re currently using the default boot environment and that we’ll be using that on the next boot. If I change my installed packages or upgrade the base system, those changes will affect the default environment.

Each boot environment is a dataset under zroot/ROOT. If you want to access an inactive boot environment read-write, use beadm mount to temporarily mount the boot environment under /tmp. Unmount those environments with beadm umount.
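A hypothetical example, mounting the freshly created environment to poke around in it and then putting it away again:

# beadm mount 11.0-p11 /mnt
# beadm umount 11.0-p11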

Activating Boot Environments

Suppose you upgrade your packages and the system goes belly-up. Fall back to an earlier operating system install by activating a boot environment and rebooting. Activate a boot environment with beadm activate.

# beadm activate 11.0-p11
Activated successfully
# beadm list
BE         Active Mountpoint  Space Created
default    N      /           12.4G 2015-04-28 11:53
11.0-p11   R      -          161.8M 2018-07-21 14:57

The default boot environment has its Active flag set to N, meaning it’s now running. The 11.0-p11 environment has the R flag, so after a reboot it will be live.

Reboot the system and suddenly you’ve fallen back to the previous operating system install, without the changes that destabilized your system. That’s much simpler than restoring from backup.

Removing Boot Environments

After a few upgrades, you’ll find that you’ll never fall back to some of the existing boot environments. Once I upgrade this host to, say, 12.2-RELEASE-p29, chances are I’ll never ever reboot into 11.0-p11 again. Remove obsolete boot environments and free up their disk space with beadm destroy.

# beadm destroy 11.0-p11
Are you sure you want to destroy '11.0-p11'?
This action cannot be undone (y/[n]): y
Destroyed successfully

Answer y when prompted, and beadm will remove the boot environment.

Boot Environments at Boot

So you’ve truly hosed your operating system. Forget getting to multiuser mode, you can’t even hit single-user mode without generating a spew of bizarre error messages. You can select a boot environment right at the loader prompt. This requires console access, but so would any other method of rescuing yourself.

The boot loader menu includes an option to select a boot environment. Choose that option. You’ll get a new menu listing every boot environment on the host by name. Choose your new boot environment and hit ENTER. The system will boot into that environment, giving you a chance to figure out why everything went sideways.

Boot Environments and Applications

It’s not enough that your upgrade failed. It might take your application data with it.

Most applications store their data somewhere in the root dataset. MySQL uses /var/db/mysql, while Apache uses /usr/local/www. This means that falling back to an earlier boot environment can revert your application data with the environment. Depending on your application, you might want that reversion—or not.

If an application uses data that shouldn’t be included in the boot environment, you need to create a new dataset for that data. I provided an example in “Unmounted Parent Datasets” on page 262 earlier this chapter. Consider your application’s needs and separate out your data as appropriate.

While ZFS has many more features, this covers the topics every sysadmin must know. Many of you would find clones, delegations, or replication useful. You might find the books FreeBSD Mastery: ZFS (Tilted Windmill Press, 2015) and FreeBSD Mastery: Advanced ZFS (Tilted Windmill Press, 2016) by Allan Jude and yours truly helpful. You’ll also find many resources on the internet documenting all of these topics.

Now let’s consider some other filesystems FreeBSD administrators find useful.
