Chapter 15. The Page Cache and Page Writeback

The Linux kernel implements a primary disk cache called the page cache. The goal of this cache is to minimize disk I/O by storing in physical memory data that would otherwise be accessed from disk. This chapter deals with the page cache and page writeback.

Disk caches are beneficial for two reasons. First, disk access is several orders of magnitude slower than memory access, so accessing data from memory rather than from the disk is much faster. Second, data accessed once will, with a high likelihood, be accessed again in the near future. This principle, that access to a particular piece of data tends to be clustered in time, is called temporal locality. Temporal locality ensures that if data is cached on its first access, there is a high probability of a cache hit (access to data that is in the cache) in the near future.

The page cache consists of physical pages in RAM. Each page in the cache corresponds to multiple blocks on the disk. Whenever the kernel begins a page I/O operation (a disk operation in page-size chunks, usually to a regular file), it first checks whether the requisite data is in the page cache. If it is, the kernel can forgo accessing the disk and use the data straight from the page cache.

Individual disk blocks can also tie into the page cache, by way of block I/O buffers. Recall from Chapter 13, “The Block I/O Layer,” that a buffer is the in-memory representation of a single physical disk block. Buffers act as descriptors that map pages in memory to disk blocks; thus, the page cache also reduces disk access during block I/O operations by both caching disk blocks and buffering block I/O operations until later. This caching is often referred to as the “buffer cache,” although in reality it is not a separate cache and is part of the page cache.

Let's look at the sort of operations and data that end up in the page cache. The page cache is primarily populated by page I/O operations, such as read() and write(). Page I/O operations manipulate entire pages of data at a time; this entails operations on more than one disk block. Consequently, the page cache caches page-size chunks of files.

Block I/O operations manipulate a single disk block at a time. A common block I/O operation is reading and writing inodes. The kernel provides the bread() function to perform a low-level read of a single block from disk. Via buffers, disk blocks are mapped to their associated in-memory pages and, thus, cached in the page cache.

For example, when you first open a source file in a text editor, data from the file is read into memory from disk. As you edit the file, more and more pages are read in. When you later compile the file, the kernel can use the pages directly from the page cache; it need not reread the file from disk. Because users tend to read and manipulate the same files repeatedly, the page cache reduces the need for a large number of disk operations.

Page Cache

The page cache, as its name suggests, is a cache of pages. The pages originate from reads and writes of regular filesystem files, block device files, and memory-mapped files. In this manner, the page cache contains entire pages from recently accessed files. During a page I/O operation, such as read()[1], the kernel checks whether the data resides in the page cache. If the data is in the page cache, the kernel can quickly return the requested page rather than read the data off the disk.

The address_space Object

A physical page might comprise multiple noncontiguous physical blocks[2]. Checking the page cache to see whether certain data has been cached is rendered more difficult because of the noncontiguous nature of the blocks that constitute each page. Therefore, it is not possible to index the data in the page cache using only a device name and block number, which would otherwise be the simplest solution.

Furthermore, the Linux page cache is quite general in what pages it can cache. Indeed, the original page cache introduced in System V Release 4 cached only filesystem data. Consequently, the SVR4 page cache used its equivalent of the file object (called struct vnode) to manage the page cache. The Linux page cache aims to cache any page-based object, which includes many forms of files and memory mappings.

To remain generic, the Linux page cache uses the address_space structure to identify pages in the page cache. This structure is defined in <linux/fs.h>:

struct address_space {
        struct inode            *host;              /* owning inode */
        struct radix_tree_root  page_tree;          /* radix tree of all pages */
        spinlock_t              tree_lock;          /* page_tree lock */
        unsigned int            i_mmap_writable;    /* count of VM_SHARED mappings */
        struct prio_tree_root   i_mmap;             /* tree of private and shared mappings */
        struct list_head        i_mmap_nonlinear;   /* list of VM_NONLINEAR mappings */
        spinlock_t              i_mmap_lock;        /* i_mmap lock */
        atomic_t                truncate_count;     /* covers races with truncate */
        unsigned long           nrpages;            /* total number of pages */
        pgoff_t                 writeback_index;    /* writeback start offset */
        struct address_space_operations   *a_ops;   /* operations table */
        unsigned long           flags;              /* gfp_mask and error flags */
        struct backing_dev_info *backing_dev_info;  /* read-ahead information */
        spinlock_t              private_lock;       /* private lock */
        struct list_head        private_list;       /* private list */
        struct address_space    *assoc_mapping;     /* associated buffers */
};

The i_mmap field is a priority search tree of all shared and private mappings in this address space. A priority search tree is a clever mix of heaps and radix trees[3].

The nrpages field stores the total number of pages in the address space.

The address_space is associated with some kernel object. Normally, this is an inode. If so, the host field points to the associated inode. The host field is NULL if the associated object is not an inode; for example, if the address_space is associated with the swapper.
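
For concreteness, kernel code typically reaches an address_space through the object that owns it rather than constructing one. Here is a minimal, hedged sketch using the i_mapping and f_mapping pointers found in the 2.6 struct inode and struct file; the variable names are illustrative:

struct address_space *mapping;

/* via the inode that owns the cached data ... */
mapping = inode->i_mapping;

/* ... or via an open file; for a regular file this is the same object */
mapping = file->f_mapping;

/* for a regular file, the mapping points back at its owning inode: */
/* mapping->host == inode */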

The a_ops field points to the address space operations table, in the same manner as the VFS objects and their operations tables. The operations table is represented by struct address_space_operations and is also defined in <linux/fs.h>:

struct address_space_operations {
        int (*writepage)(struct page *, struct writeback_control *);
        int (*readpage) (struct file *, struct page *);
        int (*sync_page) (struct page *);
        int (*writepages) (struct address_space *, struct writeback_control *);
        int (*set_page_dirty) (struct page *);
        int (*readpages) (struct file *, struct address_space *,
                          struct list_head *, unsigned);
        int (*prepare_write) (struct file *, struct page *, unsigned, unsigned);
        int (*commit_write) (struct file *, struct page *, unsigned, unsigned);
        sector_t (*bmap)(struct address_space *, sector_t);
        int (*invalidatepage) (struct page *, unsigned long);
        int (*releasepage) (struct page *, int);
        int (*direct_IO) (int, struct kiocb *, const struct iovec *,
                          loff_t, unsigned long);
};

The readpage() and writepage() methods are most important. Let's look at the steps involved in a page read operation.

First, the kernel searches the page cache for the desired data. The search takes an address_space plus offset pair:

page = find_get_page(mapping, index);

Here, mapping is the given address_space and index is the desired position in the file, expressed in units of pages.
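
As a hedged illustration (not the exact kernel code), the page index and the byte offset within that page would be derived from a byte position pos roughly as follows; PAGE_CACHE_SHIFT and PAGE_CACHE_MASK come from <linux/pagemap.h>:

pgoff_t index = pos >> PAGE_CACHE_SHIFT;          /* which page of the file holds pos */
unsigned long offset = pos & ~PAGE_CACHE_MASK;    /* byte offset of pos within that page */

page = find_get_page(mapping, index);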

If the page does not exist in the cache, a new page is allocated and added to the page cache:

struct page *cached_page;
int error;

cached_page = page_cache_alloc_cold(mapping);
if (!cached_page)
        /* error allocating memory */
error = add_to_page_cache_lru(cached_page, mapping, index, GFP_KERNEL);
if (error)
        /* error adding page to page cache */

Finally, the requested data can be read from disk, added to the page cache, and returned to the user:

error = mapping->a_ops->readpage(file, page);
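
Putting these pieces together, here is a simplified, hedged sketch of the read side, loosely modeled on the generic path in mm/filemap.c; the readahead logic and several error and race checks are omitted, so treat it as an outline rather than the actual implementation:

struct page *page;
int error;

page = find_get_page(mapping, index);
if (!page) {
        /* not cached: allocate a page and insert it into the page cache */
        page = page_cache_alloc_cold(mapping);
        if (!page)
                return -ENOMEM;
        error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
        if (error) {
                page_cache_release(page);
                return error;
        }
        /* ask the filesystem to fill the page from disk */
        error = mapping->a_ops->readpage(file, page);
        if (error) {
                page_cache_release(page);
                return error;
        }
}

/* readpage() may complete asynchronously; wait until the page is up to date */
lock_page(page);
if (!PageUptodate(page)) {
        unlock_page(page);
        page_cache_release(page);
        return -EIO;
}
unlock_page(page);

/* ... copy the data out of the page ... */
page_cache_release(page);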

Write operations are a bit different. For file mappings, whenever a page is modified, the VM simply calls

SetPageDirty(page);

The kernel later writes the page out via the writepage() method. Write operations invoked via the write() system call, however, are more complicated. Basically, the generic write path in mm/filemap.c performs the following steps:

page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
status = a_ops->prepare_write(file, page, offset, offset+bytes);
page_fault = filemap_copy_from_user(page, offset, buf, bytes);
status = a_ops->commit_write(file, page, offset, offset+bytes);

First, the page cache is searched for the desired page. If it is not in the cache, an entry is allocated and added. Next, the prepare_write() method is called to set up the write request. The data is then copied from user-space into a kernel buffer. Finally, the commit_write() method is called to commit the new data; the now-dirty page is written back to disk later.
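
As a hedged sketch, the per-page portion of that path looks roughly like the following; offset and bytes describe the part of the page being written, the helper names match those shown above, and the surrounding loop, error paths, and the lru_pvec bookkeeping are omitted:

offset = pos & (PAGE_CACHE_SIZE - 1);   /* byte offset within the page */
index = pos >> PAGE_CACHE_SHIFT;        /* which page of the file */
bytes = PAGE_CACHE_SIZE - offset;       /* write at most to the end of this page */
if (bytes > count)
        bytes = count;

page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
status = a_ops->prepare_write(file, page, offset, offset + bytes);
if (!status) {
        page_fault = filemap_copy_from_user(page, offset, buf, bytes);
        status = a_ops->commit_write(file, page, offset, offset + bytes);
}

unlock_page(page);              /* __grab_cache_page returned the page locked */
page_cache_release(page);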

Because the previous steps are performed during all page I/O operations, all page I/O is guaranteed to go through the page cache. Consequently, the kernel attempts to satisfy all read requests from the page cache. If this fails, the page is read in from disk and added to the page cache. For write operations, the page cache acts as a staging ground for the writes. Therefore, all written pages are also added to the page cache.

Radix Tree

Because the kernel must check for the existence of a page in the page cache before initiating any page I/O, such a check must be quick. Otherwise, the overhead of searching and checking the page cache could nullify any benefits the cache might provide (at least if the cache hit rate is low—the overhead would have to be awful to cancel out the benefit of retrieving the data from memory in lieu of disk).

As you saw in the previous section, the page cache is searched via the address_space object plus an offset value. Each address_space has a unique radix tree stored as page_tree. A radix tree is a type of trie. The radix tree enables very quick searching for the desired page, given only the file offset. Page cache searching functions such as find_get_page() call radix_tree_lookup(), which performs a search on the given tree for the given object.

The core radix tree code is available in generic form in lib/radix-tree.c. Users of the radix tree need to include <linux/radix-tree.h>.
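
To make the connection concrete, here is a hedged sketch of what a lookup such as find_get_page() boils down to; the real function lives in mm/filemap.c and its locking details vary between 2.6 releases, so treat this as an approximation:

struct page * find_get_page(struct address_space *mapping, unsigned long offset)
{
        struct page *page;

        spin_lock_irq(&mapping->tree_lock);
        page = radix_tree_lookup(&mapping->page_tree, offset);
        if (page)
                page_cache_get(page);   /* grab a reference before dropping the lock */
        spin_unlock_irq(&mapping->tree_lock);
        return page;
}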

The Old Page Hash Table

Prior to the 2.6 kernel, the page cache was not searched via a radix tree. Instead, a global hash was maintained over all the pages in the system. The hash returned a doubly linked list of entries that hashed to the same given value. If the desired page was in the cache, one of the items in the list was the corresponding page. Otherwise, the page was not in the page cache and the lookup returned NULL.

The global hash had four primary problems:

  • A single global lock protected the hash. Lock contention was quite high on even moderately sized machines, and performance suffered as a result.

  • The hash was larger than necessary because it contained all the pages in the page cache, whereas only pages pertaining to the current file were relevant.

  • Performance when the hash lookup failed (that is, the given page was not in the page cache) was slower than desired, particularly because it was necessary to walk the chains off of a given hash value.

  • The hash consumed more memory than other possible solutions.

The introduction of a radix tree-based page cache in 2.6 solved these issues.

The Buffer Cache

Linux no longer has a distinct buffer cache. Way back in the 2.2 kernel, there were two separate disk caches: the page cache and the buffer cache. The former cached pages; the latter cached buffers. The two caches were not unified in the least; a disk block could exist in both caches simultaneously. This led to extensive effort in synchronization between the two cached copies—not to mention wasted memory.

This was the case in the 2.2 Linux kernel and earlier, but starting with the 2.4 Linux kernel the two caches were unified. Today, we have one disk cache: the page cache.

The kernel still needs to use buffers, however, to represent disk blocks in memory. Thankfully, a buffer simply describes the mapping of a disk block onto a page, which is itself in the page cache.
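
As a reminder of what such a buffer looks like, here is an abbreviated, hedged excerpt of struct buffer_head from <linux/buffer_head.h>, showing only the fields that tie a disk block to a page in the page cache; the full structure, covered in Chapter 13, contains several more fields:

struct buffer_head {
        ...
        struct buffer_head   *b_this_page;  /* circular list of the page's buffers */
        struct page          *b_page;       /* the page cache page holding this block */
        sector_t              b_blocknr;    /* the block's starting block number on disk */
        char                 *b_data;       /* pointer to the block's data within b_page */
        struct block_device  *b_bdev;       /* the block device the block resides on */
        ...
};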

The pdflush Daemon

Write operations are deferred in the page cache. When data in the page cache is newer than the data on the backing store, that data is called dirty. Dirty pages that accumulate in memory eventually need to be written back to disk. Dirty page writeback occurs in two situations:

  • When free memory shrinks below a specified threshold, the kernel must write dirty data back to disk to free memory.

  • When dirty data grows older than a specific threshold, sufficiently old data is written back to disk, to ensure that dirty data does not remain dirty indefinitely.

These two jobs have rather different goals. In fact, in older kernels they were performed by two separate kernel threads (see the following section). In 2.6, however, a gang[4] of kernel threads, the pdflush background writeback daemons (or, simply, the pdflush threads), performs these jobs. Rumor has it that pdflush is short for “dirty page flush.” Ignore the confusing name; let's look at each of these goals in more detail.

First, the pdflush threads need to flush dirty data to disk when the amount of free memory in the system shrinks below a specified level. The goal of this background writeback is to regain memory consumed by dirty pages when available physical memory is low. The threshold is configurable via the dirty_background_ratio sysctl. When the amount of dirty memory exceeds this percentage of total memory, the kernel invokes the wakeup_bdflush()[5] call to wake up a pdflush thread and have it run the background_writeout() function to begin writeback of dirty pages. This function takes a lone parameter: the number of pages to attempt to write back. The function continues writing out data until two conditions are true:

  • The specified minimum number of pages has been written out.

  • The amount of dirty memory has fallen below the dirty_background_ratio threshold.

These conditions ensure that pdflush does its part to relieve low-memory conditions. Writeback stops prior to these conditions only if pdflush writes back all the dirty pages and there is nothing left to do.
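
A hedged, simplified sketch of this loop, loosely modeled on background_writeout() in mm/page-writeback.c; the helper and field names follow the 2.6 sources, but congestion handling and several other details are omitted:

static void background_writeout(unsigned long _min_pages)
{
        long min_pages = _min_pages;
        struct writeback_control wbc = {
                .sync_mode   = WB_SYNC_NONE,    /* do not wait on each page */
                .nonblocking = 1,               /* skip congested queues */
        };

        for (;;) {
                struct writeback_state wbs;
                long background_thresh, dirty_thresh;

                get_dirty_limits(&wbs, &background_thresh, &dirty_thresh);
                /* stop once enough pages are written and dirty memory is below the threshold */
                if (wbs.nr_dirty + wbs.nr_unstable < background_thresh && min_pages <= 0)
                        break;

                wbc.nr_to_write = MAX_WRITEBACK_PAGES;
                writeback_inodes(&wbc);
                min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
                if (wbc.nr_to_write > 0)
                        break;          /* nothing left to write back */
        }
}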

For its second goal, pdflush periodically wakes up (unrelated to low-memory conditions) and writes out very old dirty pages. This is done to ensure that no dirty pages remain in memory indefinitely. During a system failure, because memory is volatile, dirty pages in memory that have not been written to disk are lost. Consequently, periodically synchronizing the page cache with the disk is important. On system boot, a timer is initialized to wake up a pdflush thread and have it run the wb_kupdate() function. This function then writes back all data that was modified longer than dirty_expire_centisecs hundredths of a second ago. The timer is then reinitialized to expire again in dirty_writeback_centisecs hundredths of a second. In this manner, the pdflush threads periodically wake up and write to disk all dirty pages that are older than a specified limit.
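
A similarly hedged sketch of the periodic side, loosely following wb_kupdate() in mm/page-writeback.c; the real function batches its writeback and handles queue congestion, which is omitted here, and the tunable variable names match the early 2.6 sources:

static void wb_kupdate(unsigned long arg)
{
        unsigned long oldest_jif, next_jif;
        struct writeback_control wbc = {
                .sync_mode       = WB_SYNC_NONE,
                .older_than_this = &oldest_jif, /* write back only pages dirtied before this */
                .nonblocking     = 1,
        };

        /* anything dirtied more than dirty_expire_centisecs ago is considered expired */
        oldest_jif = jiffies - (dirty_expire_centisecs * HZ) / 100;
        next_jif = jiffies + (dirty_writeback_centisecs * HZ) / 100;

        wbc.nr_to_write = MAX_WRITEBACK_PAGES;
        writeback_inodes(&wbc);

        /* rearm the timer so a pdflush thread wakes up again later */
        mod_timer(&wb_timer, next_jif);
}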

The system administrator may set these values either in /proc/sys/vm or via sysctl. Table 15.1 lists the variables.

Table 15.1. pdflush Settings

dirty_background_ratio
    As a percentage of total memory, the number of pages at which the pdflush threads will begin writeback of dirty data.

dirty_expire_centisecs
    In hundredths of a second, how old data must be to be written out next time a pdflush thread wakes to perform periodic writeback.

dirty_ratio
    As a percentage of total memory, the number of pages a process generates before it begins writeback of dirty data.

dirty_writeback_centisecs
    In hundredths of a second, how often the pdflush threads should wake up to write data back out to disk.

laptop_mode
    A Boolean value controlling laptop mode. See the following section.

The pdflush code lives in mm/pdflush.c and the writeback mechanism lives in mm/page-writeback.c and fs/fs-writeback.c.

Laptop Mode

Laptop mode is a special page writeback strategy intended to optimize battery life by minimizing hard disk activity and allowing hard drives to remain spun down as long as possible. It is configurable via /proc/sys/vm/laptop_mode. By default, this file contains a zero and laptop mode is disabled. Writing a one to this file enables laptop mode.

Laptop mode makes a single change to page writeback behavior. In addition to performing writeback of dirty pages when they grow too old, pdflush also piggybacks off any other physical disk I/O, flushing all dirty buffers to disk. In this manner, pdflush takes advantage of the fact that the disk was just spun up, ensuring that it will not cause the disk to spin up later.

This behavioral change makes the most sense when dirty_expire_centisecs and dirty_writeback_centisecs are set to large values—say, 10 minutes. With writeback so delayed, the disk is spun up infrequently—and when it does spin up, laptop mode ensures that the opportunity is well utilized.

Many Linux distributions automatically enable and disable laptop mode, as well as modify other pdflush tunables, when going on and off battery power. This allows a machine to benefit from laptop mode when on battery power and then automatically resume normal page writeback behavior when plugged into AC.

bdflush and kupdated

Prior to the 2.6 kernel, the job of the pdflush threads was performed by two other kernel threads, bdflush and kupdated.

The bdflush kernel thread performed background writeback of dirty pages when available memory was low. A set of thresholds was maintained, similar to pdflush, and bdflush was awakened via wakeup_bdflush() whenever free memory dropped below those thresholds.

Two main differences distinguish bdflush and pdflush. The first, which is discussed in the next section, is that there was always only one bdflush daemon, whereas the number of pdflush threads is dynamic. The second difference is that bdflush was buffer-based; it wrote back dirty buffers. Conversely, pdflush is page-based; it writes back whole pages. Of course, the pages may correspond to buffers, but the actual I/O unit is a full page, not a single buffer. This is beneficial because pages are a more general and common unit, and managing pages is consequently easier than managing buffers.

Because bdflush flushed buffers only when memory was low or the number of buffers grew too large, the kupdated thread was introduced to periodically write back dirty pages. It served a purpose identical to that of pdflush's wb_kupdate() function.

Both the bdflush and kupdated kernel threads, along with their functionality, were replaced by the pdflush threads.

Congestion Avoidance: Why We Have Multiple Threads

One of the major flaws in the bdflush solution was that bdflush consisted of a single thread. This led to possible congestion during heavy page writeback, in which the single bdflush thread would block on a single congested device queue (the list of I/O requests waiting to be submitted to disk) while other device queues sat relatively idle. If a system has multiple disks and the associated processing power, the kernel should be able to keep each disk busy. Unfortunately, even with plenty of data needing writeback, bdflush could become stuck handling a single queue and fail to keep all disks saturated. This occurs because the throughput of disks is a finite—and unfortunately rather small—number. If only a single thread is performing page writeback, that thread can easily spend a long time waiting for a single disk because disk throughput is such a limiting quantity. To mitigate this, the kernel needs to multithread page writeback. In this manner, no single device queue can become a bottleneck.

The 2.6 kernel solves this problem by allowing multiple pdflush threads to exist. Each thread individually flushes dirty pages to disk, allowing different pdflush threads to concentrate on different device queues.

The number of threads changes throughout the uptime of a system, according to a simple algorithm. If all existing pdflush threads have been busy for at least one second, a new pdflush thread is created. The total number of threads cannot exceed MAX_PDFLUSH_THREADS, which by default is eight. Conversely, if a pdflush thread has been asleep for more than one second, it is terminated. The number of threads never drops below MIN_PDFLUSH_THREADS, which by default is two. In this manner, the number of pdflush threads adjusts dynamically depending on the amount of page writeback and congestion. If all existing pdflush threads are busy writing back data, a new thread is created. This ensures that a single device queue is not congested while other, less busy, device queues sit around needing but not receiving data writeback. If the congestion diminishes, however, the number of pdflush threads is scaled back to conserve memory.
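
A hedged sketch of this sizing heuristic, adapted from the logic in mm/pdflush.c; the locking and bookkeeping are omitted, and the snippet is assumed to live inside each pdflush thread's main function (pdflush_list is the list of idle threads):

if (list_empty(&pdflush_list)) {
        /* no idle threads: every pdflush thread is busy writing back data */
        if (time_after(jiffies, last_empty_jifs + 1 * HZ) &&
            nr_pdflush_threads < MAX_PDFLUSH_THREADS)
                start_one_pdflush_thread();     /* grow, up to eight threads */
} else if (nr_pdflush_threads > MIN_PDFLUSH_THREADS) {
        struct pdflush_work *pdf;

        /* idle threads exist: if the oldest has slept for over a second, let this one exit */
        pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
        if (time_after(jiffies, pdf->when_i_went_to_sleep + 1 * HZ))
                return 0;                       /* shrink, down to two threads */
}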

This is all well and good, but what if each pdflush thread were to get hung up writing to the same, congested, queue? In that case, the performance of multiple pdflush threads would not be much improved over a single thread. The memory wasted, however, would be significantly greater. To mitigate this effect, the pdflush threads employ congestion avoidance: They actively try to write back pages whose queues are not congested. As a result, the pdflush threads spread out their work and refrain from merely hammering on the same busy device. When the pdflush threads are “busy”—and thus, a new thread is spawned—they are truly busy.

Because of the improvements in page writeback, including the introduction of pdflush, the 2.6 kernel is capable of keeping many more disks saturated than any earlier kernel. In the face of heavy activity, the pdflush threads can maintain high throughput across multiple disks.

To Make a Long Story Short

This chapter looked at the page cache and page writeback. You saw how the kernel performs all page I/O through the page cache and how writes are deferred in the page cache and eventually written out to disk via the pdflush gang of kernel threads.

Over the last few chapters, you have built a solid understanding of memory and filesystem management. Now let's segue over to the topic of modules, and see how the Linux kernel provides a modular and dynamic infrastructure for the run-time insertion and removal of kernel code.

 



[1] As you saw in Chapter 12, “The Virtual Filesystem,” it is not the read() and write() system calls that perform the actual page I/O operation, but the filesystem-specific methods specified by file->f_op->read() and file->f_op->write().

[2] For example, a physical page is 4KB in size on the x86 architecture, whereas a disk block on most filesystems can be as small as 512 bytes. Therefore, 8 blocks might fit in a single page. The blocks need not be contiguous because the files themselves might be laid out all over the disk.

[3] The kernel implementation is based on the radix priority search tree proposed by Edward M. McCreight in SIAM Journal on Computing, volume 14, number 2, pages 257–276, May 1985.

[4] The term “gang” is commonly used in computer science to denote a group of things that can operate in parallel.

[5] Yes, it is misnamed. It should be wakeup_pdflush(). See the following section for the heritage of this call.
