The virtual memory subsystem of Linux is, without any doubt, the most complex and performance-critical component of the whole kernel.
In previous chapters, we explained how the kernel handles dynamic memory by keeping track of free and busy page frames. We have also discussed how every process in User Mode has its own linear address space so that page frames can be assigned to the process at the very last possible moment. Finally, we have also described how dynamic memory is used to cache the data of the slow block devices.
In this chapter, we complete our description of the virtual memory subsystem by discussing page frame reclaiming. As we saw in Chapter 14, the cache systems grab more and more page frames but never release any of them. This is reasonable because cache systems don’t know if and when processes will reuse some of the cached data and are therefore unable to identify the portions of cache that should be released. Moreover, thanks to the demand paging mechanism described in Chapter 8, User Mode processes get page frames as long as they proceed with their execution; however, demand paging has no way to force processes to release the page frames whenever they are no longer used. Page frame reclaiming is a remedy for this problem.
The kernel developers’ worst nightmare is to encounter a situation in which no free page frame exists. When this happens, the kernel might be easily trapped in a deadly chain of memory requests—to free a page frame, the kernel must write its data to disk. However, to accomplish this operation, the kernel requires another page frame (for instance, to allocate the buffer heads for the I/O data transfer). Since no free page frame exists, no page frame can be freed. In this situation, there is just one solution: kill a victim User Mode process to reclaim the page frames it was using. Of course, even if this solution avoids a system crash, it is not very satisfying for the end users.
The goal of page frame reclaiming is to conserve a minimal pool of free page frames so that the kernel may safely recover from “low on memory” conditions. To do this, it must neither trash the disk caches nor penalize User Mode processes too much; otherwise, system performance will be greatly reduced. As a matter of fact, the hardest job of a developer working on the virtual memory subsystem consists of finding an algorithm that ensures acceptable performance both on desktop machines (on which memory requests are quite limited) and on high-end machines like large database servers (on which memory requests tend to be huge).
Unfortunately, finding a good page frame reclaiming algorithm is a rather empirical job, with very little support from theory. The situation is somewhat similar to evaluating the factors that determine the dynamic priority of a process: the main objective is to tune the parameters that achieve good system performance, without asking too many questions about why it works well. Often, it’s just a matter of “let’s try this approach and see what happens...” An unpleasant side effect of this empirical approach is that the code changes quickly, even in the even-numbered versions of Linux, which are supposed to be stable. The description that follows refers to Linux 2.4.18.
Before plunging into details, let’s give a brief overview of Linux page frame reclaiming. (Looking too closely at the trees’ leaves might lead us to miss the whole forest!)
Page frames can be freed in two ways:
By reclaiming an unused page frame within a cache (either a memory cache or a disk cache)
By reclaiming a page that belongs to a memory region of a process or to an IPC shared memory region (see Section 19.3.5)
Of course, the algorithm should take into consideration the various kinds of page frames. For instance, it is preferable to reclaim page frames from a memory cache rather than from a disk cache, because the latter pages include precious data obtained by costly accesses to block devices.
Moreover, the algorithm should keep track of the number of accesses to every page frame. If a page has not been accessed for a long time, the probability that it will be accessed in the near future is low; on the other hand, if a page has been accessed recently, the probability that it will continue to be accessed is high. This is just another application of the locality principle mentioned in Section 2.4.7.
Therefore, the page frame reclaiming algorithm is a blend of several heuristics:
Careful selection of the order in which caches are examined
Ordering of pages based on ageing (least recently used pages should be freed before pages accessed recently)
Distinction of pages based on the page state (for example, nondirty pages are better candidates than dirty pages for swapping out because they don’t have to be written to disk)
The main function that triggers page frame reclaiming is try_to_free_pages( ). It is invoked every time the kernel fails in allocating memory. For instance:

When the grow_buffers( ) function fails to allocate a new buffer page, or the create_buffers( ) function fails to allocate the buffer heads for a buffer page (see Section 14.2.2 and Section 14.2.3). In these cases, the kernel executes free_more_memory( ), which in turn invokes try_to_free_pages( ).
When the alloc_pages( ) function fails in allocating a group of page frames in a given list of memory zones (see Section 7.1.7). Recall that every memory zone descriptor includes the pages_min watermark, which specifies the number of page frames that should remain free to cope with the “low on memory” emergencies. If no zone in the list has enough free memory to satisfy the request while preserving the minimal pool of free page frames, the kernel invokes the balance_classzone( ) function, which in turn invokes try_to_free_pages( ).
When the kswapd kernel thread discovers that the number of free page frames in some memory zone falls below the pages_low watermark (see the later Section 16.7.7).

The core of the try_to_free_pages( ) function is the shrink_caches( ) function: it receives as a parameter a “goal” (namely, a given number of page frames to be reclaimed) and terminates as soon as it has reached the goal, if possible.

To help shrink_caches( ) do its job, all pages in dynamic memory are grouped into two lists called the “active list” and the “inactive list”; they are also collectively denoted as LRU lists. The former list tends to include the pages that have been accessed recently, while the latter tends to include the pages that have not been accessed for some time. Clearly, pages should be stolen from the inactive list, although some percolation between the two lists is performed from time to time.
The shrink_caches( ) function invokes, in turn, the following functions:

kmem_cache_reap( )
Removes empty slabs from the slab cache.

refill_inactive( )
Moves pages from the active list to the inactive list, and vice versa.

shrink_cache( )
Tries to free page frames by writing to disk inactive pages included in the page cache.

shrink_dcache_memory( )
Removes entries from the dentry cache.

shrink_icache_memory( )
Removes entries from the inode cache.
Let’s now discuss in greater detail the various components of the page frame reclaiming algorithm.
The active list and the inactive list of pages are the core data structures of the page frame reclaiming algorithm. The heads of these two doubly linked lists are stored, respectively, in the active_list and inactive_list variables. The nr_active_pages and nr_inactive_pages variables store the number of pages in the two lists. The pagemap_lru_lock spin lock protects the two lists against concurrent accesses in SMP systems.

If a page belongs to an LRU list, its PG_lru flag in the page descriptor is set. Moreover, if the page belongs to the active list, the PG_active flag is set, while if it belongs to the inactive list, the PG_active flag is cleared. The lru field of the page descriptor stores the pointers to the next and previous elements in the LRU list.
Several auxiliary functions and macros are available to handle the LRU lists:
add_page_to_active_list
Sets the PG_active flag, adds the page to the head of the active list, and increases nr_active_pages.

add_page_to_inactive_list
Adds the page to the head of the inactive list and increases nr_inactive_pages.

del_page_from_active_list
Removes the page from the active list, clears the PG_active flag, and decreases nr_active_pages.

del_page_from_inactive_list
Removes the page from the inactive list and decreases nr_inactive_pages.

activate_page_nolock( ) and activate_page( )
If the page is in the inactive list, moves it into the active list by executing del_page_from_inactive_list and then add_page_to_active_list. The activate_page( ) function also acquires the pagemap_lru_lock spin lock before moving the page.

lru_cache_add( )
If the page is not included in a LRU list, sets the PG_lru flag, acquires the pagemap_lru_lock spin lock, and executes add_page_to_inactive_list to insert the page in the inactive list.

__lru_cache_del( ) and lru_cache_del( )
If the page is included in a LRU list, clears the PG_lru flag and executes either del_page_from_active_list or del_page_from_inactive_list, according to the value of the PG_active flag. The lru_cache_del( ) function also acquires the pagemap_lru_lock spin lock before removing the page.
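As a concrete illustration, the bookkeeping performed by these helpers can be sketched in user space. The struct layout, the flag bit values, and the list implementation below are simplified stand-ins for the kernel’s real definitions, and all locking is omitted:

```c
#define PG_lru     (1 << 0)   /* stand-in bit positions, not the kernel's */
#define PG_active  (1 << 1)

struct page {
    unsigned long flags;
    struct page *next, *prev;      /* stands in for the lru field */
};

/* List heads double as sentinel nodes of circular doubly linked lists. */
static struct page active_list   = { 0, &active_list,   &active_list };
static struct page inactive_list = { 0, &inactive_list, &inactive_list };
static int nr_active_pages, nr_inactive_pages;

static void list_add_head(struct page *p, struct page *head)
{
    p->next = head->next; p->prev = head;
    head->next->prev = p; head->next = p;
}

static void list_del_entry(struct page *p)
{
    p->prev->next = p->next; p->next->prev = p->prev;
}

static void add_page_to_active_list(struct page *p)
{
    p->flags |= PG_active;
    list_add_head(p, &active_list);
    nr_active_pages++;
}

static void add_page_to_inactive_list(struct page *p)
{
    list_add_head(p, &inactive_list);
    nr_inactive_pages++;
}

static void del_page_from_active_list(struct page *p)
{
    list_del_entry(p);
    p->flags &= ~PG_active;
    nr_active_pages--;
}

static void del_page_from_inactive_list(struct page *p)
{
    list_del_entry(p);
    nr_inactive_pages--;
}

static void activate_page(struct page *p)   /* locking omitted */
{
    if (!(p->flags & PG_active)) {
        del_page_from_inactive_list(p);
        add_page_to_active_list(p);
    }
}

static void lru_cache_add(struct page *p)
{
    p->flags |= PG_lru;
    add_page_to_inactive_list(p);
}
```

Note how a page always enters the LRU via the inactive list and is only promoted later; this mirrors the behavior described above.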
The kernel collects the pages that were recently accessed in the active list so that it will not scan them when looking for a page frame to reclaim. Conversely, the kernel collects the pages that have not been accessed for a long time in the inactive list. Of course, pages should move from the inactive list to the active list and back, according to whether they are being accessed.
Clearly, two page states (“active” and “inactive”) are not sufficient to describe all possible access patterns. For instance, suppose a logger process writes some data in a page once every hour. Although the page is “inactive” for most of the time, the access makes it “active,” thus denying the reclaiming of the corresponding page frame, even if it is not going to be accessed for an entire hour. Of course, there is no general solution to this problem because the kernel has no way to predict the behavior of User Mode processes; however, it seems reasonable that pages should not change their status on every single access.
The PG_referenced flag in the page descriptor is used to double the number of accesses required to move a page from the inactive list to the active list; it is also used to double the number of “missing accesses” required to move a page from the active list to the inactive list (see below). For instance, suppose that a page in the inactive list has the PG_referenced flag set to 0. The first page access sets the value of the flag to 1, but the page remains in the inactive list. The second page access finds the flag set and causes the page to be moved into the active list. If, however, the second access does not occur within a given time interval after the first one, the page frame reclaiming algorithm may reset the PG_referenced flag.

As shown in Figure 16-4, the kernel uses the mark_page_accessed( ) and refill_inactive( ) functions to move the pages across the LRU lists. In the figure, the LRU list including the page is specified by the status of the PG_active flag.

Whenever the kernel must mark a page as accessed, it invokes the mark_page_accessed( ) function. This happens every time the kernel determines that a page is being referenced either by a User Mode process, a filesystem layer, or a device driver. For instance, mark_page_accessed( ) is invoked in the following cases:
When loading an anonymous page of a process on demand (performed by the do_anonymous_page( ) function in Section 8.4.3).

When reading a block from disk (performed by the bread( ) function in Section 13.4.8).

When loading on demand a page of a memory mapped file (performed by the filemap_nopage( ) function in Section 15.2.4).

When reading a page of data from a file (performed by the do_generic_file_read( ) function in Section 15.1.1).

When swapping in a page (see the earlier Section 16.6.1).

When the kernel finds the Accessed flag set in the Page Table entry while searching for a page to be swapped out (see the earlier Section 16.5.1).

When the kernel reads a page of data from a disk device (performed by the ext2_get_page( ) function in Chapter 17).
The mark_page_accessed( ) function executes the following code fragment:

    if (PageActive(page) || !PageReferenced(page))
        SetPageReferenced(page);
    else {
        activate_page(page);
        ClearPageReferenced(page);
    }

As shown in Figure 16-4, the function moves the page from the inactive list to the active list only if the PG_referenced flag is set before the invocation.
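The two-step promotion is easy to check with a user-space rendition of the fragment. The flag bit values and the activate_page( ) stub below are simplified stand-ins, not the kernel’s definitions:

```c
#define PG_active     (1 << 1)   /* stand-in bit positions */
#define PG_referenced (1 << 2)

struct page { unsigned long flags; };

static int activations;   /* counts promotions to the active list */

static void activate_page(struct page *p)   /* stub: just flips the flag */
{
    p->flags |= PG_active;
    activations++;
}

/* User-space rendition of the mark_page_accessed( ) fragment above. */
static void mark_page_accessed(struct page *p)
{
    if ((p->flags & PG_active) || !(p->flags & PG_referenced))
        p->flags |= PG_referenced;      /* SetPageReferenced */
    else {
        activate_page(p);
        p->flags &= ~PG_referenced;     /* ClearPageReferenced */
    }
}
```

Running two consecutive accesses against a fresh inactive page shows that the first only sets PG_referenced, while the second actually activates the page.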
The kernel periodically checks the status of the pages in the active list by executing the refill_inactive( ) function. Starting from the bottom of the active list (the older pages in the list), the function checks whether the PG_referenced flag of each page is set. If it is, the function clears the flag and moves the page into the first position of the active list; if it isn’t, the function moves the page into the first position of the inactive list. The logic in the function is as follows:

    if (PageReferenced(page)) {
        ClearPageReferenced(page);
        list_del(&page->lru);
        list_add(&page->lru, &active_list);
    } else {
        del_page_from_active_list(page);
        add_page_to_inactive_list(page);
        SetPageReferenced(page);
    }

The refill_inactive( ) function does not scan the pages in the inactive list; hence, the PG_referenced flag of a page is never cleared as long as the page remains in the inactive list.
The try_to_free_pages( ) function is the main function that triggers the reclaiming of page frames. It receives as parameters:

classzone
The memory zone containing the page frames to be reclaimed.

gfp_mask
A set of flags whose meaning is exactly the same as the corresponding parameter of the alloc_pages( ) function (see Section 7.1.5).

order
Not used.

The goal of the function is to free SWAP_CLUSTER_MAX page frames (usually, 32) by repeatedly invoking the shrink_caches( ) function, each time with a higher priority than the previous invocation. The try_to_free_pages( ) function is thus essentially equivalent to the following code fragment:

    int priority = DEF_PRIORITY;
    int nr_pages = SWAP_CLUSTER_MAX;
    if (current->flags & PF_NOIO)
        gfp_mask &= ~(__GFP_IO | __GFP_HIGHIO | __GFP_FS);
    do {
        nr_pages = shrink_caches(classzone, priority, gfp_mask, nr_pages);
        if (nr_pages <= 0)
            return 1;
    } while (--priority);
    out_of_memory( );
try_to_free_pages( ) clears the __GFP_IO, __GFP_HIGHIO, and __GFP_FS bits in the gfp_mask parameter if the current process has the PF_NOIO flag set. This flag is set whenever the kernel must ensure that the page frame reclaiming algorithm never triggers I/O data transfers; it is currently used only by kernel threads of the loop device driver, which allows User Mode processes to handle regular files as if they were disk block partitions.

The loop is repeated at most DEF_PRIORITY times (usually six times). The value of the decreasing priority loop index is passed to the shrink_caches( ) function. Each time, shrink_caches( ) tries harder to reclaim a page frame because lower values correspond to higher priorities.

If DEF_PRIORITY iterations are not enough to reclaim SWAP_CLUSTER_MAX page frames, the kernel is in serious trouble. It has just one last resort: killing a User Mode process to free all its page frames. This operation is performed by the out_of_memory( ) function. Loosely speaking, the victim process is selected from those that have the smallest runtimes, ruling out those that have superuser privileges and those that perform direct I/O operations (see Section 15.3).
The shrink_caches( ) function invokes several auxiliary functions in a fixed order to reclaim page frames in different memory subsystems. One of the functions invoked is called shrink_cache( ) and should not be confused with the parent function shrink_caches( ). The shrink_caches( ) function acts on the following parameters:
classzone
The memory zone that contains the page frames to be freed.
priority
The “priority” of this trial: it tells how drastic page frame reclaiming must be.
gfp_mask
Memory allocator flags, specifying the type of page frames to be freed, as well as what the kernel is allowed to do in pursuit of its goal (block the current process, trigger I/O transfers, and so on).
nr_pages
The goal—i.e., the number of page frames to be freed.
The function returns the difference between nr_pages and the number of page frames that have been effectively reclaimed; if more than nr_pages page frames have been freed, the function returns 0. It executes the following actions:
Invokes kmem_cache_reap( ) to reclaim page frames from the slab allocator caches (see Section 7.2.7). If kmem_cache_reap( ) succeeds in freeing at least nr_pages page frames, returns 0.

Invokes the refill_inactive( ) function to move some pages from the active list to the inactive list. As described in the earlier Section 16.7.2, refill_inactive( ) clears the PG_referenced flags of the pages at the bottom of the active list, and moves the pages that have not been accessed since the previous execution of shrink_caches( ). The number of pages to be moved is passed as a parameter to refill_inactive( ); it is computed as:

    ratio = nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
The rationale behind the formula is to keep the size of the active list roughly equal to two-thirds of the page cache size (that’s another rule of thumb, of course).
Invokes shrink_cache( ) to try to reclaim nr_pages page frames from the inactive list (see the next section). If the function succeeds in freeing all the required page frames, it returns 0 (all requested page frames have been freed).
At this point, shrink_caches( ) has lost any hope of reaching its target in the current execution. However, it tries to free small objects in several disk caches so that the future invocations of the function will likely succeed in releasing the page frames storing them. Thus, the function invokes shrink_dcache_memory( ) to remove dentry objects from the dentry cache (see the later Section 16.7.6).

Invokes shrink_icache_memory( ) to remove inode objects from the inode cache (see Section 16.7.6 later in this chapter).

If the kernel has support for disk quota, the function invokes shrink_dqcache_memory( ) to remove objects from the disk quota cache. (We won’t discuss disk quota for lack of space.)
Returns the number of page frames still to be freed.
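The overall sequence can be condensed into a runnable user-space sketch. The helper functions below are mocked with fixed yields, and the signature is simplified (the real shrink_caches( ) also receives the classzone parameter); the intent is only to show the order of the calls, the ratio formula, and the early exits:

```c
/* Mock state: list sizes and per-helper yields (arbitrary test values). */
static int nr_active_pages = 400, nr_inactive_pages = 199;
static int reap_yield = 8, cache_yield = 16;
static int refill_goal, dcache_calls, icache_calls;

static int kmem_cache_reap(unsigned int gfp_mask)
{ (void)gfp_mask; return reap_yield; }

static void refill_inactive(int goal) { refill_goal = goal; }

static int shrink_cache(int nr_pages, int priority, unsigned int gfp_mask)
{ (void)priority; (void)gfp_mask; return nr_pages - cache_yield; }

static void shrink_dcache_memory(int p, unsigned int g)
{ (void)p; (void)g; dcache_calls++; }

static void shrink_icache_memory(int p, unsigned int g)
{ (void)p; (void)g; icache_calls++; }

/* Condensed sketch of the call sequence described above. */
static int shrink_caches(int priority, unsigned int gfp_mask, int nr_pages)
{
    nr_pages -= kmem_cache_reap(gfp_mask);       /* 1. reap empty slabs   */
    if (nr_pages <= 0)
        return 0;
    refill_inactive(nr_pages * nr_active_pages / /* 2. the ratio formula  */
                    ((nr_inactive_pages + 1) * 2));
    nr_pages = shrink_cache(nr_pages, priority,  /* 3. the real work      */
                            gfp_mask);
    if (nr_pages <= 0)
        return 0;
    shrink_dcache_memory(priority, gfp_mask);    /* 4-5. last-ditch trims */
    shrink_icache_memory(priority, gfp_mask);
    return nr_pages;                             /* frames still wanted   */
}
```

With the mock values above, a call with a goal of 32 reaps 8 frames from the slabs, asks refill_inactive( ) to move 24 pages (24 * 400 / 400), reclaims 16 more from the inactive list, and returns 8, having also trimmed the dentry and inode caches.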
The shrink_cache( ) function acts on the same parameters as shrink_caches( ): nr_pages, classzone, gfp_mask, and priority. It looks for page frames to be reclaimed in the inactive list. Since the last inserted elements are placed near the head of the list, the function starts scanning from the tail of the list and moves backward.
The function achieves its goal by:
Freeing to the buddy system the page frames that do not belong to any process
Swapping out pages belonging to processes if there are too many of them in the scanned portion of the inactive list
The priority parameter controls the size of the portion of the inactive list to be scanned in this invocation of shrink_cache( ). If priority is equal to 6 (DEF_PRIORITY, the lowest priority), the function scans at most one-sixth of the list; as priority decreases, the function scans more and more of the list until it reaches 1 (the highest priority), whereupon it scans the whole list.

The function starts by acquiring the pagemap_lru_lock spin lock and then loops over the pages of the inactive list backwards until either the chosen portion of the list is scanned or the number of elements specified by priority is reached. For every scanned page, the function performs the following actions:
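In the 2.4 source, this fraction comes from a simple integer division (max_scan is the local variable name used there):

```c
/* Number of inactive-list entries examined in one invocation of
   shrink_cache( ), as a function of priority (6 = lowest, 1 = highest). */
static int max_scan(int nr_inactive_pages, int priority)
{
    return nr_inactive_pages / priority;
}
```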
If the need_resched field of the current process is set, temporarily relinquishes the CPU by releasing the pagemap_lru_lock spin lock and invoking schedule( ). When executing again, the function reacquires the spin lock and continues.

Moves the page from the tail to the head of the inactive list. This ensures that inactive pages are considered in a round-robin fashion in successive invocations of shrink_cache( ).
Checks whether the usage counter of the page is null; if so, it continues with the next page at the tail of the list. Ideally, any page with a null usage counter should belong to the buddy system; however, to free a page frame, first its usage counter is decremented and then the page frame is released to the buddy system. Therefore, there is a small time window in which the page frame reclaiming algorithm may see the page freed.
Checks whether the page belongs to the memory zone specified by the classzone parameter; if not, the function stops working on this page and continues with the next page at the tail of the list.

Checks whether the page is not a buffer page and whether its usage counter is greater than one or the page doesn’t have an image on disk (mapping field set to NULL). In this case, the page frame cannot be released to the buddy system because the usage counter indicates that there are processes that still reference the page. The function performs the following substeps:
Increments a local counter storing the number of scanned pages that cannot be freed.
If the counter exceeds a threshold value, releases the pagemap_lru_lock spin lock, invokes swap_out( ) to swap out some process pages (see the earlier Section 16.5), and returns the number of page frames yet to be freed. The threshold value is equal to the smaller of two values: one-tenth of the number of pages to be scanned, and 2^(10-priority) times the number of page frames to be released (the nr_pages parameter).
Otherwise, if the counter does not exceed the threshold value, the function continues with the next page at the tail of the inactive list.
If the function reaches this point, the page frame can be released to the buddy system. The function tries to lock the page. If the PG_locked flag of the page is already set, it executes the following substeps:

If both the PG_launder flag of the page and the __GFP_FS bit in the gfp_mask parameter are set, invokes wait_on_page( ) to sleep until the page is unlocked. The PG_launder flag is set whenever the page is involved in an I/O data transfer triggered by the shrink_cache( ) function itself.

Continues with the next page at the tail of the inactive list.
Now the page is locked. Checks whether the page is dirty (PG_dirty flag set), the page has an image on disk (mapping field not NULL), and it is owned only by the page cache; that is, whether the underlying page frame is effectively freeable. If all these conditions hold, and moreover the __GFP_FS bit in the gfp_mask parameter is set, the function updates the disk image by performing the following actions:

Clears the PG_dirty flag.

Sets the PG_launder flag so that a future invocation of shrink_cache( ) waits for the completion of the I/O data transfer.

Increments the page usage counter (fail-safe mechanism) and releases the pagemap_lru_lock spin lock.

Invokes the writepage method of the address_space object of the page. As described in Section 15.2.5, the method activates the I/O data transfer of the page contents to the disk.

Decrements the page usage counter and acquires the pagemap_lru_lock spin lock again.

Continues with the next page at the tail of the inactive list.
If the page is a buffer page (the buffers field is not NULL), the function tries to free the buffers contained in the page. In particular, it performs the following substeps:

Releases the pagemap_lru_lock spin lock and increments the page usage counter (a fail-safe mechanism).

Invokes the try_to_release_page( ) function. In turn, this function:

Executes the releasepage method of the corresponding address_space object, if it is defined, to release the metadata associated with the buffers in journaling filesystems.

Invokes the try_to_free_buffers( ) function to free the buffers in the page, provided they are referenced only by the buffer cache (see Section 14.2.2).

If try_to_release_page( ) failed in releasing all the buffers in the page, the function unlocks the page, decrements its usage counter to undo the increment done in Step 8a, and continues with the next page at the tail of the inactive list.

Otherwise, if try_to_release_page( ) succeeded in releasing all the buffers in the page, the function tries to release the page frame itself. In particular, if the page is anonymous (i.e., has no image on disk), the function acquires the pagemap_lru_lock spin lock, unlocks the page, removes the page from the inactive list, and releases the page frame to the buddy system. If the function has released the number of page frames that was its goal, it also releases the spin lock and returns the value 0; otherwise, it continues with Step 9.
The buffer page is included in the page cache because it has an image on disk. The function decrements the page usage counter to undo the increment in Step 8a and acquires the pagemap_lru_lock spin lock. It then continues with the following step.
Acquires the pagecache_lock spin lock.

If the page has no image on disk or if the page is still referenced by some process, the function releases the pagecache_lock spin lock, unlocks the page, and jumps back to Step 5a.
If the function reaches this point, the page has an image on disk and is freeable because no process references it and it no longer includes any buffers. Checks whether the page is dirty (i.e., the PG_dirty flag is set); in this case, the page frame cannot be freed, because otherwise its data would be lost. The function releases the pagecache_lock spin lock, unlocks the page, and continues with the next page at the tail of the inactive list.
If the function reaches this point, the page has an image on disk, is freeable, and is clean, so the page frame can effectively be reclaimed. If the page belongs to the swap cache, the function gets the swapped-out page identifier from the index field, invokes delete_from_swap_cache( ) to remove the page descriptor from the swap cache, releases the pagecache_lock spin lock, and then invokes swap_free( ) to decrement the usage counter of the page slot.
Otherwise, if the page does not belong to the swap cache, the function invokes remove_inode_page( ) to remove it from the page cache (see Section 14.1.3) and releases the pagecache_lock spin lock.
Invokes __lru_cache_del( ) to remove the page from the inactive list.
Unlocks the page.
Releases the page frame to the buddy system.
If the goal on the number of page frames to be freed is reached, the function releases the spin lock and returns the value 0; otherwise, it continues with the next page at the tail of the inactive list.
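The swap-out trigger used in the scan (the threshold on scanned-but-unfreeable pages) can be sketched as a small function. The function name is a stand-in and the exact expressions should be treated as an approximation of the 2.4 source rather than a verbatim copy; the min( ) of the two quantities follows the description above:

```c
/* Threshold on the number of scanned pages that cannot be freed, beyond
   which shrink_cache( ) gives up and falls back to swap_out( ). */
static int mapped_threshold(int max_scan, int nr_pages, int priority)
{
    int tenth   = max_scan / 10;                 /* one-tenth of pages scanned */
    int boosted = nr_pages << (10 - priority);   /* 2^(10-priority) * nr_pages */
    return tenth < boosted ? tenth : boosted;
}
```

For a short scan the one-tenth term dominates, while for a long scan at low priority the 2^(10-priority) term caps the tolerance.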
Dentry and inode objects themselves aren’t big, but freeing one of them has a cascading effect that can ultimately free a lot of memory by releasing several data structures. For this reason, the shrink_caches( ) function invokes two special purpose functions to reclaim page frames from the dentry and inode caches.

The shrink_dcache_memory( ) function is invoked to remove dentry objects from the dentry cache. Clearly, only dentry objects not referenced by any process (defined as unused dentries in Section 12.2.4) can be removed. Since the dentry cache objects are allocated through the slab allocator, the shrink_dcache_memory( ) function may lead some slabs to become free, causing some page frames to be consequently reclaimed by kmem_cache_reap( ).

Moreover, the dentry cache acts as a controller of the inode cache. Therefore, when a dentry object is released, the buffer storing the corresponding inode becomes unused and the shrink_mmap( ) function may release the corresponding buffer page.
The shrink_dcache_memory( ) function, which acts on the two well-known parameters priority and gfp_mask, performs the following steps:
Returns 0 if the kernel is not allowed to trigger operations on filesystem on-disk data structures (the __GFP_FS bit is cleared in the gfp_mask parameter).
Otherwise, invokes prune_dcache( ), passing it a parameter that is the ratio between the number of unused dentries and the value of priority.

Invokes kmem_cache_shrink( ) on the dentry cache to release frames that contained objects freed in the previous step.

Returns 0.
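Put together, the function is little more than a gate plus two calls. The sketch below mocks the helpers so the control flow can be run in user space; the gfp bit value and the counter names are stand-ins, not the kernel’s:

```c
#define GFP_FS_ALLOWED (1 << 0)   /* stand-in for the real gfp_mask bit */

static int nr_unused_dentries = 120;   /* pretend 120 dentries are unused */
static int last_prune_goal = -1;
static int shrink_calls;

static void prune_dcache(int goal) { last_prune_goal = goal; }
static void kmem_cache_shrink_dentries(void) { shrink_calls++; }

/* Condensed sketch of the three steps described above. */
static int shrink_dcache_memory(int priority, unsigned int gfp_mask)
{
    if (!(gfp_mask & GFP_FS_ALLOWED))   /* not allowed to touch the fs */
        return 0;
    prune_dcache(nr_unused_dentries / priority);
    kmem_cache_shrink_dentries();
    return 0;
}
```

With 120 unused dentries and the lowest priority (6), the pruning goal works out to 20 objects; at higher priorities the goal grows.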
The prune_dcache( ) function receives a parameter that specifies the number of objects to free. It scans the list of unused dentries until it reaches the requested number of freed objects or until the whole list is scanned. On each object that wasn’t recently referenced, the function calls prune_one_dentry( ).

The prune_one_dentry( ) function, in turn, performs the following operations:
Removes the dentry object from the dentry hash table, from the list of dentry objects in its parent directory, and from the list of dentry objects of the owner inode.
Decrements the usage counter of the dentry’s inode by invoking the d_iput dentry method, if defined, or the iput( ) function.

Invokes the d_release method of the dentry object, if defined.

Invokes kmem_cache_free( ) to release the object to the slab allocator (see Section 7.2.13).
Decrements the usage counter of the parent directory.
The shrink_icache_memory( ) function is invoked to remove inode objects from the inode cache. It is very similar to the shrink_dcache_memory( ) function just described. It checks the __GFP_FS bit in the gfp_mask parameter, and then invokes prune_icache( ), passing to it the number of inodes to be freed, namely the ratio between the number of unused inodes and the value of the priority parameter. Finally, it invokes the kmem_cache_shrink( ) function to release to the buddy system each page frame that has been completely freed by prune_icache( ).
The prune_icache( ) function, in turn, scans the inode_unused list (see Section 12.2.2), looking for inodes that can be freed. Basically, a good inode should be clean, it should have a null usage counter, its I_FREEING, I_CLEAR, and I_LOCK flags should be cleared, and it should not include any buffer head in its lists of dirty buffers. Any such inode is freed by invoking the clear_inode( ) and kmem_cache_free( ) functions.
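The “good inode” test can be phrased as a predicate. The flag bit values below are arbitrary stand-ins, and the dirty-buffer-list check is reduced to a boolean field:

```c
#define I_DIRTY   (1 << 0)   /* stand-in flag values, not the kernel's */
#define I_LOCK    (1 << 1)
#define I_FREEING (1 << 2)
#define I_CLEAR   (1 << 3)

struct inode_sketch {
    unsigned int state;
    int count;               /* usage counter */
    int has_dirty_buffers;   /* stands in for the dirty-buffer list check */
};

/* Mirrors the conditions listed above: clean, unused, not being
   freed/cleared/locked, and with no dirty buffers attached. */
static int inode_is_freeable(const struct inode_sketch *i)
{
    return i->count == 0 &&
           !(i->state & (I_DIRTY | I_FREEING | I_CLEAR | I_LOCK)) &&
           !i->has_dirty_buffers;
}
```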
If prune_icache( ) fails in releasing the requested number of inode objects, it schedules the execution of the try_to_sync_unused_inodes( ) function, which flushes some unused inodes to disk. Since this function might block, it is executed by the keventd kernel thread (see Section 3.4.2).
The kswapd kernel thread is another kernel mechanism that activates the reclaiming of page frames. Why is it necessary? Is it not sufficient to invoke try_to_free_pages( ) when free memory becomes really scarce and another memory allocation request is issued?
Unfortunately, this is not the case. Some memory allocation requests are performed by interrupt and exception handlers, which cannot block the current process waiting for a page frame to be freed; moreover, some memory allocation requests are done by kernel control paths that have already acquired exclusive access to critical resources and that, therefore, cannot activate I/O data transfers. In the infrequent case in which all memory allocation requests are done by such sorts of kernel control paths, the kernel is never able to free memory.
kswapd also has a beneficial effect on system performance by keeping memory free in what would otherwise be idle time for the machine; processes can thus get their pages much faster.
The kswapd kernel thread is activated when a zone includes a number of free page frames that is below a “warning” threshold. Essentially, the kernel starts to reclaim some page frames in order to avoid much more dramatic “low on memory” conditions.
The “warning” threshold is stored in the pages_low field of the memory zone descriptor. Recall from Section 7.1.2 that this descriptor also includes the pages_min field (a threshold that specifies the minimum number of free page frames that should always be preserved) and the pages_high field (a threshold that specifies the “safe” number of free page frames above which page frame reclaiming should be stopped). Usually, the pages_min field is set to the ratio between the size of the memory zone in pages and the number 128, the pages_low field is set to twice pages_min, and the pages_high field is set to three times the value of pages_min.[114]
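For a concrete example, the watermark arithmetic just described (including the 20..255 clamp on pages_min mentioned in the footnote) works out as follows for a hypothetical zone of 16,384 page frames (64 MB with 4 KB pages):

```c
/* Watermarks for one memory zone, following the ratios given in the text. */
static unsigned long pages_min, pages_low, pages_high;

static void set_zone_watermarks(unsigned long zone_pages)
{
    pages_min = zone_pages / 128;
    if (pages_min < 20)          /* clamp described in the footnote */
        pages_min = 20;
    else if (pages_min > 255)
        pages_min = 255;
    pages_low  = pages_min * 2;
    pages_high = pages_min * 3;
}
```

For the 16,384-frame zone this yields pages_min = 128, pages_low = 256, and pages_high = 384; a very small zone would be clamped up to a pages_min of 20.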
The kswapd kernel thread is usually inserted into the kswapd_wait wait queue. As mentioned in Section 7.1.5, if the alloc_pages( ) function fails to find a memory zone that has more than pages_low page frames, it sets the need_balance field of the first checked memory zone and wakes up kswapd.
The kswapd kernel thread executes the kswapd( ) function, which at each activation performs the following operations:

Sets the state of current to TASK_RUNNING and removes it from the kswapd_wait wait queue.

Invokes kswapd_balance( ) (see below).

Invokes run_task_queue( ) on the tq_disk task queue to activate the strategy routines of the block device drivers (see Section 13.4.6). This relieves pressure on memory by starting scheduled I/O operations and thus eventually allowing the kernel to free asynchronous buffer heads and pages in the page cache.

Sets the state of current to TASK_INTERRUPTIBLE and adds it to the kswapd_wait wait queue.

Checks the need_balance flags of every memory zone descriptor (see Section 7.1.2). If none are set, invokes schedule( ) to put the kswapd kernel thread to sleep. When executed again, it jumps to Step 1.
The kswapd_balance( ) function checks the need_balance flag of all existing memory zones. For any memory zone having the flag set, the function invokes try_to_free_pages( ) to start reclaiming page frames. As we know, the latter function might not succeed in freeing SWAP_CLUSTER_MAX page frames; in this case, a process is killed. When this happens, kswapd_balance( ) suspends itself for one second, thus letting the kernel reclaim the page frames owned by the victim.

The kswapd_balance( ) function keeps invoking try_to_free_pages( ) on a memory zone until the number of free page frames in the zone (or in one of the other zones in the node) rises above the threshold stored in the pages_high field of the memory zone descriptor.
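That retry loop can be sketched as follows. try_to_free_pages( ) is mocked to always reclaim a full cluster, which is optimistic; as noted above, the real function may fail and trigger out_of_memory( ) instead:

```c
#define SWAP_CLUSTER_MAX 32

struct zone_sketch {
    int free_pages;
    int pages_high;
    int need_balance;
};

static void try_to_free_pages_mock(struct zone_sketch *z)
{
    z->free_pages += SWAP_CLUSTER_MAX;   /* pretend the reclaim succeeded */
}

/* Keep reclaiming until the zone's free pages rise above pages_high,
   then clear the need_balance flag, as described in the text. */
static void kswapd_balance_zone(struct zone_sketch *z)
{
    if (!z->need_balance)
        return;
    do {
        try_to_free_pages_mock(z);
    } while (z->free_pages < z->pages_high);
    z->need_balance = 0;
}
```

Starting from 100 free frames with a pages_high of 384, the mock loop runs nine times (nine clusters of 32) before the zone is considered balanced.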
[114] A system administrator may set new values for the pages_min fields, and hence also new values for the pages_low and pages_high fields, by passing ad-hoc ratios for every memory zone with the memfrac kernel boot option. However, the minimal allowed value for pages_min is 20 and the maximum allowed value is 255.