The pages swapped out from memory are stored in a swap area
,
which may be implemented either as a disk partition of its own or as
a file included in a larger partition. Several different swap areas
may be defined, up to a maximum number specified by the
MAX_SWAPFILES
macro (usually set to 32).
Having multiple swap areas allows a system administrator to spread a lot of swap space among several disks so that the hardware can act on them concurrently; it also lets swap space be increased at runtime without rebooting the system.
Each swap area consists of a sequence of page slots
: 4,096-byte blocks used to contain a
swapped-out page. The first page slot of a swap area is
used to persistently store some information about the swap area; its
format is described by the swap_header
union
composed of two structures, info
and
magic
. The
magic
structure provides a string that
marks part of the disk unambiguously as a swap area; it consists of
just one field, magic.magic
, which contains a
10-character “magic” string. The
magic
structure essentially allows the kernel to
unambiguously identify a file or a partition as a swap area; the text
of the string depends on the swapping algorithm version:
SWAP-SPACE
for Version 1 or
SWAPSPACE2
for Version 2. The field is always
located at the end of the first page slot.
The info
structure includes the following
fields:
info.bootbits
Not used by the swapping algorithm; this field corresponds to the first 1,024 bytes of the swap area, which may store partition data, disk labels, and so on.
info.version
Swapping algorithm version.
info.last_page
Last page slot that is effectively usable.
info.nr_badpages
Number of defective page slots.
info.padding[125]
Padding bytes.
info.badpages[1]
Up to 637 numbers specifying the location of defective page slots.
The data stored in a swap area is meaningful as long as the system is on. When the system is switched off, all processes are killed, so all data stored by processes in swap areas is discarded. For this reason, swap areas contain very little control information; essentially, the swap area type and the list of defective page slots. This control information easily fits in a single 4 KB page.
Usually, the system administrator creates a swap partition when
creating the other partitions on the Linux system, and then uses the
mkswap
command to set up the disk area as a
new swap area. That command initializes the fields just described
within the first page slot. Since the disk may include some bad
blocks, the program also examines all other page slots to locate the
defective ones. But executing the
mkswap
command leaves the
swap area in an inactive state. Each swap area
can be activated in a script file at system boot or dynamically after
the system is running. An initialized swap area is considered
active
when it effectively represents an
extension of the system RAM (see Section 16.2.3 later in this chapter).
Each
active swap area has its own swap_info_struct
descriptor in memory. The fields of the descriptor are illustrated in
Table 16-1.
Table 16-1. Fields of a swap area descriptor
Type |
Field |
Description |
---|---|---|
|
|
Swap area flags |
|
|
Device number of the swap disk partition |
|
|
Swap area descriptor spin lock |
|
|
Dentry of the file or device file |
|
|
Mounted filesystem descriptor of the file or device file |
|
|
Pointer to array of counters, one for each swap area page slot |
|
|
First page slot to be scanned when searching for a free one |
|
|
Last page slot to be scanned when searching for a free one |
|
|
Next page slot to be scanned when searching for a free one |
|
|
Number of free page slot allocations before restarting from the beginning |
|
|
Swap area priority |
|
|
Number of usable page slots |
|
|
Size of swap area in pages |
|
|
Pointer to next swap area descriptor |
The flags
field includes two overlapping subfields:
SWP_USED
1 if the swap area is active; 0 if it is nonactive.
SWP_WRITEOK
This 2-bit field is set to 3 if it is possible to write into the swap
area and to 0 otherwise; since the least-significant bit of this
field coincides with the bit used to implement
SWP_USED
, a swap area can be written only if it is
active. The kernel is not allowed to write in a swap area when it is
being activated or deactivated.
The swap_map
field points to an array of counters,
one for each swap area page slot. If the counter is equal to 0, the
page slot is free; if it is positive, the page slot is filled with a
swapped-out page (the exact meaning of positive values is discussed
later in Section 16.3).
If the counter has the value SWAP_MAP_MAX
(equal
to 32, 767), the page stored in the page slot is
“permanent” and cannot be removed
from the corresponding slot. If the counter has the value
SWAP_MAP_BAD
(equal to 32,768), the
page
slot is considered defective, and thus unusable.[107]
The prio
field is a signed integer that denotes
the order in which the swap subsystem should consider each swap area.
Swap areas implemented on faster disks should have a higher priority
so they will be used first. Only when they are filled does the
swapping algorithm consider lower-priority swap areas. Swap areas
that have the same priority are cyclically selected to distribute
swapped-out pages among them. As we shall see in Section 16.2.3, the priority is assigned
when the swap area is activated.
The sdev_lock
field is a spin lock that protects
the descriptor against concurrent accesses in SMP systems.
The swap_info
array includes
MAX_SWAPFILES
swap area descriptors. Of course,
not all of them are necessarily used, only those having the
SWP_USED
flag set. Figure 16-1
illustrates the swap_info
array, one swap area,
and the corresponding array of counters.
The nr_swapfiles
variable stores the index of the
last array element that contains, or that has contained, a used swap
area descriptor. Despite its name, the variable does not contain the number of active swap areas.
Descriptors of active swap areas are also inserted into a list sorted
by the swap area priority. The list is implemented through the
next
field of the swap area descriptor, which
stores the index of the next descriptor in the
swap_info
array. This use of the field as an index
is different from most fields with the name next
,
which are usually pointers.
The swap_list
variable, of type
swap_list_t
, includes the following fields:
head
Index in the swap_info
array of the first list
element.
next
Index in the swap_info
array of the descriptor of
the next swap area to be selected for swapping out pages. This field
is used to implement a round-robin algorithm among maximum-priority
swap areas with free slots.
The swaplock
spin lock protects the list against
concurrent accesses in multiprocessor systems.
The max
field of the swap area descriptor stores
the size of the swap area in pages, while the
pages
field stores the number of usable page
slots. These numbers differ because pages
does not
take the first page slot and the defective page slots into
consideration.
Finally, the nr_swap_pages
variable contains the
number of available (free and nondefective) page slots in all active
swap areas, while the total_swap_pages
variable
contains the total number of nondefective page slots.
A swapped-out page is uniquely identified quite simply by specifying
the index of the swap area in the swap_info
array
and the page slot index inside the swap area. Since the first page
(with index 0) of the swap area is reserved for the
swap_header
union discussed earlier, the first
useful page slot has index 1. The format of a swapped-out page identifier
is illustrated in Figure 16-2.
The SWP_ENTRY(type,offset)
macro constructs a
swapped-out page identifier from the swap area index
type
and the page slot index
offset
. Conversely, the
SWP_TYPE
and SWP_OFFSET
macros
extract from a swapped-out page identifier the swap area index and
the page slot index, respectively.
When a page is swapped out, its identifier is inserted as the
page’s entry into the Page Table so the page can be
found again when needed. Notice that the least-significant bit of
such an identifier, which corresponds to the
Present
flag, is always cleared to denote the fact
that the page is not currently in RAM. However, at least one of the
remaining 31 bits has to be set because no page is ever stored in
slot 0 of swap area 0. It is therefore possible to identify
three different cases from the value of a Page Table entry:
The page does not belong to the process address space.
The page is currently swapped out.
The page is contained in RAM.
Notice that the maximum size of a swap area is determined by the number of bits available to identify a slot. On the 80 × 86 architecture, the 24 bits available limit the size of a swap area to 224 slots (that is, to 64 GB).
Since a page may belong to the address spaces of several processes
(see the later section Section 16.3), it may be swapped out from
the address space of one process and still remain in main memory;
therefore, it is possible to swap out the same page several times. A
page is physically swapped out and stored just once, of course, but
each subsequent attempt to swap it out increments the
swap_map
counter.
The swap_duplicate( )
function is usually invoked
while trying to swap out an already swapped-out page. It just
verifies that the swapped-out page identifier passed as its parameter
is valid and increments the corresponding swap_map
counter. More precisely, it performs the following actions:
Uses the SWP_TYPE
and
SWP_OFFSET
macros to extract the swap area number
type
and the page slot index
offset
from the parameter.
Checks whether the swap area is activated; if not, it returns 0 (invalid identifier).
Checks whether the page slot is valid and not free (its
swap_map
counter is greater than 0 and less than
SWAP_MAX_BAD
); if not, returns 0 (invalid
identifier).
Otherwise, the swapped-out page identifier locates a valid page.
Increments the swap_map
counter of the page slot
if it has not already reached the value
SWAP_MAP_MAX
.
Returns 1 (valid identifier).
Once a swap area is initialized, the superuser (or, more precisely,
any user having the CAP_SYS_ADMIN
capability, as
described in Section 20.1.1) may use the
swapon and swapoff programs
to activate and deactivate the swap area, respectively. These
programs use the swapon( )
and swapoff( )
system calls; we’ll briefly sketch out
the corresponding service routines.
The sys_swapon( )
service
routine receives the following as parameters:
specialfile
This parameter points to the pathname (in the User Mode address space) of the device file (partition) or plain file used to implement the swap area.
swap_flags
This parameter consists of a single
SWAP_FLAG_PREFER
bit plus 31 bits of priority of
the swap area (these bits are significant only if the
SWAP_FLAG_PREFER
bit is on).
The function checks the fields of the swap_header
union that was put in the first slot when the swap area was created.
The function performs these main steps:
Checks that the current process has the
CAP_SYS_ADMIN
capability.
Searches for the first element in the swap_info
array of swap area descriptors that have the
SWP_USED
flag cleared, meaning that the
corresponding swap area is inactive. If there is none, there are
already MAX_SWAPFILES
active swap areas, so the
function returns an error code.
A descriptor for the swap area has been found. The function
initializes the descriptor’s fields (setting
flags
to SWP_USED
, setting
lowest_bit
and highest_bit
to
0, and so on). Moreover, if the descriptor’s index
is greater than nr_swapfiles
, the function updates
that variable.
If the swap_flags
parameter specifies a priority
for the new swap area, the function sets the prio
field of the descriptor. Otherwise, it initializes the field with the
lowest priority among all active swap areas minus 1 (thus assuming
that the last activated swap area is on the slowest block device). If
no other swap areas are already active, the function assigns the
value -1.
Copies the string pointed to by the specialfile
parameter from the User Mode address space.
Invokes path_init( )
and path_walk( )
to perform a pathname lookup on the string copied from
the User Mode address space (see Section 12.5).
Stores the addresses of the dentry object and of the mounted
filesystem descriptor returned by path_walk( )
in
the swap_file
and swap_vfsmnt
fields of the swap area descriptor, respectively.
If the specialfile
parameter identifies a block
device file, the function performs the following substeps:
Stores the device number in the swap_device
field
of the descriptor.
Sets the block size of the device to 4 KB—that is, sets its
blksize_size
entry to
PAGE_SIZE
.
Initializes the block device driver by invoking the
bd_acquire( )
and do_open( )
functions, described in Section 13.4.2.
Checks to make sure that the swap area was not already activated by
looking at the address_space
objects of the other
active swap areas in swap_info
(given an address
q
of a swap area descriptor, the corresponding
address_space
object is obtained by
q->swap_file->d_inode->i_mapping
). If the
swap area is already active, it returns an error code.
Allocates a page frame and invokes rw_swap_page_nolock( )
(see Section 16.4 later in this chapter) to
fill it with the swap_header
union stored in the
first page of the swap area.
Checks that the magic string in the last ten characters of the first
page in the swap area is equal to SWAP-SPACE
or to
SWAPSPACE2
(there are two slightly different
versions of the swapping algorithm). If not, the
specialfile
parameter does not specify an already
initialized swap area, so the function returns an error code. For the
sake of brevity, we’ll suppose that the swap area
has the SWAPSPACE2
magic string.
Initializes the lowest_bit
and
highest_bit
fields of the swap area descriptor
according to the size of the swap area stored in the
info.last_page
field of the
swap_header
union.
Invokes vmalloc( )
to create the array of counters
associated with the new swap area and store its address in the
swap_map
field of the swap descriptor. Initializes
the elements of the array to 0 or to SWAP_MAP_BAD
,
according to the list of defective page slots stored in the
info.bad_pages
field of the
swap_header
union.
Computes the number of useful page slots by accessing the
info.last_page
and
info.nr_badpages
fields in the first page slot.
Sets the flags
field of the swap descriptor to
SWP_WRITEOK
, sets the pages
field to the number of useful page slots, and updates the
nr_swap_pages
and
total_swap_pages
variables.
Inserts the new swap area descriptor in the list to which the
swap_list
variable points.
Releases the page frame that contains the data of the first page of the swap area and returns 0 (success).
The sys_swapoff( )
service routine deactivates a swap area
identified by the parameter specialfile
. It is
much more complex and time-consuming than sys_swapon( )
, since the partition to be deactivated might still
contain pages that belong to several processes. The function is thus
forced to scan the swap area and to swap in all existing pages. Since
each swap in requires a new page frame, it might fail if there are no
free page frames left. In this case, the function returns an error
code. All this is achieved by performing the following major steps:
Checks that the current process has the
CAP_SYS_ADMIN
capability.
Copies the string pointed to by specialfile
, and
invokes path_init( )
and path_walk( )
to perform a pathname lookup.
Scans the list to which swap_list
points and
locates the descriptor whose swap_file
field
points to the dentry object found by the pathname lookup. If no such
descriptor exists, an invalid parameter was passed to the function,
so it returns an error code.
Otherwise, if the descriptor exists, checks that its
SWP_WRITE
flag is set; if not, returns an error
code because the swap area is already being deactivated by another
process.
Removes the descriptor from the list and sets its
flags
field to SWP_USED
so the
kernel doesn’t store more pages in the swap area
before this function deactivates it.
Subtracts the swap area size stored in the pages
field of the swap area descriptor from the values of
nr_swap_pages
and
total_swap_pages
.
Invokes the try_to_unuse( )
function (see below)
to successively force all pages left in the swap area into RAM and to
correspondingly update the Page Tables of the processes that use
these pages.
If try_to_unuse( )
fails in allocating all
requested page frames, the swap area cannot be deactivated.
Therefore, the function executes the following substeps:
Reinserts the swap area descriptor in the
swap_list
list and sets its
flags
field to SWP_WRITEOK
(see
Step 5)
Adds the content of the pages
field to the
nr_swap_pages
and
total_swap_pages
variables (see Step 6)
Invokes path_release( )
to release the VFS objects
allocated by path_walk( )
in Step 2.
Finally, returns an error code.
Otherwise, all used page slots have been successfully transferred to RAM. Therefore, the function executes the following substeps:
If specialfile
identifies a block device file,
releases the corresponding block device driver.
Invokes path_release( )
to release the VFS objects
allocated by path_walk( )
in Step 2.
Releases the memory area used to store the
swap_map
array.
Invokes path_release( )
again because the VFS
objects that refer to specialfile
have been
allocated by the path_walk( )
function invoked by
sys_swapon( )
(see Step 6 in the previous
section).
Returns 0 (success).
As stated previously, the
try_to_unuse( )
function swaps in pages and
updates all the Page Tables of processes that have swapped out pages.
To that end, the function visits the address spaces of all kernel
threads and processes, starting with the init_mm
memory descriptor that is used as a marker. It is a time-consuming
function that runs mostly with the interrupts enabled.
Synchronization with other processes is therefore critical.
The try_to_unuse( )
function scans the
swap_map
array of the swap area. When the function
finds a in-use page slot, it first swaps in the page, and then starts
looking for the processes that reference the page. The ordering of
these two operations is crucial to avoid race conditions. While the
I/O data transfer is ongoing, the page is locked, so no process can
access it. Once the I/O data transfer completes, the page is locked
again by try_to_unuse( )
, so it cannot be swapped
out again by another kernel control path. Race conditions are also
avoided because each process looks up the page cache before starting
a swap in or swap out operation (see the later section Section 16.3). Finally, the swap area
considered by try_to_unuse( )
is marked as
nonwritable (SWP_WRITE
flag is not set), so no
process can perform a swap out on a page slot of this area.
However, try_to_unuse( )
might be forced to scan
the swap_map
array of usage counters of the swap
area several times. This is because memory regions that contain
references to swapped-out pages might disappear during one scan and
later reappear in the process lists.
For instance, recall the description of the do_munmap( )
function (in Section 8.3.5):
whenever a process releases an interval of linear addresses,
do_munmap( )
removes from the process list all
memory regions that include the affected linear addresses; later, the
function reinserts the memory regions that have been only partially
unmapped in the process list. do_munmap( )
takes
care of freeing the swapped-out pages that belong to the interval of
released linear addresses; however, it commendably
doesn’t free the swapped-out pages that belong to
the memory regions that have to be reinserted in the process list.
Hence, try_to_unuse( )
might fail in finding a
process that references a given page slot because the corresponding
memory region is temporarily not included in the process list. To
cope with this fact, try_to_unuse( )
keeps
scanning the swap_map
array until all reference
counters are null. Eventually, the ghost memory regions referencing
the swapped-out pages will reappear in the process lists, so
try_to_unuse( )
will succeed in freeing all page
slots.
Let’s describe now the major operations executed by
try_to_unuse( )
. It executes a continuous loop on
the reference counters in the swap_map
array of
the swap area passed as its parameter. For each reference counter,
the function performs the following steps:
If the counter is equal to 0 (no page is stored there) or to
SWAP_MAP_BAD
, it continues with the next page
slot.
Otherwise, it invokes the read_swap_cache_async( )
function (see Section 16.4 later in this chapter) to
swap in the page. This consists of allocating, if necessary, a new
page frame, filling it with the data stored in the page slot, and
putting the page in the swap cache.
Waits until the new page has been properly updated from disk and locks it.
While the function was executing the previous step, the process could have been suspended. Therefore, it checks again whether the reference counter of the page slot is null; if so, it continues with the next page slot (this swap page has been freed by another kernel control path).
Invokes unuse_process( )
on every memory
descriptor in the doubly linked list whose head is
init_mm
(see Section 8.2). This time-consuming function scans all Page
Table entries of the process that owns the memory descriptor, and
replaces each occurrence of the swapped-out page identifier with the
physical address of the page frame. To reflect this move, the
function also decrements the page slot counter in the
swap_map
array (unless it is equal to
SWAP_MAP_MAX
) and increments the usage counter of
the page frame.
Invokes shmem_unuse( )
to check whether the
swapped-out page is used for an IPC shared memory resource and to
properly handle that case (see Section 19.3.5).
Checks the value of the reference counter of the page. If it is equal
to SWAP_MAP_MAX
, the page slot is
“permanent.” To free it, it forces
the value 1 into the reference counter.
The swap cache might own the page as well (it contributes to the
value of the reference counter). If the page belongs to the swap
cache, it invokes the rw_swap_page( )
function to
flush its contents on disk (if the page is dirty), invokes
delete_from_swap_cache( )
to remove the page from
the swap cache, and decrements its reference counter.
Sets the PG_dirty
flag of the page descriptor and
unlocks the page.
Checks the need_resched
field of the current
process; if it is set, it invokes schedule( )
to
relinquish the CPU. Deactivating a swap area is a long job, and the
kernel must ensure that the other processes in the system still
continue to execute. The try_to_unuse( )
function
continues from this step whenever the process is selected again by
the scheduler.
Proceeds with the next page slot. starting at Step 1.
The function continues until every reference counter in the
swap_map
array is null. Recall that even if the
function starts examining the next page slot, the reference counter
of the previous page slot could still be positive. In fact, a
“ghost” process could still
reference the page, typically because some memory regions have been
temporarily removed from the process list scanned in Step 5.
Eventually, try_to_unuse( )
catches every
reference. In the meantime, however, the page is no longer in the
swap cache, it is unlocked, and a copy is still included in the page
slot of the swap area being deactivated.
One might expect that this situation could lead to data loss. For instance, suppose that some “ghost” process accesses the page slot and starts swapping the page in. Since the page is no longer in the swap cache, the process fills a new page frame with the data read from disk. However, this page frame would be different from the page frames owned by the processes that are supposed to share the page with the “ghost” process.
This problem does not arise when deactivating a swap area because
interference from a ghost process could happen only if a swapped-out
page belongs to a private anonymous memory mapping.[108] In this case, the page frame is handled by
means of the Copy on Write mechanism described in Chapter 8, so it is perfectly legal to assign different
page frames to the processes that reference the page. However, the
try_to_unuse( )
function marks the page as
“dirty” (Step 9); otherwise, the
try_to_swap_out( )
function might later drop the
page from the Page Table of some process without saving it in an
another swap area (see the later section Section 16.5).
As we shall see later, when freeing memory, the kernel swaps out many pages in a short period of time. It is therefore important to try to store these pages in contiguous slots to minimize disk seek time when accessing the swap area.
A first approach to an algorithm that searches for a free slot could choose one of two simplistic, rather extreme strategies:
Always start from the beginning of the swap area. This approach may increase the average seek time during swap-out operations because free page slots may be scattered far away from one another.
Always start from the last allocated page slot. This approach increases the average seek time during swap-in operations if the swap area is mostly free (as is usually the case) because the handful of occupied page slots may be scattered far away from one another.
Linux adopts a hybrid approach. It always starts from the last allocated page slot unless one of these conditions occurs:
The end of the swap area is reached
SWAPFILE_CLUSTER
(usually 256) free page slots
were allocated after the last restart from the beginning of the swap
area
The cluster_nr
field in the
swap_info_struct
descriptor stores the number of
free page slots allocated. This field is reset to 0 when the function
restarts allocation from the beginning of the swap area. The
cluster_next
field stores the index of the first
page slot to be examined in the next allocation.[109]
To speed up the search for free page slots, the kernel keeps the
lowest_bit
and highest_bit
fields of each swap area descriptor up to date. These fields specify
the first and the last page slots that could be free; in other words,
any page slot below lowest_bit
and above
highest_bit
is known to be occupied.
The scan_swap_map( )
function is used to find a
free page slot in a given swap area. It acts on a single parameter,
which points to a swap area descriptor and returns the index of a
free page slot. It returns 0 if the swap area does not contain any
free slots. The function performs the following steps:
It tries first to use the current cluster. If the
cluster_nr
field of the swap area descriptor is
positive, it scans the swap_map
array of counters
starting from the element at index cluster_next
and looks for a null entry. If a null entry is found, it decrements
the cluster_nr
field and goes to Step 4.
If this point is reached, either the cluster_nr
field is null or the search starting from
cluster_next
didn’t find a null
entry in the swap_map
array. It is time to try the
second stage of the hybrid search. It reinitializes
cluster_nr
to SWAPFILE_CLUSTER
and restarts scanning the array from the
lowest_bit
index that is trying to find a group of
SWAPFILE_CLUSTER
free page slots. If such a group
is found, it goes to Step 4.
No group of SWAPFILE_CLUSTER
free page slots
exists. It restarts scanning the array from the
lowest_bit
index that is trying to find a single
free page slot. If no null entry is found, it sets the
lowest_bit
field to the maximum index in the
array, the highest_bit
field to 0, and returns 0
(the swap area is full).
A null entry is found. Puts the value 1 in the entry, decrements
nr_swap_pages
, updates the
lowest_bit
and highest_bit
fields if necessary, and sets the cluster_next
field to the index of the page slot just allocated plus 1.
Returns the index of the allocated page slot.
The get_swap_page( )
function is used to find a
free page slot by searching all the active swap areas. The function,
which returns the index of a newly allocated page slot or 0 if all
swap areas are filled, takes into consideration the different
priorities of the active swap areas.
Two passes are necessary. The first pass is partial and applies only to areas that have a single priority; the function searches such areas in a round-robin fashion for a free slot. If no free page slot is found, a second pass is made starting from the beginning of the swap area list; during this second pass, all swap areas are examined. More precisely, the function performs the following steps:
If nr_swap_pages
is null or if there are no active
swap areas, returns 0.
Starts by considering the swap area pointed to by
swap_list.next
(recall that the swap area list is
sorted by decreasing priorities).
If the swap area is active and not being deactivated, invokes
scan_swap_map( )
to allocate a free page slot. If
scan_swap_map( )
returns a page slot index, the
function’s job is essentially done, but it must
prepare for its next invocation. Thus, it updates
swap_list.next
to point to the next swap area in
the swap area list, if the latter has the same priority (thus
continuing the round-robin use of these swap areas). If the next swap
area does not have the same priority as the current one, the function
sets swap_list.next
to the first swap area in the
list (so that the next search will start with the swap areas that
have the highest priority). The function finishes by returning the
identifier corresponding to the page slot just allocated.
Either the swap area is not writable, or it does not have free page slots. If the next swap area in the swap area list has the same priority as the current one, the function makes it the current one and goes to Step 3.
At this point, the next swap area in the swap area list has a lower priority than the previous one. The next step depends on which of the two passes the function is performing.
If this is the first (partial) pass, it considers the first swap area in the list and goes to Step 3, thus starting the second pass.
Otherwise, it checks if there is a next element in the list; if so, it considers it and goes to Step 3.
At this point the list is completely scanned by the second pass and no free page slot has been found; it returns 0.
The swap_free( )
function is invoked when swapping
in a page to decrement the corresponding swap_map
counter (see Table 16-1). When the counter reaches
0, the page slot becomes free since its identifier is no longer
included in any Page Table entry. We’ll see in the
later section Section 16.3,
however, that the swap cache counts as an owner of the page slot.
The function acts on a single entry
parameter that
specifies a swapped-out page identifier and performs the following
steps:
Derives the swap area index and the offset
page
slot index from the entry
parameter and gets the
address of the swap area descriptor.
Checks whether the swap area is active and returns right away if it is not.
If the swap_map
counter corresponding to the page
slot being freed is smaller than SWAP_MAP_MAX
,
decrements it. Recall that entries that have the
SWAP_MAP_MAX
value are considered persistent
(undeletable).
If the swap_map
counter becomes 0, increments the
value of nr_swap_pages
and updates, if necessary,
the lowest_bit
and highest_bit
fields of the swap area descriptor.
[107] “Permanent” page slots
protect against overflows of swap_map
counters.
Without them, valid page slots could become
“defective” if they are referenced
too many times, thus leading to data losses. However, no one really
expects that a page slot counter could reach the value 32,768.
It’s just a “belt and
suspenders” approach.
[108] Actually, the page might also belong to an IPC shared memory region; Chapter 19 has a discussion of this case.
[109] As you may have noticed, the names of Linux data structures are not always appropriate. In this case, the kernel does not really “cluster” page slots of a swap area.