We saw in Chapter 2 how Linux takes advantage of 80 × 86’s segmentation and paging circuits to translate logical addresses into physical ones. We also mentioned that some portion of RAM is permanently assigned to the kernel and used to store both the kernel code and the static kernel data structures.
The remaining part of the RAM is called dynamic memory. It is a valuable resource, needed not only by the processes but also by the kernel itself. In fact, the performance of the entire system depends on how efficiently dynamic memory is managed. Therefore, all current multitasking operating systems try to optimize the use of dynamic memory, assigning it only when it is needed and freeing it as soon as possible.
This chapter, which consists of three main sections, describes how the kernel allocates dynamic memory for its own use. Section 7.1 and Section 7.2 illustrate two different techniques for handling physically contiguous memory areas, while Section 7.3 illustrates a third technique that handles noncontiguous memory areas.
We saw in Section 2.4 how the Intel Pentium processor can use two different page frame sizes: 4 KB and 4 MB (or 2 MB if PAE is enabled—see Section 2.4.6). Linux adopts the smaller 4 KB page frame size as the standard memory allocation unit. This makes things simpler for two reasons:
The Page Fault exceptions issued by the paging circuitry are easily interpreted. Either the page requested exists but the process is not allowed to address it, or the page does not exist. In the second case, the memory allocator must find a free 4 KB page frame and assign it to the process.
The 4 KB size is a multiple of most disk block sizes, so transfers of data between main memory and disks are more efficient. Yet this smaller size is much more manageable than the 4 MB size.
The kernel must keep track of the current status of each page frame. For instance, it must be able to distinguish the page frames that are used to contain pages that belong to processes from those that contain kernel code or kernel data structures. Similarly, it must be able to determine whether a page frame in dynamic memory is free. A page frame in dynamic memory is free if it does not contain any useful data. It is not free when the page frame contains data of a User Mode process, data of a software cache, dynamically allocated kernel data structures, buffered data of a device driver, code of a kernel module, and so on.
State information of a page frame is kept in a page descriptor of type struct page, whose fields are shown in Table 7-1. All page descriptors are stored in the mem_map array. Since each descriptor is less than 64 bytes long, mem_map requires about four page frames for each megabyte of RAM.
Table 7-1. The fields of the page descriptor
Type |
Name |
Description |
---|---|---|
struct list_head | list | Contains pointers to next and previous items in a doubly linked list of page descriptors |
struct address_space * | mapping | Used when the page is inserted into the page cache (see Section 14.1) |
unsigned long | index | Either the position of the data stored in the page within the page’s disk image (see Chapter 14) or a swapped-out page identifier (see Chapter 16) |
struct page * | next_hash | Contains pointer to next item in a doubly linked circular list of the page cache hash table |
atomic_t | count | Page’s reference counter |
unsigned long | flags | Array of flags (see Table 7-2) |
struct list_head | lru | Contains pointers to the least recently used doubly linked list of pages |
wait_queue_head_t | wait | Page’s wait queue |
struct page ** | pprev_hash | Contains pointer to previous item in a doubly linked circular list of the page cache hash table |
struct buffer_head * | buffers | Used when the page stores buffers (see Section 13.4.8.2) |
void * | virtual | Linear address of the page frame in the fourth gigabyte (see Section 7.1.5 later in this chapter) |
struct zone_struct * | zone | The zone to which the page frame belongs (see Section 7.1.2) |
You don’t have to fully understand the role of all fields in the page descriptor right now. In the following chapters, we often come back to them. Moreover, several fields have different meanings, depending on whether the page frame is free and on which kernel component is using it.
Let’s describe in greater detail two of the fields:
count
A usage reference counter for the page. If it is set to 0, the corresponding page frame is free and can be assigned to any process or to the kernel itself. If it is set to a value greater than 0, the page frame is assigned to one or more processes or is used to store some kernel data structures.
flags
Includes up to 32 flags (see Table 7-2) that describe the status of the page frame. For each PG_xyz flag, the kernel defines some macros that manipulate its value. Usually, the PageXyz macro returns the value of the flag, while the SetPageXyz and ClearPageXyz macros set and clear the corresponding bit, respectively.
Table 7-2. Flags describing the status of a page frame
Flag name |
Meaning |
---|---|
PG_locked | The page is involved in a disk I/O operation. |
PG_error | An I/O error occurred while transferring the page. |
PG_referenced | The page has been recently accessed for a disk I/O operation. |
PG_uptodate | The flag is set after completing a read operation, unless a disk I/O error happened. |
PG_dirty | The page has been modified (see Section 16.5.1). |
PG_lru | The page is in the active or inactive page list (see Section 16.7.2). |
PG_active | The page is in the active page list (see Section 16.7.2). |
PG_slab | The page frame is included in a slab (see Section 7.2 later in this chapter). |
PG_skip | Not used. |
PG_highmem | The page frame belongs to the ZONE_HIGHMEM zone (see Section 7.1.2). |
PG_checked | The flag used by the Ext2 filesystem (see Chapter 17). |
PG_arch_1 | Not used on the 80 × 86 architecture. |
PG_reserved | The page frame is reserved to kernel code or is unusable. |
PG_launder | The page is involved in an I/O operation triggered by shrink_cache( ). |
In an ideal computer architecture, a page frame is a memory storage unit that can be used for anything: storing kernel and user data, buffering disk data, and so on. Any kind of page of data can be stored in any page frame, without limitations.
However, real computer architectures have hardware constraints that may limit the way page frames can be used. In particular, the Linux kernel must deal with two hardware constraints of the 80 × 86 architecture:
The Direct Memory Access (DMA) processors for ISA buses have a strong limitation: they are able to address only the first 16 MB of RAM.
In modern 32-bit computers with lots of RAM, the CPU cannot directly access all physical memory because the linear address space is too small.
To cope with these two limitations, Linux partitions the physical memory into three zones:
ZONE_DMA
Contains pages of memory below 16 MB
ZONE_NORMAL
Contains pages of memory at and above 16 MB and below 896 MB
ZONE_HIGHMEM
Contains pages of memory at and above 896 MB
The ZONE_DMA
zone includes memory pages that can
be used by old ISA-based devices by means of the DMA. (Section 13.1.4 gives further details on DMA.)
The ZONE_DMA
and ZONE_NORMAL
zones include the “normal” pages of
memory that can be directly accessed by the kernel through the linear
mapping in the fourth gigabyte of the linear address space (see Section 2.5.5). Conversely, the
ZONE_HIGHMEM
zone includes pages of memory that
cannot be directly accessed by the kernel through the linear mapping
in the fourth gigabyte of linear address space (see Section 7.1.6 later in this chapter). The
ZONE_HIGHMEM
zone is not used on 64-bit
architectures.
Each memory zone has its own descriptor of type struct zone_struct
(or equivalently, zone_t
).
Its fields are shown in Table 7-3.
Table 7-3. The fields of the zone descriptor
Type |
Name |
Description |
---|---|---|
char * | name | Contains a pointer to the conventional name of the zone: “DMA,” “Normal,” or “HighMem” |
unsigned long | size | Number of pages in the zone |
spinlock_t | lock | Spin lock protecting the descriptor |
unsigned long | free_pages | Number of free pages in the zone |
unsigned long | pages_min | Minimum number of pages of the zone that should remain free (see Section 16.7) |
unsigned long | pages_low | Lower threshold value for the zone’s page balancing algorithm (see Section 16.7) |
unsigned long | pages_high | Upper threshold value for the zone’s page balancing algorithm (see Section 16.7) |
int | need_balance | Flag indicating that the zone’s page balancing algorithm should be activated (see Section 16.7) |
free_area_t [MAX_ORDER] | free_area | Used by the buddy system page allocator (see Section 7.1.7 later in this chapter) |
struct pglist_data * | zone_pgdat | Pointer to the descriptor of the node to which this zone belongs |
struct page * | zone_mem_map | Array of page descriptors of the zone (see Section 7.1.7 later in this chapter) |
unsigned long | zone_start_paddr | First physical address of the zone |
unsigned long | zone_start_mapnr | First page descriptor index of the zone |
The zone
field in the page descriptor points to
the descriptor of the zone to which the corresponding page frame
belongs.
The zone_names
array stores the canonical names of
the three zones: “DMA,”
“Normal,” and
“HighMem.”
When the kernel invokes a memory allocation function, it must specify
the zones that contain the requested page frames. The kernel usually
specifies which zones it’s willing to use. For
instance, if a page frame must be directly mapped in the fourth
gigabyte of linear addresses but it is not going to be used for ISA
DMA transfers, then the kernel requests a page frame either in
ZONE_NORMAL
or in ZONE_DMA
. Of
course, the page frame should be obtained from
ZONE_DMA
only if ZONE_NORMAL
does not have free page frames. To specify the preferred zones in a
memory allocation request, the kernel uses the
struct
zonelist_struct
data
structure (or equivalently zonelist_t
), which is
an array of zone descriptor pointers.
We are used to thinking of the computer’s memory as a homogeneous, shared resource. Disregarding the role of the hardware caches, we expect that the time required for a CPU to access a memory location is essentially the same, regardless of the location’s physical address and of the CPU. Unfortunately, this assumption is not true in some architectures. For instance, it is not true for some multiprocessor Alpha or MIPS computers.
Linux 2.4 supports the Non-Uniform Memory Access (NUMA) model, in which the access times for different memory locations from a given CPU may vary. The physical memory of the system is partitioned into several nodes. The time needed by any given CPU to access pages within a single node is the same. However, this time might not be the same for two different CPUs. For every CPU, the kernel tries to minimize the number of accesses to costly nodes by carefully selecting where the kernel data structures that are most often referenced by the CPU are stored.
The physical memory inside each node can be split into several zones,
as we saw in the previous section. Each node has a
descriptor of type
pg_data_t
, whose fields are shown in Table 7-4. All node descriptors are stored in a simply
linked list, whose first element is pointed to by the
pgdat_list
variable.
Table 7-4. The fields of the node descriptor
Type |
Name |
Description |
---|---|---|
zone_t [MAX_NR_ZONES] | node_zones | Array of zone descriptors of the node |
zonelist_t [GFP_ZONEMASK+1] | node_zonelists | Array of zonelist_t data structures used by the page allocator (see Section 7.1.5 later in this chapter) |
int | nr_zones | Number of zones in the node |
struct page * | node_mem_map | Array of page descriptors of the node |
unsigned long * | valid_addr_bitmap | Bitmap of usable physical addresses for the node |
struct bootmem_data * | bdata | Used in the kernel initialization phase |
unsigned long | node_start_paddr | First physical address of the node |
unsigned long | node_start_mapnr | First page descriptor index of the node |
unsigned long | node_size | Size of the node (in pages) |
int | node_id | Identifier of the node |
pg_data_t * | node_next | Next item in the node list |
As usual, we are mostly concerned with the 80
× 86 architecture. IBM-compatible PCs use the Uniform Memory Access (UMA) model, thus NUMA support is not really
required. However, even if NUMA support is not compiled in the
kernel, Linux makes use of a single node that includes all system
physical memory; the corresponding descriptor is stored in the
contig_page_data
variable.
On the 80 × 86 architecture, grouping the physical memory in a single node might appear useless; however, this approach makes the memory handling code more portable, because the kernel may assume that the physical memory is partitioned in one or more nodes in all architectures.[53]
Dynamic memory and the values
used to refer to it are illustrated in Figure 7-1.
The zones of memory are not drawn to scale;
ZONE_NORMAL
is usually larger than
ZONE_DMA
, and, if present,
ZONE_HIGHMEM
is usually larger than
ZONE_NORMAL
. Notice that
ZONE_HIGHMEM
starts from physical address
0x38000000
, which corresponds to 896 MB.
We already described how the paging_init( )
function initializes the kernel Page Tables according to the amount
of RAM in the system in Section 2.5.5.
Beside Page Tables, the paging_init( )
function
also initializes other memory handling data structures. It invokes
kmap_init( )
, which essentially sets up the
kmap_pte
variable to create
“windows” of linear addresses that
allow the kernel to address the ZONE_HIGHMEM
zone
(see Section 7.1.6.2 later
in this chapter). Then, paging_init( )
invokes the
free_area_init( )
function, passing an array
storing the sizes of the three memory zones to it.
The free_area_init( )
function sets up both the
zone descriptors and the page descriptors. The function receives the
zones_size
array (size of each memory zone) as its
parameter, and executes the following operations:[54]
Computes the total number of page frames in RAM by adding the values
in zones_size
, and stores the result in the
totalpages
local variable.
Initializes the active_list
and
inactive_list
lists of page descriptors (see Chapter 16).
Allocates space for the mem_map
array of page
descriptors. The space needed is the product of
totalpages
by the page descriptor size.
Initializes some fields of the node descriptor
contig_page_data
:
contig_page_data.node_size = totalpages;
contig_page_data.node_start_paddr = 0x00000000;
contig_page_data.node_start_mapnr = 0;
Initializes some fields of all page descriptors. All page frames are
marked as reserved, but later, the PG_reserved
flag of the page frames in dynamic memory will be cleared:
for (p = mem_map; p < mem_map + totalpages; p++) {
    p->count = 0;
    SetPageReserved(p);
    init_waitqueue_head(&p->wait);
    p->list.next = p->list.prev = p;
}
Stores the address of the memory zone descriptor in the
zone
local variable and for each element of the
zone_names
array (index j
between 0 and 2), performs the following steps:
Initializes some fields of the descriptor:
zone->name = zone_names[j];
zone->size = zones_size[j];
zone->lock = SPIN_LOCK_UNLOCKED;
zone->zone_pgdat = &contig_page_data;
zone->free_pages = 0;
zone->need_balance = 0;
If the zone is empty (that is, it does not include any page frame), the function goes back to the beginning of Step 6 and continues with the next zone.
Otherwise, the zone includes at least one page frame and the function
initializes the pages_min
,
pages_low
, and pages_high
fields of the zone descriptor (see Chapter 16).
Sets up the zone_mem_map
field of the zone
descriptor to the address of the first page descriptor in the zone.
Sets up the zone_start_mapnr
field of the zone
descriptor to the index of the first page descriptor in the zone.
Sets up the zone_start_paddr
field of the zone
descriptor to the physical address of the first page frame in the
zone.
Stores the address of the zone descriptor in the
zone
field of the page descriptor for each page
frame of the zone.
If the zone is either ZONE_DMA
or
ZONE_NORMAL
, stores the linear address in the
fourth gigabyte that maps the page frame into the
virtual
field of every page descriptor of the
zone.
Initializes the free_area_t
structures in the
free_area
array of the zone descriptor (see
Section 7.1.7 later in this
chapter).
Initializes the node_zonelists
array of the
contig_page_data
node descriptor. The array
includes 16 elements; each element corresponds to a different type of
memory request and specifies the zones (in order of preference) from
where the page frames could be retrieved. See Section 7.1.5 later in this chapter for
further details.
When the paging_init( )
function terminates,
dynamic memory is not yet usable because the
PG_reserved
flag of all pages is set. Memory
initialization is further carried on by the mem_init( )
function, which is invoked subsequently to
paging_init( )
.
Essentially, the mem_init( )
function initializes
the value of num_physpages
, the total number of
page frames present in the system. It then scans all page frames
associated with the dynamic memory; for each of them, the function
sets the count
field of the corresponding
descriptor to 1, resets the PG_reserved
flag, sets
the PG_highmem
flag if the page belongs to
ZONE_HIGHMEM
, and calls the free_page( )
function on it. Besides releasing the page frame (see
Section 7.1.7 later in this
chapter), free_page( )
also increments the value
of the free_pages
field of the memory zone
descriptor that owns the page frame. The
free_pages
fields of all zone descriptors are used
by the nr_free_pages( )
function to compute the
total number of free page frames in the dynamic memory.
The mem_init( )
function also counts the number of
page frames that are not associated with dynamic memory. Several
symbols produced while compiling the kernel (some are described in
Section 2.5.3) enable the function to
count the number of page frames reserved for the hardware, kernel
code, and kernel data, and the number of page frames used during
kernel initialization that can be successively released.
After having seen how the kernel allocates and initializes the data structures for page frame handling, we now look at how page frames are allocated and released.
Page frames can be requested by using six slightly different
functions and macros. Unless otherwise stated, they return the linear
address of the first allocated page, or return
NULL
if the allocation failed.
alloc_pages(gfp_mask, order)
Function used to request 2^order contiguous
page frames. It returns the address of the descriptor of the first
allocated page frame or returns NULL
if the
allocation failed.
alloc_page(gfp_mask)
Macro used to get a single page frame; it expands to:
alloc_pages(gfp_mask, 0)
It returns the address of the descriptor of the allocated page frame
or returns NULL
if the allocation failed.
_ _get_free_pages(gfp_mask, order)
Function that is similar to alloc_pages( )
, but it
returns the linear address of the first allocated page.
_ _get_free_page(gfp_mask)
Macro used to get a single page frame; it expands to:
_ _get_free_pages(gfp_mask, 0)
get_zeroed_page(gfp_mask)
, or equivalently get_free_page(gfp_mask)
Function that invokes:
alloc_pages(gfp_mask, 0)
and then fills the page frame obtained with zeros.
_ _get_dma_pages(gfp_mask, order)
Macro used to get page frames suitable for DMA; it expands to:
_ _get_free_pages(gfp_mask | _ _GFP_DMA, order)
The parameter gfp_mask
specifies how to look for
free page frames. It consists of the following flags:
_ _GFP_WAIT
The kernel is allowed to block the current process waiting for free page frames.
_ _GFP_HIGH
The kernel is allowed to access the pool of free page frames left for recovering from very low memory conditions.
_ _GFP_IO
The kernel is allowed to perform I/O transfers on low memory pages in order to free page frames.
_ _GFP_HIGHIO
The kernel is allowed to perform I/O transfers on high memory pages in order to free page frames.
_ _GFP_FS
The kernel is allowed to perform low-level VFS operations.
_ _GFP_DMA
The requested page frames must be included in the
ZONE_DMA
zone (see the earlier Section 7.1.2).
_ _GFP_HIGHMEM
The requested page frames can be included in the
ZONE_HIGHMEM
zone.
In practice, Linux uses the predefined combinations of flag values shown in Table 7-5; the group name is what you’ll encounter as argument of the six page frame allocation functions.
Table 7-5. Groups of flag values used to request page frames
Group name |
Corresponding flags |
---|---|
GFP_ATOMIC | _ _GFP_HIGH |
GFP_NOIO | _ _GFP_HIGH | _ _GFP_WAIT |
GFP_NOHIGHIO | _ _GFP_HIGH | _ _GFP_WAIT | _ _GFP_IO |
GFP_NOFS | _ _GFP_HIGH | _ _GFP_WAIT | _ _GFP_IO | _ _GFP_HIGHIO |
GFP_KERNEL | _ _GFP_HIGH | _ _GFP_WAIT | _ _GFP_IO | _ _GFP_HIGHIO | _ _GFP_FS |
GFP_NFS | _ _GFP_HIGH | _ _GFP_WAIT | _ _GFP_IO | _ _GFP_HIGHIO | _ _GFP_FS |
GFP_USER | _ _GFP_WAIT | _ _GFP_IO | _ _GFP_HIGHIO | _ _GFP_FS |
GFP_HIGHUSER | _ _GFP_WAIT | _ _GFP_IO | _ _GFP_HIGHIO | _ _GFP_FS | _ _GFP_HIGHMEM |
GFP_KSWAPD | _ _GFP_WAIT | _ _GFP_IO | _ _GFP_HIGHIO | _ _GFP_FS |
The _ _GFP_DMA
and _ _GFP_HIGHMEM
flags are called zone modifiers
; they specify the zones searched by
the kernel while looking for free page frames. The
node_zonelists
field of the
contig_page_data
node descriptor is an array of
lists of zone descriptors; each list is associated with one specific
combination of the zone modifiers. Although the array includes 16
elements, only 4 are really used, since there are currently only 2
zone modifiers. They are shown in Table 7-6.
Table 7-6. Zone modifier lists
_ _GFP_DMA |
_ _GFP_HIGHMEM |
Zone list |
---|---|---|
0 | 0 | ZONE_NORMAL, ZONE_DMA |
0 | 1 | ZONE_HIGHMEM, ZONE_NORMAL, ZONE_DMA |
1 | 0 | ZONE_DMA |
1 | 1 | ZONE_DMA |
Page frames can be released through each of the following four functions and macros:
_ _free_pages(page, order)
This function checks the page descriptor pointed to by
page
; if the page frame is not reserved (i.e., if
the PG_reserved
flag is equal to 0), it decrements
the count
field of the descriptor. If
count
becomes 0, it assumes that
2^order contiguous page frames starting from the one corresponding to page are no longer used. In this case, the
are no longer used. In this case, the
function invokes _ _free_pages_ok( )
to insert the
page frame descriptor of the first free page in the proper list of
free page frames (described in the following section).
free_pages(addr, order)
This function is similar to _ _free_pages( )
, but
it receives as an argument the linear address addr
of the first page frame to be released.
_ _free_page(page)
This macro releases the page frame having the descriptor pointed to
by page
; it expands to:
_ _free_pages(page, 0)
free_page(addr)
This macro releases the page frame having the linear address
addr
; it expands to:
free_pages(addr, 0)
Page frames above the 896 MB boundary are not mapped in the fourth gigabyte of the kernel linear address spaces, so they cannot be directly accessed by the kernel. This implies that any page allocator function that returns the linear address of the assigned page frame doesn’t work for the high memory.
For instance, suppose that the kernel invoked _ _get_free_pages(GFP_HIGHMEM,0)
to allocate a page frame in
high memory. If the allocator assigned a page frame in high memory,
_ _get_free_pages( )
cannot return its linear
address because it doesn’t exist; thus, the function
returns NULL
. In turn, the kernel cannot use the
page frame; even worse, the page frame cannot be released because the
kernel has lost track of it.
In short, allocation of high-memory page frames must be done only
through the alloc_pages( )
function and its
alloc_page( )
shortcut, which both return the
address of the page descriptor of the first allocated page frame.
Once allocated, a high-memory page frame has to be mapped into the
fourth gigabyte of the linear address space, even though the physical
address of the page frame may well exceed 4 GB.
To do this, the kernel may use three different mechanisms, which are called permanent kernel mappings, temporary kernel mappings, and noncontiguous memory allocation. In this section, we focus on the first two techniques; the third one is discussed in Section 7.3 later in this chapter.
Establishing a permanent kernel mapping may block the current process; this happens when no free Page Table entries exist that can be used as “windows” on the page frames in high memory (see the next section). Thus, a permanent kernel mapping cannot be used in interrupt handlers and deferrable functions. Conversely, establishing a temporary kernel mapping never requires blocking the current process; its drawback, however, is that very few temporary kernel mappings can be established at the same time.
Of course, none of these techniques allow addressing the whole RAM simultaneously. After all, only 128 MB of linear address space are left for mapping the high memory, while PAE supports systems having up to 64 GB of RAM.
Permanent kernel mappings allow the
kernel to establish long-lasting mappings of high-memory page frames
into the kernel address space. They use a dedicated Page Table whose
address is stored in the pkmap_page_table
variable. The number of entries in the Page Table is yielded by the
LAST_PKMAP
macro. As usual, the Page Table
includes either 512 or 1,024 entries, according to whether PAE is
enabled or disabled (see Section 2.4.6);
thus, the kernel can access at most 2 or 4 MB of high memory at once.
The Page Table maps the linear addresses starting from
PKMAP_BASE
(usually
0xfe000000
). The address of the descriptor
corresponding to the first page frame in high memory is stored in the
highmem_start_page
variable.
The pkmap_count
array includes
LAST_PKMAP
counters, one for each entry of the
pkmap_page_table
Page Table. We distinguish three
cases:
The counter is 0: the corresponding Page Table entry does not map any high-memory page frame and is usable.
The counter is 1: the corresponding Page Table entry does not map any high-memory page frame, but it cannot be used because the corresponding TLB entry has not been flushed since its last usage.
The counter is n (greater than 1): the corresponding Page Table entry maps a high-memory page frame, which is used by exactly n-1 kernel components.
The kmap( )
function establishes a permanent
kernel mapping. It is essentially equivalent to the following code:
void * kmap(struct page * page)
{
    if (page < highmem_start_page)
        return page->virtual;
    return kmap_high(page);
}
The virtual
field of the page descriptor stores
the linear address in the fourth gigabyte mapping the page frame, if
any. Thus, for any page frame below the 896 MB boundary, the field
always includes the physical address of the page frame plus
PAGE_OFFSET
. Conversely, if the page frame is in
high memory, the virtual
field has a non-null
value only if the page frame is currently mapped, either by the
permanent or the temporary kernel mapping.
The kmap_high( )
function is invoked if the page
frame really belongs to the high memory. The function is essentially
equivalent to the following code:
void * kmap_high(struct page * page)
{
    unsigned long vaddr;
    spin_lock(&kmap_lock);
    vaddr = (unsigned long) page->virtual;
    if (!vaddr)
        vaddr = map_new_virtual(page);
    pkmap_count[(vaddr-PKMAP_BASE) >> PAGE_SHIFT]++;
    spin_unlock(&kmap_lock);
    return (void *) vaddr;
}
The function gets the kmap_lock
spin lock to
protect the Page Table against concurrent accesses in multiprocessor
systems. Notice that there is no need to disable the interrupts
because kmap( )
cannot be invoked by interrupt
handlers and deferrable functions. Next, the kmap_high( )
function checks whether the virtual
field of the page descriptor already stores a non-null linear
address. If not, the function invokes the map_new_virtual( )
function to insert the page frame physical address in an
entry of pkmap_page_table
. Then
kmap_high( )
increments the counter corresponding
to the linear address of the page frame by 1 because another kernel
component is going to access the page frame. Finally,
kmap_high( )
releases the
kmap_lock
spin lock and returns the linear address
that maps the page.
The map_new_virtual( )
function essentially
executes two nested loops:
for (;;) {
    int count;
    DECLARE_WAITQUEUE(wait, current);
    for (count = LAST_PKMAP; count > 0; --count) {
        last_pkmap_nr = (last_pkmap_nr + 1) & (LAST_PKMAP - 1);
        if (!last_pkmap_nr) {
            flush_all_zero_pkmaps( );
            count = LAST_PKMAP;
        }
        if (!pkmap_count[last_pkmap_nr]) {
            unsigned long vaddr = PKMAP_BASE +
                                  (last_pkmap_nr << PAGE_SHIFT);
            set_pte(&(pkmap_page_table[last_pkmap_nr]),
                    mk_pte(page, 0x63));
            pkmap_count[last_pkmap_nr] = 1;
            page->virtual = (void *) vaddr;
            return vaddr;
        }
    }
    current->state = TASK_UNINTERRUPTIBLE;
    add_wait_queue(&pkmap_map_wait, &wait);
    spin_unlock(&kmap_lock);
    schedule( );
    remove_wait_queue(&pkmap_map_wait, &wait);
    spin_lock(&kmap_lock);
    if (page->virtual)
        return (unsigned long) page->virtual;
}
In the inner loop, the function scans all counters in
pkmap_count
that are looking for a null value. The
last_pkmap_nr
variable stores the index of the
last used entry in the pkmap_page_table
Page
Table. Thus, the search starts from where it was left in the last
invocation of the map_new_virtual( )
function.
When the last counter in pkmap_count
is reached,
the search restarts from the counter at index 0. Before continuing,
however, map_new_virtual( )
invokes the
flush_all_zero_pkmaps( )
function, which starts
another scanning of the counters looking for the value 1. Each
counter that has value 1 denotes an entry in
pkmap_page_table
that is free but cannot be used
because the corresponding TLB entry has not yet been flushed.
flush_all_zero_pkmaps( )
issues the TLB flushes on
such entries and resets their counters to zero.
If the inner loop cannot find a null counter in
pkmap_count
, the map_new_virtual( )
function blocks the current process until some other
process releases an entry of the pkmap_page_table
Page Table. This is achieved by inserting current
in the pkmap_map_wait
wait queue, setting the
current
state to
TASK_UNINTERRUPTIBLE
and invoking
schedule( )
to relinquish the CPU. Once the
process is awoken, the function checks whether another process has
mapped the page by looking at the virtual
field of
the page descriptor; if some other process has mapped the page, the
inner loop is restarted.
When a null counter is found by the inner loop, the
map_new_virtual( )
function:
Computes the linear address that corresponds to the counter.
Writes the page’s physical address into the entry in
pkmap_page_table
. The function also sets the bits
Accessed
, Dirty
,
Read/Write
, and Present
(value
0x63
) in the same entry.
Sets to 1 the pkmap_count
counter.
Writes the linear address into the virtual
field
of the page descriptor.
Returns the linear address.
The kunmap( )
function destroys a permanent kernel
mapping. If the page is really in the high memory zone, it invokes
the kunmap_high( )
function, which is essentially
equivalent to the following code:
void kunmap_high(struct page * page)
{
    spin_lock(&kmap_lock);
    if ((--pkmap_count[((unsigned long) page->virtual - PKMAP_BASE)
                       >> PAGE_SHIFT]) == 1)
        wake_up(&pkmap_map_wait);
    spin_unlock(&kmap_lock);
}
Notice that if the counter of the Page Table entry becomes equal to 1
(free), kunmap_high( )
wakes up the processes
waiting in the pkmap_map_wait
wait
queue.
Temporary kernel mappings are simpler to implement than permanent kernel mappings; moreover, they can be used inside interrupt handlers and deferrable functions because they never block the current process.
Any page frame in high memory can be mapped through a window in the kernel address space—namely, a Page Table entry that is reserved for this purpose. The number of windows reserved for temporary kernel mappings is quite small.
Each CPU has its own set of five windows whose linear addresses are
identified by the enum km_type
data structure:
enum km_type {
    KM_BOUNCE_READ,
    KM_SKB_DATA,
    KM_SKB_DATA_SOFTIRQ,
    KM_USER0,
    KM_USER1,
    KM_TYPE_NR
};
The kernel must ensure that the same window is never used by two
kernel control paths at the same time. Thus, each symbol is named
after the kernel component that is allowed to use the corresponding
window. The last symbol, KM_TYPE_NR
, does not
represent a linear address by itself, but yields the number of
different windows usable by every CPU.
Each symbol in km_type
, except the last one, is an
index of a fix-mapped linear address (see Section 2.5.6). The enum
fixed_addresses
data structure includes the
symbols FIX_KMAP_BEGIN
and
FIX_KMAP_END
; the latter is assigned to the index
FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1
. In this
manner, there are KM_TYPE_NR
fix-mapped linear
addresses for each CPU in the system. Furthermore, the kernel
initializes the kmap_pte
variable with the address
of the Page Table entry corresponding to the
fix_to_virt(FIX_KMAP_BEGIN) linear address.
To establish a temporary kernel mapping, the kernel invokes the
kmap_atomic( )
function, which is essentially
equivalent to the following code:
void * kmap_atomic(struct page * page, enum km_type type)
{
    enum fixed_addresses idx;
    if (page < highmem_start_page)
        return page->virtual;
    idx = type + KM_TYPE_NR * smp_processor_id( );
    set_pte(kmap_pte-idx, mk_pte(page, 0x063));
    _ _flush_tlb_one(fix_to_virt(FIX_KMAP_BEGIN+idx));
    return (void *) fix_to_virt(FIX_KMAP_BEGIN+idx);
}
The type argument and the CPU identifier specify what fix-mapped linear address has to be used to map the requested page. The function returns the linear address of the page frame if it doesn't belong to high memory; otherwise, it sets up the Page Table entry corresponding to the fix-mapped linear address with the page's physical address and the bits Present, Accessed, Read/Write, and Dirty. Finally, the TLB entry corresponding to the linear address is flushed.
To destroy a temporary kernel mapping, the kernel uses the kunmap_atomic( ) function. In the 80 × 86 architecture, however, this function does nothing.
Temporary kernel mappings should be used carefully. A kernel control path using a temporary kernel mapping must never block, because another kernel control path might use the same window to map some other high memory page.
The kernel must establish a robust and efficient strategy for allocating groups of contiguous page frames. In doing so, it must deal with a well-known memory management problem called external fragmentation : frequent requests and releases of groups of contiguous page frames of different sizes may lead to a situation in which several small blocks of free page frames are “scattered” inside blocks of allocated page frames. As a result, it may become impossible to allocate a large block of contiguous page frames, even if there are enough free pages to satisfy the request.
There are essentially two ways to avoid external fragmentation:
Use the paging circuitry to map groups of noncontiguous free page frames into intervals of contiguous linear addresses.
Develop a suitable technique to keep track of the existing blocks of free contiguous page frames, avoiding as much as possible the need to split up a large free block to satisfy a request for a smaller one.
The second approach is preferred by the kernel for three good reasons:
In some cases, contiguous page frames are really necessary, since contiguous linear addresses are not sufficient to satisfy the request. A typical example is a memory request for buffers to be assigned to a DMA processor (see Chapter 13). Since the DMA ignores the paging circuitry and accesses the address bus directly while transferring several disk sectors in a single I/O operation, the buffers requested must be located in contiguous page frames.
Even if contiguous page frame allocation is not strictly necessary, it offers the big advantage of leaving the kernel paging tables unchanged. What’s wrong with modifying the Page Tables? As we know from Chapter 2, frequent Page Table modifications lead to higher average memory access times, since they make the CPU flush the contents of the translation lookaside buffers.
Large chunks of contiguous physical memory can be accessed by the kernel through 4 MB pages. The reduction of translation lookaside buffer misses, in comparison to the use of 4 KB pages, significantly speeds up the average memory access time (see Section 2.4.8).
The technique adopted by Linux to solve the external fragmentation problem is based on the well-known buddy system algorithm. All free page frames are grouped into 10 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, and 512 contiguous page frames, respectively. The physical address of the first page frame of a block is a multiple of the group size—for example, the initial address of a 16-page-frame block is a multiple of 16 × 2^12 (2^12 = 4,096, which is the regular page size).
We’ll show how the algorithm works through a simple example.
Assume there is a request for a group of 128 contiguous page frames (i.e., a half-megabyte). The algorithm checks first to see whether a free block in the 128-page-frame list exists. If there is no such block, the algorithm looks for the next larger block—a free block in the 256-page-frame list. If such a block exists, the kernel allocates 128 of the 256 page frames to satisfy the request and inserts the remaining 128 page frames into the list of free 128-page-frame blocks. If there is no free 256-page block, the kernel then looks for the next larger block (i.e., a free 512-page-frame block). If such a block exists, it allocates 128 of the 512 page frames to satisfy the request, inserts the first 256 of the remaining 384 page frames into the list of free 256-page-frame blocks, and inserts the last 128 of the remaining 384 page frames into the list of free 128-page-frame blocks. If the list of 512-page-frame blocks is empty, the algorithm gives up and signals an error condition.
The reverse operation, releasing blocks of page frames, gives rise to the name of this algorithm. The kernel attempts to merge pairs of free buddy blocks of size b together into a single block of size 2b. Two blocks are considered buddies if:
Both blocks have the same size, say b.
They are located in contiguous physical addresses.
The physical address of the first page frame of the first block is a multiple of 2 × b × 2^12.
The algorithm is iterative; if it succeeds in merging released blocks, it doubles b and tries again so as to create even bigger blocks.
Linux uses a different buddy system for each zone. Thus, in the 80 × 86 architecture, there are three buddy systems: the first handles the page frames suitable for ISA DMA, the second handles the “normal” page frames, and the third handles the high-memory page frames. Each buddy system relies on the following main data structures:
The mem_map array introduced previously. Actually, each zone is concerned with a subset of the mem_map elements. The first element in the subset and its number of elements are specified, respectively, by the zone_mem_map and size fields of the zone descriptor.
An array having 10 elements of type free_area_t, one element for each group size. The array is stored in the free_area field of the zone descriptor.
Ten binary arrays named bitmaps , one for each group size. Each buddy system has its own set of bitmaps, which it uses to keep track of the blocks it allocates.
The free_area_t (or equivalently, struct free_area_struct) data structure is defined as follows:

    typedef struct free_area_struct {
        struct list_head free_list;
        unsigned long   *map;
    } free_area_t;
The kth element of the free_area array in the zone descriptor is associated with a doubly linked circular list of blocks of size 2^k; each member of such a list is the descriptor of the first page frame of a block. The list is implemented through the list field of the page descriptor.
The map field points to a bitmap whose size depends on the number of page frames in the zone. Each bit of the bitmap of the kth entry of the free_area array describes the status of two buddy blocks of size 2^k page frames. If a bit of the bitmap is equal to 0, either both buddy blocks of the pair are free or both are busy; if it is equal to 1, exactly one of the blocks is busy. When both buddies are free, the kernel treats them as a single free block of size 2^(k+1).
Let’s consider, for the sake of illustration, a zone including 128 MB of RAM. The 128 MB can be divided into 32,768 single pages, 16,384 groups of 2 pages each, or 8,192 groups of 4 pages each, and so on up to 64 groups of 512 pages each. So the bitmap corresponding to free_area[0] consists of 16,384 bits, one for each pair of the 32,768 existing page frames; the bitmap corresponding to free_area[1] consists of 8,192 bits, one for each pair of blocks of two consecutive page frames; the last bitmap, corresponding to free_area[9], consists of 32 bits, one for each pair of blocks of 512 contiguous page frames.
Figure 7-2 illustrates the use of the data structures introduced by the buddy system algorithm. The array zone_mem_map contains nine free page frames grouped in one block of one (a single page frame) at the top and two blocks of four further down. The double arrows denote doubly linked circular lists implemented by the free_list field. Notice that the bitmaps are not drawn to scale.
The alloc_pages( ) function is the core of the buddy system allocation algorithm. Any other allocator function or macro listed in the earlier Section 7.1.5 ends up invoking alloc_pages( ).
The function considers the list of the contig_page_data.node_zonelists array corresponding to the zone modifiers specified in the argument gfp_mask. Starting with the first zone descriptor in the list, it compares the number of free page frames in the zone (stored in the free_pages field of the zone descriptor), the number of requested page frames (argument order of alloc_pages( )), and the threshold value stored in the pages_low field of the zone descriptor. If free_pages - 2^order is smaller than or equal to pages_low, the function skips the zone and considers the next zone in the list. If no zone has enough free page frames, alloc_pages( ) restarts the loop, this time looking for a zone that has at least pages_min free page frames. If such a zone doesn't exist and if the current process is allowed to wait, the function invokes balance_classzone( ), which in turn invokes try_to_free_pages( ) to reclaim enough page frames to satisfy the memory request (see Section 16.7).
When alloc_pages( ) finds a zone with a suitable number of free page frames, it invokes the rmqueue( ) function to allocate a block in that zone. The function takes two arguments: the address of the zone descriptor, and order, which denotes the logarithm of the size of the requested block of free pages (0 for a one-page block, 1 for a two-page block, and so forth). If the page frames are successfully allocated, the rmqueue( ) function returns the address of the page descriptor of the first allocated page frame; that address is also returned by alloc_pages( ). Otherwise, rmqueue( ) returns NULL, and alloc_pages( ) considers the next zone in the list.
The rmqueue( ) function is equivalent to the following fragments. First, a few local variables are declared and initialized:

    free_area_t * area = &(zone->free_area[order]);
    unsigned int curr_order = order;
    struct list_head *head, *curr;
    struct page *page;
    unsigned long flags;
    unsigned int index;
    unsigned long size;
The function disables interrupts and acquires the spin lock of the zone descriptor because it will alter its fields to allocate a block. Then it performs a cyclic search through each list for an available block (denoted by an entry that doesn't point to the entry itself), starting with the list for the requested order and continuing if necessary to larger orders. This is equivalent to the following fragment:

    spin_lock_irqsave(&zone->lock, flags);
    do {
        head = &area->free_list;
        curr = head->next;
        if (curr != head)
            goto block_found;
        curr_order++;
        area++;
    } while (curr_order < 10);
    spin_unlock_irqrestore(&zone->lock, flags);
    return NULL;
If the loop terminates, no suitable free block has been found, so rmqueue( ) returns a NULL value. Otherwise, a suitable free block has been found; in this case, the descriptor of its first page frame is removed from the list, the corresponding bitmap is updated, and the value of free_pages in the zone descriptor is decreased:

    block_found:
        page = list_entry(curr, struct page, list);
        list_del(curr);
        index = page - zone->zone_mem_map;
        if (curr_order != 9)
            change_bit(index >> (1+curr_order), area->map);
        zone->free_pages -= 1UL << order;
If the block found comes from a list of size curr_order greater than the requested size order, a while cycle is executed. The rationale behind these lines of code is as follows: when it becomes necessary to use a block of 2^k page frames to satisfy a request for 2^h page frames (h < k), the program allocates the last 2^h page frames and iteratively reassigns the first 2^k - 2^h page frames to the free_area lists that have indexes between h and k:

    size = 1 << curr_order;
    while (curr_order > order) {
        area--;
        curr_order--;
        size >>= 1;
        /* insert *page as first element in the list
           and update the bitmap */
        list_add(&page->list, &area->free_list);
        change_bit(index >> (1+curr_order), area->map);
        /* now take care of the second half of the
           free block starting at *page */
        index += size;
        page += size;
    }
Finally, rmqueue( ) releases the spin lock, updates the count field of the page descriptor associated with the selected block, and executes the return instruction:

    spin_unlock_irqrestore(&zone->lock, flags);
    atomic_set(&page->count, 1);
    return page;
As a result, the alloc_pages( ) function returns the address of the page descriptor of the first page frame allocated.
The __free_pages_ok( ) function implements the buddy system strategy for freeing page frames. It uses two input parameters:

page
    The address of the descriptor of the first page frame included in the block to be released

order
    The logarithmic size of the block
The __free_pages_ok( ) function usually inserts the block of page frames in the buddy system data structures so they can be used in subsequent allocation requests. One case is an exception: if the current process is moving pages across memory zones to rebalance them, the function does not free the page frames, but inserts the block in a special list of the process. To decide what to do with the block of page frames, __free_pages_ok( ) checks the PF_FREE_PAGES flag of the process. It is set only if the process is rebalancing the memory zones. Anyway, we discuss this special case in Chapter 16; we assume here that the PF_FREE_PAGES flag of current is not set, so __free_pages_ok( ) inserts the block in the buddy system data structures.
The function starts by declaring and initializing a few local variables:
    unsigned long flags;
    unsigned long mask = (~0UL) << order;
    zone_t * zone = page->zone;
    struct page * base = zone->zone_mem_map;
    free_area_t * area = &zone->free_area[order];
    unsigned long page_idx = page - base;
    unsigned long index = page_idx >> (1 + order);
    struct page * buddy;
The page_idx local variable contains the index of the first page frame in the block with respect to the first page frame of the zone. The index local variable contains the bit number corresponding to the block in the bitmap.
The function clears the PG_referenced and PG_dirty flags of the first page frame, then it acquires the zone spin lock and disables interrupts:

    page->flags &= ~((1 << PG_referenced) | (1 << PG_dirty));
    spin_lock_irqsave(&zone->lock, flags);
The mask local variable contains the two's complement of 2^order. It is used to increment the counter of free page frames in the zone:

    zone->free_pages -= mask;
The function now starts a cycle executed at most 9 - order times, once for each possibility for merging a block with its buddy. The function starts with the smallest-sized block and moves up to the top size. The condition driving the while loop is:

    mask + (1 << 9)
In the body of the loop, the bits set in mask are shifted to the left at each iteration and the buddy block of the block that has the number page_idx is checked:

    if (!test_and_change_bit(index, area->map))
        break;
If the buddy block is not free, the function breaks out of the cycle; if it is free, the function detaches it from the corresponding list of free blocks. The block number of the buddy is derived from page_idx by switching a single bit:

    buddy = &base[page_idx ^ -mask];
    list_del(&buddy->list);
At the end of each iteration, the function updates the mask, area, index, and page_idx local variables:

    mask <<= 1;
    area++;
    index >>= 1;
    page_idx &= mask;
The function then continues the next iteration, trying to merge free blocks twice as large as the ones considered in the previous cycle. When the cycle is finished, the free block obtained cannot be merged further with other free blocks. It is then inserted in the proper list:
list_add(&(base[page_idx].list), &area->free_list);
Finally, the function releases the zone spin lock and returns:
    spin_unlock_irqrestore(&zone->lock, flags);
    return;
[53] We have another example of this kind of design choice: Linux uses three levels of Page Tables even when the hardware architecture defines just two levels (see Section 2.5).
[54] In NUMA architectures, these operations must be performed separately on every node. However, we are focusing on the 80 × 86 architecture, which has just one node.