We already know that it is preferable to map memory areas into sets of contiguous page frames, thus making better use of the cache and achieving lower average memory access times. Nevertheless, if the requests for memory areas are infrequent, it makes sense to consider an allocation schema based on noncontiguous page frames accessed through contiguous linear addresses. The main advantage of this schema is to avoid external fragmentation, while the disadvantage is that it is necessary to fiddle with the kernel Page Tables. Clearly, the size of a noncontiguous memory area must be a multiple of 4,096. Linux uses noncontiguous memory areas in several ways — for instance, to allocate data structures for active swap areas (see Section 16.2.3), to allocate space for a module (see Appendix B), or to allocate buffers to some I/O drivers.
To find a free range of linear
addresses, we can look in the area starting from
PAGE_OFFSET
(usually
0xc0000000
, the beginning of the fourth gigabyte).
Figure 7-7 shows how the linear addresses of the fourth gigabyte are used:
The beginning of the area includes the linear addresses that map the
first 896 MB of RAM (see Section 2.5.4);
the linear address that corresponds to the end of the directly mapped
physical memory is stored in the high_memory
variable.
The end of the area contains the fix-mapped linear addresses (see Section 2.5.6).
Starting from PKMAP_BASE
(0xfe000000
), we find the linear addresses used
for the persistent kernel mapping of high-memory page frames (see
Section 7.1.6 earlier in
this chapter).
The remaining linear addresses can be used for noncontiguous memory
areas. A safety interval of size 8 MB (macro
VMALLOC_OFFSET
) is inserted between the end of the
physical memory mapping and the first memory area; its purpose is to
“capture” out-of-bounds memory
accesses. For the same reason, additional safety intervals of size 4
KB are inserted to separate noncontiguous memory areas.
The VMALLOC_START
macro defines the starting
address of the linear space reserved for noncontiguous memory areas,
while VMALLOC_END
defines its ending address.
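As a reference point, on the 80x86 architecture these symbols are defined roughly as follows. This is a sketch modeled on include/asm-i386/pgtable.h of the 2.4 kernels; the exact expressions vary slightly between kernel versions and configurations:
#define VMALLOC_OFFSET  (8*1024*1024)   /* the 8 MB safety interval */
/* First linear address above the direct physical memory mapping,
   aligned so that an 8 MB hole separates it from high_memory. */
#define VMALLOC_START   (((unsigned long) high_memory + \
                          2*VMALLOC_OFFSET - 1) & ~(VMALLOC_OFFSET - 1))
/* The area ends just below the persistent kernel mapping area
   (below the fix-mapped addresses when high memory is not configured). */
#define VMALLOC_END     (PKMAP_BASE - 2*PAGE_SIZE)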
Each
noncontiguous memory area is associated with a descriptor of type
struct
vm_struct
:
struct vm_struct {
    unsigned long flags;
    void * addr;
    unsigned long size;
    struct vm_struct * next;
};
These descriptors are inserted in a simple list by means of the
next
field; the address of the first element of
the list is stored in the vmlist
variable.
Accesses to this list are protected by means of the
vmlist_lock
read/write spin lock. The
addr
field contains the linear address of the
first memory cell of the area; the size
field
contains the size of the area plus 4,096 (which is the size of the
previously mentioned inter-area safety interval).
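To see how the list is meant to be traversed, here is a minimal debugging sketch that walks vmlist and prints each registered area. The dump_vmlist( ) helper is a made-up name, but the fields, the vmlist variable, and the vmlist_lock read/write spin lock are the ones just described:
#include <linux/vmalloc.h>
#include <linux/kernel.h>
#include <linux/spinlock.h>

static void dump_vmlist(void)    /* hypothetical debugging helper */
{
    struct vm_struct *tmp;

    read_lock(&vmlist_lock);     /* readers only need the read lock */
    for (tmp = vmlist; tmp; tmp = tmp->next)
        printk("area at %p, %lu bytes (4 KB guard included)\n",
               tmp->addr, tmp->size);
    read_unlock(&vmlist_lock);
}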
The get_vm_area( )
function creates new
descriptors of type struct vm_struct
; its
parameter size
specifies the size of the new
memory area. The function is essentially equivalent to the following:
struct vm_struct * get_vm_area(unsigned long size, unsigned long flags)
{
    unsigned long addr;
    struct vm_struct **p, *tmp, *area;

    area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL);
    if (!area)
        return NULL;
    size += PAGE_SIZE;
    addr = VMALLOC_START;
    write_lock(&vmlist_lock);
    for (p = &vmlist; (tmp = *p) ; p = &tmp->next) {
        if (size + addr <= (unsigned long) tmp->addr)
            break;
        addr = tmp->size + (unsigned long) tmp->addr;
        if (addr + size > VMALLOC_END) {
            write_unlock(&vmlist_lock);
            kfree(area);
            return NULL;
        }
    }
    /* Insert the new descriptor either before tmp or, if the end of
       the list was reached, at the tail of the list. */
    area->flags = flags;
    area->addr = (void *) addr;
    area->size = size;
    area->next = *p;
    *p = area;
    write_unlock(&vmlist_lock);
    return area;
}
The function first calls kmalloc( )
to obtain a
memory area for the new descriptor. It then scans the list of
descriptors of type struct vm_struct
looking for
an available range of linear addresses that includes at least
size+4096
addresses. If such an interval exists,
the function initializes the fields of the descriptor and terminates
by returning the address of the descriptor; its addr field holds the initial linear address of the new noncontiguous memory area.
Otherwise, when addr + size
exceeds
VMALLOC_END
, get_vm_area( )
releases the descriptor and returns NULL
.
The vmalloc( )
function allocates a noncontiguous memory area to the kernel. The
parameter size
denotes the size of the requested
area. If the function is able to satisfy the request, it then returns
the initial linear address of the new area; otherwise, it returns a
NULL
pointer:
void * vmalloc(unsigned long size)
{
    void * addr;
    struct vm_struct *area;

    size = (size + PAGE_SIZE - 1) & PAGE_MASK;
    area = get_vm_area(size, VM_ALLOC);
    if (!area)
        return NULL;
    addr = area->addr;
    if (vmalloc_area_pages((unsigned long) addr, size,
                           GFP_KERNEL|__GFP_HIGHMEM, 0x63)) {
        vfree(addr);
        return NULL;
    }
    return addr;
}
The function starts by rounding up the value of the
size
parameter to a multiple of 4,096 (the page
frame size). Then vmalloc( )
invokes
get_vm_area( )
, which creates a new descriptor and
returns the linear addresses assigned to the memory area. The
flags
field of the descriptor is initialized with
the VM_ALLOC
flag, which means that the linear
address range is going to be used for a noncontiguous memory
allocation (we’ll see in Chapter 13 that vm_struct
descriptors
are also used to remap memory on hardware devices). Then the
vmalloc( )
function invokes
vmalloc_area_pages( )
to request noncontiguous
page frames and terminates by returning the initial linear address of
the noncontiguous memory area.
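As a usage sketch, a kernel component that needs a large buffer whose page frames need not be physically contiguous could pair vmalloc( ) and vfree( ) as follows; the buffer size and the function names are made up for the example:
#include <linux/vmalloc.h>
#include <linux/errno.h>

static void *big_buf;                /* hypothetical module-level buffer */

static int setup_big_buf(void)
{
    big_buf = vmalloc(128 * 1024);   /* rounded up to a multiple of 4,096 */
    if (!big_buf)
        return -ENOMEM;              /* no suitable linear address range
                                        or no free page frames */
    return 0;
}

static void release_big_buf(void)
{
    if (big_buf)
        vfree(big_buf);
}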
The vmalloc_area_pages( )
function uses four
parameters:
address
The initial linear address of the area.
size
The size of the area.
gfp_mask
The allocation flags passed to the buddy system allocator function.
It is always set to GFP_KERNEL|__GFP_HIGHMEM
.
prot
The protection bits of the allocated page frames. It is always set to
0x63
, which corresponds to
Present
, Accessed
,
Read/Write
, and Dirty
.
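As a quick check, 0x63 is just the OR of the corresponding 80x86 Page Table flag bits; the values below follow the standard i386 definitions:
#define _PAGE_PRESENT   0x001    /* bit 0: Present    */
#define _PAGE_RW        0x002    /* bit 1: Read/Write */
#define _PAGE_ACCESSED  0x020    /* bit 5: Accessed   */
#define _PAGE_DIRTY     0x040    /* bit 6: Dirty      */

/* _PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY == 0x63 */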
The function starts by assigning the linear address of the end of the
area to the end
local variable:
end = address + size;
The function then uses the pgd_offset_k
macro to
derive the entry in the master kernel Page Global Directory related
to the initial linear address of the area; it then acquires the
kernel Page Table spin lock:
dir = pgd_offset_k(address);
spin_lock(&init_mm.page_table_lock);
The function then executes the following cycle:
while (address < end) {
    pmd_t *pmd = pmd_alloc(&init_mm, dir, address);
    ret = -ENOMEM;
    if (!pmd)
        break;
    if (alloc_area_pmd(pmd, address, end - address, gfp_mask, prot))
        break;
    address = (address + PGDIR_SIZE) & PGDIR_MASK;
    dir++;
    ret = 0;
}
spin_unlock(&init_mm.page_table_lock);
return ret;
In each cycle, it first invokes pmd_alloc( )
to
create a Page Middle Directory for the new area and writes its
physical address in the right entry of the kernel Page Global
Directory. It then calls alloc_area_pmd( )
to
allocate all the Page Tables associated with the new Page Middle
Directory. It adds the constant
2^22 (the size of the range of linear
addresses spanned by a single Page Middle Directory) to the
current value of address
, and it increases the
pointer dir
to the Page Global Directory.
The cycle is repeated until all Page Table entries referring to the noncontiguous memory area are set up.
The alloc_area_pmd( )
function executes a similar
cycle for all the Page Tables that a Page Middle Directory points to:
while (address < end) {
    pte_t * pte = pte_alloc(&init_mm, pmd, address);
    if (!pte)
        return -ENOMEM;
    if (alloc_area_pte(pte, address, end - address, gfp_mask, prot))
        return -ENOMEM;
    address = (address + PMD_SIZE) & PMD_MASK;
    pmd++;
}
The pte_alloc( )
function (see Section 2.5.2) allocates a new Page Table and updates the
corresponding entry in the Page Middle Directory. Next,
alloc_area_pte( )
allocates all the page frames
corresponding to the entries in the Page Table. The value of
address
is increased by
2^22 (the size of the linear address
interval spanned by a single Page Table) and the cycle is
repeated.
The main cycle of alloc_area_pte( )
is:
while (address < end) {
    struct page * page;
    spin_unlock(&init_mm.page_table_lock);
    page = alloc_page(gfp_mask);
    spin_lock(&init_mm.page_table_lock);
    if (!page)
        return -ENOMEM;
    set_pte(pte, mk_pte(page, prot));
    address += PAGE_SIZE;
    pte++;
}
Each page frame is allocated through alloc_page( )
. The physical address of the new page frame is written
into the Page Table by the set_pte
and
mk_pte
macros. The cycle is repeated after adding
the constant 4,096 (the length of a page frame) to
address
.
Notice that the Page Tables of the current process are not touched by
vmalloc_area_pages( )
. Therefore, when a process
in Kernel Mode accesses the noncontiguous memory area, a
Page Fault
occurs, since the entries in the process’s Page
Tables corresponding to the area are null. However, the Page Fault
handler checks the faulty linear address against the master kernel
Page Tables (which are init_mm.pgd
Page Global
Directory and its child Page Tables; see Section 2.5.5). Once the handler discovers that a master
kernel Page Table includes a non-null entry for the address, it
copies its value into the corresponding process’s
Page Table entry and resumes normal execution of the process. This
mechanism is described in Section 8.4.
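In essence, the handler performs something like the following lazy synchronization. This is a simplified sketch: sync_kernel_pgd( ) is a made-up name, and the real i386 handler described in Section 8.4 reads the process's Page Global Directory through the cr3 register rather than through a memory descriptor:
#include <linux/mm.h>
#include <asm/pgtable.h>

static int sync_kernel_pgd(struct mm_struct *mm, unsigned long address)
{
    pgd_t *pgd   = pgd_offset(mm, address);   /* process's entry       */
    pgd_t *pgd_k = pgd_offset_k(address);     /* master kernel entry   */

    if (!pgd_present(*pgd_k))
        return -1;          /* no master entry either: the access is a bug */
    set_pgd(pgd, *pgd_k);   /* copy the entry; the Page Middle Directory
                               and Page Tables below it are shared         */
    return 0;
}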
The vfree( )
function
releases noncontiguous memory areas. Its parameter
addr
contains the initial linear address of the
area to be released. vfree( )
first scans the list
pointed to by vmlist
to find the address of the
area descriptor associated with the area to be released:
write_lock(&vmlist_lock);
for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) {
    if (tmp->addr == addr) {
        *p = tmp->next;
        vmfree_area_pages((unsigned long)(tmp->addr), tmp->size);
        write_unlock(&vmlist_lock);
        kfree(tmp);
        return;
    }
}
write_unlock(&vmlist_lock);
printk("Trying to vfree( ) nonexistent vm area (%p)\n", addr);
The size
field of the descriptor specifies the
size of the area to be released. The area itself is released by
invoking vmfree_area_pages( )
, while the
descriptor is released by invoking kfree( )
.
The vmfree_area_pages( )
function takes two
parameters: the initial linear address and the size of the area. It
executes the following cycle to reverse the actions performed by
vmalloc_area_pages( )
:
dir = pgd_offset_k(address);
while (address < end) {
    free_area_pmd(dir, address, end - address);
    address = (address + PGDIR_SIZE) & PGDIR_MASK;
    dir++;
}
In turn, free_area_pmd( )
reverses the actions of
alloc_area_pmd( )
in the cycle:
while (address < end) {
    free_area_pte(pmd, address, end - address);
    address = (address + PMD_SIZE) & PMD_MASK;
    pmd++;
}
Again, free_area_pte( )
reverses the activity of
alloc_area_pte( )
in the cycle:
while (address < end) {
    pte_t page = *pte;
    pte_clear(pte);
    address += PAGE_SIZE;
    pte++;
    if (pte_none(page))
        continue;
    if (pte_present(page)) {
        __free_page(pte_page(page));
        continue;
    }
    printk("Whee... Swapped out page in kernel page table\n");
}
Each page frame assigned to the noncontiguous memory area is released
by means of the buddy system __free_page( )
function. The corresponding entry in the Page Table is set to 0 by
the pte_clear
macro.
As for vmalloc( )
, the kernel modifies the entries
of the master kernel Page Global Directory and its child Page Tables
(see Section 2.5.5), but it leaves
unchanged the entries of the process Page Tables mapping the fourth
gigabyte. This is fine because the kernel never reclaims Page Middle
Directories and Page Tables rooted at the master kernel Page Global
Directory.
For instance, suppose that a process in Kernel Mode accessed a
noncontiguous memory area that later got released. The
process’s Page Global Directory entries are equal to
the corresponding entries of the master kernel Page Global Directory,
thanks to the mechanism explained in Section 8.4; they point to the same Page Middle
Directories and Page Tables. The vmfree_area_pages( )
function clears only the entries of the Page Tables
(without reclaiming the Page Tables themselves). Further accesses by
the process to the released noncontiguous memory area will trigger
Page Faults because of the null Page Table entries. However, the
handler will consider such accesses a bug because the master kernel
Page Tables do not include valid entries.