Linux systems use an illusion referred to as "virtual memory." This mechanism makes every memory address virtual, which means they do not point to any address in the RAM directly. This way, whenever we access a memory location, a translation mechanism is performed in order to match the corresponding physical memory.
In this chapter, we will deal with the whole Linux memory allocation and management system, covering the following topics:
Though system memory (also known as RAM) can be extended in some computers that allow it, physical memory is a limited resource in computer systems.
Virtual memory is a concept, an illusion given to each process so that it thinks it has large and almost infinite memory, and sometimes more than the system really has. To set up everything, we will introduce the address space, virtual or logical address, physical address, and bus address terms:
As the MMU is the centerpiece of memory management, it organizes memory into logical units of fixed size called pages. The size of a page is a power of 2 in bytes and varies among systems. A page is backed by a page frame, and the size of the page matches a page frame. Before going further in our learning of memory management, let's introduce the other terms:
Finally, the term "page-aligned" is used to qualify an address that starts exactly at the beginning of a page. It goes without saying that any memory whose address is a multiple of the system page size is said to be page-aligned. For example, on a 4 KB page size system, 4.096, 20.480, and 409.600 are instances of page-aligned memory addresses.
Note
The page size is fixed by the MMU, and the operating system can't modify it. Some processors allow for multiple page sizes (for example, ARMv8-A supports three different granule sizes: 4 KB, 16 KB, and 64 KB), and the OS can decide which one to use. 4 KB is a widely popular page granularity though.
Now that the terms frequently used while dealing with memory have been introduced, let's focus on memory management and organization by the kernel.
Linux is a virtual memory operating system. On a Linux running system, each process and even the kernel itself (and some devices) is allocated address space, which is some portion of the processor's virtual address space (note that neither the kernel nor processes deal with physical addresses – only MMU does). While this virtual address space is split between the kernel and user space, the upper part is used for the kernel and the lower part is used for user space.
The split varies between architectures and is held by the CONFIG_PAGE_OFFSET kernel configuration option. For 32-bit systems, the split is at 0xC0000000 by default. This is called the 3 GB/1 GB split, where the user space is given the lower 3 GB of virtual address space. The kernel can, however, be given a different amount of address space as desired by playing with the CONFIG_VMSPLIT_1G, CONFIG_VMSPLIT_2G, and CONFIG_VMSPLIT_3G_OPT kernel configuration options (see arch/x86/Kconfig and arch/arm/Kconfig). For 64-bit, the split varies by architecture, but it's high enough: 0x8000000000000000 for 64-bit ARM, and 0xffff880000000000 for x86_64.
A typical process's virtual address space layout looks like the following on a 32-bit system with the default splitting scheme:
While this layout is transparent on 64-bit systems, there are particularities on 32-bit machines that need to be introduced. In the next sections, we will study in detail the reason for this memory splitting, its usage, and where it applies.
In an ideal world, all memory is permanently mappable. There are, however, some restrictions on 32-bit systems. This results in only a portion of RAM being permanently mapped. This part of memory can be accessed directly (by simple dereference) by the kernel and is called low memory, while the part of (physical) memory not covered by a permanent mapping is referred to as high memory. There are various architecture-dependent constraints on where exactly that border lies. For example, Intel cores can permanently map only up to the first 1 GB of RAM. This is a little bit less, 896 MiB of RAM, because part of this low memory is used to dynamically map high memory:
In the preceding diagram, we can see that 128 MB of the kernel address space is used to map a high memory of RAM on the fly when needed. On the other hand, 896 MB of kernel address space is permanently and linearly mapped to a low 896 MB of RAM.
The high-memory mechanism can also be used on a 1 GB RAM system to dynamically map user memory whenever the kernel needs access. The fact that the kernel can map the whole RAM into its address space does not mean the user space can't access it. More than one mapping to a RAM page frame can exist; it can be both permanently mapped to the kernel memory space and mapped to some address in the user space when the process is chosen for execution.
Note
Given a virtual address, you can distinguish whether it is a kernel space or a user space address by using the process layout shown previously. Every address below PAGE_OFFSET comes from the user space; otherwise, it is from the kernel.
The first 896 MB of kernel address space constitutes the low memory region. Early in the boot process, the kernel permanently maps that 896 MB onto physical RAM. Addresses that result from that mapping are called logical addresses. These are virtual addresses but can be translated into physical addresses by subtracting a fixed offset, since the mapping is permanent and known in advance. Low memory matches with the lower bound of physical addresses. You can define low memory as being the memory for which logical addresses exist in the kernel space. The core of the kernel stays in low memory. Therefore, most of the kernel memory functions return low memory. In fact, to serve different purposes, kernel memory is divided into zones. Actually, the first 16 MB of LOWMEM is reserved for Direct Memory Access (DMA) usage. Hardware does not always allow you to treat all pages as identical because of limitations. We can then identify three different memory zones in the kernel space:
However, on a 512 MB system, there will be no ZONE_HIGHMEM, 16 MB for ZONE_DMA, and 496 MB for ZONE_NORMAL.
From all the preceding, we can complete the definition of logical addresses, adding that these are addresses in kernel space mapped linearly on physical addresses, and that the corresponding physical address can be obtained by using an offset. Kernel virtual addresses are similar to logical addresses in that they are mappings from a kernel-space address to a physical address. However, the difference is that kernel virtual addresses do not always have the same linear, one-to-one mapping to physical locations as logical addresses do.
Note
You can convert a physical address into a logical address using the __pa(address) macro and the revert with the __va(address) macro.
The top 128 MB of the kernel address space is called a high-memory region. It is used by the kernel to temporarily map physical memory above 1 GB (above 896 MB in reality). When physical memory above 1 GB (or, more precisely, 896MB) needs to be accessed, the kernel uses that 128 MB to create a temporary mapping to its virtual address space, thus achieving the goal of being able to access all physical pages. You can define high memory as being memory for which logical addresses do not exist and which is not mapped permanently into a kernel address space. The physical memory above 896 MB is mapped on demand to the 128 MB of the HIGHMEM region.
Mapping to access high memory is created on the fly by the kernel and destroyed when done. This makes high memory access slower. However, the concept of high memory does not exist on 64-bit systems, due to the huge address range (264 TB), where the 3 GB/1 GB (or any similar split scheme) split does not make sense anymore.
On a Linux system, each process is represented in the kernel as an instance of struct task_struct (see include/linux/sched.h), which characterizes and describes this process. Before the process starts running, it is allocated a table of memory mapping, stored in a variable of the struct mm_struct type (see include/linux/mm_types.h). This can be verified by looking at the following excerpt of the struct task_struct definition, which embeds pointers to elements of the struct mm_struct type:
struct task_struct{
[…]
struct mm_struct *mm, *active_mm;
[…]
}
In the kernel, there is a global variable that always points to the current process, current, and the current->mm field points to the current process memory-mapping table. Before going further in our explanation, let's have a look at the following excerpt of a struct mm_struct data structure:
struct mm_struct {
struct vm_area_struct *mmap;
unsigned long mmap_base;
unsigned long task_size;
unsigned long highest_vm_end;
pgd_t * pgd;
atomic_t mm_users;
atomic_t mm_count;
atomic_long_t nr_ptes;
#if CONFIG_PGTABLE_LEVELS > 2
atomic_long_t nr_pmds;
#endif
int map_count;
spinlock_t page_table_lock;
unsigned long total_vm;
unsigned long locked_vm;
unsigned long pinned_vm;
unsigned long data_vm;
unsigned long exec_vm;
unsigned long stack_vm;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
/* ref to file /proc/<pid>/exe symlink points to */
struct file __rcu *exe_file;
};
I intentionally removed some fields we are not interested in. There are some fields we will talk about later: pgd for example, which is a pointer to the process's base (first entry) level one table (Page Global Directory, abbreviated PGD), written in the translation table base address of the CPU at context switching. For a better understanding of this data structure, we can use the following diagram:
From a process point of view, a memory mapping can be seen as nothing but a set of page table entries dedicated to a consecutive virtual address range. That "consecutive virtual address range" is referred to as a memory area, or a Virtual Memory Area (VMA). Each memory mapping is described by a start address and length, permissions (such as whether the program can read, write, or execute from that memory), and associated resources (such as physical pages, swap pages, and file contents).
mm_struct has two ways to store process regions (VMAs):
Now that we have had an overview of a process address space and have seen that it is made of a set of virtual memory regions, let's dive into the details and study the mechanisms behind these memory regions.
In the kernel, process memory mappings are organized into areas, each referred to as a VMA. For your information, in each running process on a Linux system, the code section, each mapped file region (a library, for example), or each distinct memory mapping (if any) is materialized by a VMA. A VMA is an architecture-independent structure, with permissions and access control flags, defined by a start address and a length. Their sizes are always a multiple of the page size (PAGE_SIZE). A VMA consists of a few pages, each of which has an entry in the page table (the Page Table Entry (PTE)).
A VMA is represented in the kernel as an instance of struct vma_area, defined as the following:
struct vm_area_struct {
unsigned long vm_start;
unsigned long vm_end;
struct vm_area_struct *vm_next, *vm_prev;
struct mm_struct *vm_mm;
pgprot_t vm_page_prot;
unsigned long vm_flags;
unsigned long vm_pgoff;
struct file * vm_file;
[...]
}
For the sake of readability and understandability of this section, only elements that are relevant for us have been listed. However, the following are the meanings of the remaining elements:
The following diagram is an overview of a process memory mapping, highlighting each VMA and describing some of its structure elements:
The preceding image (from http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory/) describes a process's (started from /bin/gonzo) memory mappings (VMAs). We can see interactions between struct task_struct and its address space element (mm), which then lists and describes each VMA (the start, the end, and the backing file).
You can use the find_vma() function to find the VMA that corresponds to a given virtual address. find_vma() is declared in linux/mm.h as the following:
extern struct vm_area_struct * find_vma(
struct mm_struct * mm, unsigned long addr);
This function searches and returns the first VMA that satisfies vm_start <= addr < vm_end or returns NULL if none is found. mm is the process address space to search in. For the current process, it can be current->mm. The following is an example:
struct vm_area_struct *vma =
find_vma(task->mm, 0x603000);
if (vma == NULL) /* Not found ? */
return -EFAULT;
/* Beyond the end of returned VMA ? */
if (0x13000 >= vma->vm_end)
return -EFAULT;
The preceding code excerpt will look for a VMA whose memory bounds contain 0x603000.
Given a process whose identifier is <PID>, the whole memory mappings of this process can be obtained by reading the /proc/<PID>/maps, /proc/<PID>/smaps, and /proc/<PID>/pagemap files. The following lists the mappings of a running process, whose Process Identifier (PID) is 1073:
# cat /proc/1073/maps
00400000-00403000 r-xp 00000000 b3:04 6438 /usr/sbin/net-listener
00602000-00603000 rw-p 00002000 b3:04 6438 /usr/sbin/net-listener
00603000-00624000 rw-p 00000000 00:00 0 [heap]
7f0eebe4d000-7f0eebe54000 r-xp 00000000 b3:04 11717 /usr/lib/libffi.so.6.0.4
7f0eebe54000-7f0eec054000 ---p 00007000 b3:04 11717 /usr/lib/libffi.so.6.0.4
7f0eec054000-7f0eec055000 rw-p 00007000 b3:04 11717 /usr/lib/libffi.so.6.0.4
7f0eec055000-7f0eec069000 r-xp 00000000 b3:04 21629 /lib/libresolv-2.22.so
7f0eec069000-7f0eec268000 ---p 00014000 b3:04 21629 /lib/libresolv-2.22.so
[...]
7f0eee1e7000-7f0eee1e8000 rw-s 00000000 00:12 12532 /dev/shm/sem.thk-mcp-231016-sema
[...]
Each line in the preceding excerpt represents a VMA, and the fields correspond to the {address (start-end)} {permissions} {offset} {device (major:minor)} {inode} {pathname (image)} pattern:
Each page allocated to a process belongs to an area, thus any page that does not live in the VMA does not exist and cannot be referenced by the process.
High memory is perfect for user space because its address space must be explicitly mapped. Thus, most high memory is consumed by user applications. __GFP_HIGHMEM and GFP_HIGHUSER are the flags for requesting the allocation of (potentially) high memory. Without these flags, all kernel allocations return only low memory. There is no way to allocate contiguous physical memory from user space in Linux.
Now that VMAs have no more secrets for us, let's describe the hardware concepts involved in the translation to their corresponding physical addresses, if any, or their creation and allocation otherwise.
MMU does not only convert virtual addresses into physical ones but also protects memory from unauthorized access. Given a process, any page that needs to be accessed from this process must exist in one of its VMAs and, thus, must live in the process's page table (every process has its own).
As a recall, memory is organized by chunks of fixed-size named pages for virtual memory and frames for physical memory. The size in our case is 4 KB. However, it is defined and accessible with the PAGE_SIZE macro in the kernel. Remember, however, that page size is imposed by the hardware. Considering a 4 KB page-sized system, bytes 0 to 4095 fall on page 0, bytes 4096 to 8191 fall on page 1, and so on.
The concept of a page table is introduced to manage mapping between pages and frames. Pages are spread over tables so that each PTE corresponds to a mapping between a page and a frame. Each process is then given a set of page tables to describe all of its memory regions.
To walk through pages, each page is assigned an index, called a page number. When it comes to a frame, it is a Page Frame Number (PFN). This way, VMAs (logical addresses, more precisely) are composed of two parts: a page number and an offset. On 32-bit systems, the offset represents the 12 less significant bits of the address, whereas 13 less significant bits represent it on 8 KB page-size systems. The following diagram highlights this concept of addresses split into a page number and an offset:
How does the OS or CPU know which physical address corresponds to a given logical address? They use a page table as a translation table and know that each entry's index is a virtual page number, and the value at this index is the PFN. To access physical memory given a virtual memory, the OS first extracts the offset, the virtual page number, and then walks through the process's page tables to match the virtual page number to the physical page. Once a match occurs, it is then possible to access data in that page frame:
The offset is used to point to the right location in the frame. A page table not only holds mapping between physical and virtual page numbers but also accesses control information (read/write access, privileges, and so on).
The following diagram describes address decoding and page table lookup to point to the appropriate location in the appropriate frame:
The number of bits used to represent the offset is defined by the PAGE_SHIFT kernel macro. PAGE_SHIFT is the number of times needed to left-shift 1 bit to obtain the PAGE_SIZE value. It is also the number of times needed to right-shift a page's logical address to obtain its page number, which is the same for a physical address to obtain its page frame number. This macro is architecture-dependent and also depends on the page granularity. Its value could be considered as the following:
#ifdef CONFIG_ARM64_64K_PAGES
#define PAGE_SHIFT 16
#elif defined(CONFIG_ARM64_16K_PAGES)
#define PAGE_SHIFT 14
#else
#define PAGE_SHIFT 12
#endif
#define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
The preceding states that by default (whether on ARM or ARM64), PAGE_SHIFT is 12, which means a 4 KB page size. On ARM64, it 14 or 16 when respectively a 16 KB or 64 KB page size is chosen.
With our understanding of address translation, the page table is a partial solution. Let's see why. Most 32-bit architectures require 32 bits (4 bytes) to represent a page table entry. On such systems (32-bit) where each process has its private 3 GB user address space, we need 786,432 entries to characterize and cover a process's address space. It represents too much physical memory spent per process just to store the memory mappings. In fact, a process generally uses a small but scattered portion of its virtual address space. To resolve that issue, the concept of a "level" was introduced. Page tables are hierarchized by level (page level). The space necessary to store a multi-level page table only depends on the virtual address space actually in use, instead of being proportional to the maximum size of the virtual address space. This way, unused memory is no longer represented, and the page table walk-through time is reduced. Moreover, each table entry in level N will point to an entry in the table of level N+1, level 1 being the higher level.
Linux can support up to four levels of paging. However, the number of levels to use is architecture-dependent. The following are descriptions of each lever:
Note
All levels are not always used. The i.MX6's MMU only supports a two-level page table (PGD and PTE), which is the case for almost all 32-bit CPUs. In this case, PUD and PMD are simply ignored.
It is important to know that the MMU does not store any mapping. It is a data structure located in RAM. Instead, there is a special register in the CPU, called the Page Table Base Register (PTBR) or the Translation Table Base Register 0 (TTBR0), which points to the base (entry 0) of the level-1 (top-level) page table (PGD) of the process. It is exactly where the pdg field of struct mm_struct points: current->mm.pgd == TTBR0.
At context switch (when a new process is scheduled and given the CPU), the kernel immediately configures the MMU and updates the PTBR with the new process's pgd. Now, when a virtual address is given to MMU, it uses the PTBR's content to locate the process's level-1 page table (PGD), and then it uses the level-1 index, extracted from the Most-Significant Bits (MSBs) of the virtual address to find the appropriate table entry, which contains a pointer to the base address of the appropriate level-2 page table. Then, from that base address, it uses the level-2 index to find the appropriate entry and so on, until it reaches the PTE. ARM architecture (i.MX6, in our case) has a two-level page table. In this case, the level-2 entry is a PTE and points to the physical page (PFN). Only the physical page is found at this step. To access the exact memory location in the page, the MMU extracts the memory offset, also part of the virtual address, and points to the same offset in the physical page.
For the sake of understandability, the preceding description has been limited to a two-level paging scheme but can easily extended. The following diagram is a representation of this two-level paging scheme:
When a process needs to read from or write into a memory location (of course, we are talking about virtual memory), the MMU performs a translation into that process's page table to find the right entry (PTE). The virtual page number is extracted (from the virtual address) and used by the processor as an index into the process's page table to retrieve its page table entry. If there is a valid page table entry at that offset, the processor takes the page frame number from this entry. If not, it means the process accessed an unmapped area of its virtual memory. A page fault is then raised, and the OS should handle it.
In the real world, address translation requires a page-table walk, and it is not always a one-shot operation. There are at least as many memory accesses as there are table levels. A four-level page table would require four memory accesses. In other words, every virtual access would result in five physical memory accesses. The virtual memory concept would be useless if its access were four times slower than physical access. Fortunately, System-on-Chip (SoC) manufacturers worked hard to find a clever trick to address this performance issue: modern CPUs use a small associative and very fast memory called the Translation Lookaside Buffer (TLB), in order to cache the PTEs of recently accessed virtual pages.
Before the MMU proceeds to address translation, there is another step involved. As there is a cache for recently accessed data, there is also a cache for recently translated addresses. As data cache speeds up the data accessing process, the TLB speeds up virtual address translation (yes, address translation is a time-consuming task). It is a Content-Addressable Memory (CAM), where the key is the virtual address and the value is the physical address. In other words, the TLB is a cache for the MMU. At each memory access, the MMU first checks for recently used pages in the TLB, which contains a few of the virtual address ranges to which physical pages are currently assigned.
On memory access, the CPU walks through the TLB trying to find the virtual page number of the page that is being accessed. This step is called a TLB lookup. When a TLB entry is found (a match occurs), it is called TLB hit, and the CPU just keeps running and uses the PFN found in the TLB entry to calculate the target physical address. There is no page fault when a TLB hit occurs. If a translation can be found in the TLB, virtual memory access will be as fast as physical access. If there is no TLB hit, it is called TLB miss.
On a TLB miss, there are two possibilities. Depending on the processor type, the TLB miss event can be handled by the software, the hardware, or through the MMU:
In both cases, the page fault handler is the same, do_page_fault(). This function is architecture-dependent; for ARM, it is defined in arch/arm/mm/fault.c.
The following is a diagram describing a TLB lookup, a TLB hit, or a TLB miss event:
Page table and page directory entries are architecture-dependent. An OS must make sure the structure of a table corresponds to a structure recognized by the MMU. On ARM processors, the location of the translation table must be written in the control coprocessor 15 (CP15) c2 register, and then enable the caches and the MMU by writing to the CP15 c1 register. Have a look at both http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0056d/BABHJIBH.htm and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0433c/CIHFDBEJ.html for detailed information.
Now that we are comfortable with the address translation schemes and their ease with the TLB, we can talk about memory allocation, which involves manipulating page entries under the hood.
Dealing with memory allocation mechanisms and their APIs
Before jumping to the list of APIs, let's start with the following figure, showing the different memory allocators that exist on a Linux-based system, which we will discuss later:
The preceding diagram is inspired by https://bootlin.com/doc/training/linux-kernel/linux-kernel-slides.pdf. What it shows is that there is an allocation mechanism to satisfy any kind of memory request. Depending on what you need memory for, you can choose the one closest to your goal. The main and lowest level allocator is the page allocator, which allocates memory in units of pages (a page being the smallest memory unit it can deliver). Then comes the slab allocator, which is built on top of the page allocator, getting pages from it and splitting them into smaller memory entities (by means of slabs and caches). This is the allocator on which the kmalloc API relies. While kmalloc can be used to request memory from the slab allocator, we can directly talk to the slab to request memory from its caches, or even build our own caches.
Let's start this memory allocation journey with the main and lowest level allocator, the page allocator, from which the others derivate.
The page allocator is the low-level allocator on the Linux system, the one that serves as a basis for other allocators. This allocator brings with it the concept of the page (virtual) and page frame (physical). So, the system's physical memory is split into fixed-size blocks (called page frames), while virtual memory is organized into fixed-size blocks called pages. Page size is always the same as a page frame size. As this is the lowest-level allocator, a page is the smallest unit of memory that the OS will give to any memory request at a low level. In the Linux kernel, a page is represented as an instance of the struct page structure, which we will manipulate using dedicated APIs, introduced in the next section.
This is the lowest-level allocator. It allocates and deallocates blocks of pages using the buddy algorithm. Pages are allocated in blocks that are to the power of 2 in size (to get the best from the buddy algorithm). That means it can allocate a block of 1 page, 2 pages, 4 pages, 8, 16, and so on. Pages returned from this allocation are physically contiguous. alloc_pages() is the main API and is defined as the following:
struct page *alloc_pages(gfp_t mask, unsigned int order)
The preceding function returns NULL when no page can be allocated. Otherwise, it allocates 2order pages and returns a pointer to an instance of struct page, which points the first page of the reserved block. There is, however, a helper macro, alloc_page(), which can be used to allocate a single page. The following is its definition:
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
This macro wraps alloc_pages() with an order parameter set with 0.
__free_pages() must be used to release memory pages allocated with the alloc_pages() function. It takes a pointer to the first page of the allocated block as a parameter, along with the order, the same that was used for allocation. It is defined as the following:
void __free_pages(struct page *page, unsigned int order);
There are other functions working in the same way, but instead of an instance of struct page, they return the (logical) address of the reserved block. These are __get_free_pages() and __get_free_page(), and the following are their definitions:
unsigned long __get_free_pages(gfp_t mask,
unsigned int order);
unsigned long get_zeroed_page(gfp_t mask);
free_pages() is used to free a page allocated with __get_free_pages(). It takes the kernel address representing the start region of allocated page(s), along with the order, which should be the same as that used for allocation:
free_pages(unsigned long addr, unsigned int order);
Whatever the allocation type is, mask specifies the memory zones from where the pages should be allocated and the behavior of the allocators. The following are possible values:
However, you should note that whether specifying the GFP_HIGHMEM flag with __get_free_pages() (or __get_free_page()) or not, it won't be considered. This flag is masked out in these functions to make sure that the returned address never represents high-memory pages (because of their non-linear/permanent mapping). If you need high memory, use alloc_pages() and then kmap() to access it.
__free_pages() and free_pages() can be mixed. The main difference between them is that free_page() takes a logical address as a parameter, whereas __free_page() takes a struct page structure.
Note
The maximum order that can be used varies between architectures. It depends on the FORCE_MAX_ZONEORDER kernel configuration option, which is 11 by default. In this case, the number of pages you can allocate is 1,024. It means that on a 4 KB-sized system, you can allocate up to 1,024 x 4 KB = 4 MB at maximum. On ARM64, the maximum order varies with the selected page size. If it is a 16 KB page size, the maximum order is 12, and if it is a 64 KB page size, the maximum order is 14. These size limitations per allocation are valid for kmalloc() as well.
There are convenient functions exposed by the kernel to switch back and forth between the struct page instances and their corresponding logical addresses, which can be useful at different moments while dealing with memory. The page_to_virt() function is used to convert a struct page (as returned by alloc_pages(), for example) into a kernel logical address. Alternatively, virt_to_page() takes a kernel logical address and returns its associated struct page instance (as if it was allocated using the alloc_pages() function). Both virt_to_page() and page_to_virt() are declared in <asm/page.h> as the following:
struct page *virt_to_page(void *kaddr);
void *page_to_virt(struct page *pg)
There is another macro, page_address(), which simply wraps page_to_virt() and which is declared as the following:
void *page_address(const struct page *page)
It returns the logical address of the page passed in the parameter.
The slab allocator is the one on which kmalloc() relies. Its main purposes are to eliminate fragmentation caused by memory (de)allocation, which is caused by the buddy system in case of small-size memory allocation, and to speed up memory allocation for commonly used objects.
To allocate memory, the requested size is rounded up to the power of two, and the buddy allocator searches the appropriate list. If no entries exist on the requested list, an entry from the next upper list (which has blocks of twice the size of the previous list) is split into two halves (called buddies). The allocator uses the first half, while the other is added to the next list down. This is a recursive approach, which stops when either the buddy allocator successfully finds a block that can be split or reaches the largest block size and there are no free blocks available.
The following case study is heavily inspired by http://dysphoria.net/OperatingSystems1/4_allocation_buddy_system.html. For example, if the minimum allocation size is 1K bytes, and the memory size is 1 MB, the buddy allocator will create an empty list for 1K byte holes, an empty list for 2K byte holes, one for 4K byte holes, 8K, 16K, 32K, 64K, 128K, 256K, 512K, and one list for 1 MB holes. All of them are initially empty, except for the 1 MB list, which has only one hole.
Let's now imagine a scenario where we want to allocate a 70K block. The buddy allocator will round it up to 128K and will end up splitting the 1 MB into two 512K blocks, then 256K, and finally 128K, and then it will allocate one of the 128K blocks to the user. The following are schemes that summarize this scenario:
The deallocation is as fast as allocation. The following is a figure that summarizes the deallocation algorithm:
In the preceding diagram, we can see that memory deallocation using the buddy algorithm works. In the next section, we will study the slab allocator, built on top of this algorithm.
Before we introduce the slab allocator, let's define some terms that it uses:
Slabs may be in one of the following states:
The memory allocator is responsible for building caches. Initially, each slab is empty and marked so. When one allocates memory for a kernel object, the allocator looks for a free location for that object on a partial/free slab in a cache for that type of object. If not found, the allocator allocates a new slab and adds it to the cache. The new object gets allocated from this slab, and the slab is marked as partial. When the code is done with the memory (memory-freed), the object is simply returned to the slab cache in its initialized state. This is the reason why the kernel also provides helper functions to obtain zeroed initialized memory, which allows us to get rid of previous content. The slab keeps a reference count of how many of its objects are being used so that when all slabs in a cache are full and another object is requested, the slab allocator is responsible for adding new slabs.
The following diagram illustrates the concept of slabs, caches, and their different states:
It is a bit like creating a per-object allocator. The kernel allocates one cache per type of object, and only objects of the same type can be stored in a cache (for example, only task_struct structures).
There are different kinds of slab allocators in the kernel, depending on whether one needs compactness, cache-friendliness, or raw speed. These consist of the following:
Note
The term slab has become a generic name referring to a memory allocation strategy employing an object cache, enabling efficient allocation and deallocation of kernel objects. It must not be confused with the allocator of the same name, SLAB, which nowadays has been replaced by SLUB.
kmalloc() is a kernel memory allocation function. It allocates physically contiguous (but not necessarily page-aligned) memory. The following image describes how memory is allocated and returned to the caller:
This allocation API is the general and highest-level memory allocation API in the kernel, which relies on the SLAB allocator. Memory returned from kmalloc() has a kernel logical address because it is allocated from the LOW_MEM region unless HIGH_MEM is specified. It is declared in <linux/slab.h>, which is the header to include before using the API. It is defined as follows:
void *kmalloc(size_t size, int flags);
In the preceding code, size specifies the size of the memory to be allocated (in bytes). flags determines how and where memory should be allocated. Available flags are the same as the page allocator (GFP_KERNEL, GFP_ATOMIC, GFP_DMA, and so on) and the following are their definitions:
On successful allocation of memory, kmalloc() returns the virtual (logical, unless high memory is specified) address of the chunk allocated, guaranteed to be physically contiguous. On an error, it returns NULL.
For a device driver, however, it is recommended to use the managed version, devm_kmalloc(), which does not necessarily require freeing the memory, as it is handled internally by the memory core. The following is its prototype:
void *devm_kmalloc(struct device *dev, size_t size,
gfp_t gfp);
In the preceding prototype, dev is the device for which memory is allocated.
Note that kmalloc() relies on SLAB caches when allocating a small size of memory. For this reason, it can internally round the allocated area size up to the size of the smallest SLAB cache in which that memory can fit. This can result in returning more memory than requested. However, ksize() can be used to determine the actual amount (the size in bytes) of memory allocated. You can even use this additional memory, even though a smaller amount of memory was initially specified with the kmalloc() call.
The following is the ksize prototype:
size_t ksize(const void *objp);
In the preceding, objp is the object whose real size in bytes will be returned.
kmalloc() has the same size limitations as the page-related allocation API. For example, with the default FORCE_MAX_ZONEORDER set to 11, the maximum size per allocation with kmalloc() is 4 MB.
kfree function is used to free the memory allocated by kmalloc(). It is defined as the following:
void kfree(const void *ptr)
The following is an example of allocating and freeing memory using kmalloc() and kfree() respectively:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/mm.h>
static void *ptr;
static int alloc_init(void)
{
size_t size = 1024; /* allocate 1024 bytes */
ptr = kmalloc(size,GFP_KERNEL);
if(!ptr) {
/* handle error */
pr_err("memory allocation failed ");
return -ENOMEM;
} else {
pr_info("Memory allocated successfully ");
}
return 0;
}
static void alloc_exit(void)
{
kfree(ptr);
pr_info("Memory freed ");
}
module_init(alloc_init);
module_exit(alloc_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("John Madieu");
The kernel provides other helpers based on kmalloc()as follows:
void kzalloc(size_t size, gfp_t flags);
void kzfree(const void *p);
void *kcalloc(size_t n, size_t size, gfp_t flags);
void *krealloc(const void *p, size_t new_size,
gfp_t flags);
krealloc() is the kernel equivalent of user space realloc() function. Because memory returned by kmalloc() retains the contents from its previous incarnation, you can request a zeroed kmalloc-allocated memory using kzalloc(). kzfree() is the freeing function for kzalloc(), whereas kcalloc() allocates memory for an array, and its n and size parameters respectively represent the number of elements in the array and the size of an element.
Since kmalloc() returns a memory area in the kernel permanent mapping, the logical address can be translated into a physical address using virt_to_phys(), or to a I/O bus address using virt_to_bus(). These macros internally call either __pa() or __va() if necessary. The physical address (virt_to_phys(kmalloc'ed address)), downshifted by PAGE_SHIFT, will produce a PFN (pfn) of the first page from which the chunk is allocated.
vmalloc() is the last kernel allocator we will discuss in the book. It returns memory that is exclusively contiguous in the virtual address space. The underlying frames are scattered, as we can see in the following diagram:
In the preceding diagram, we can see that memory is not physically contiguous. Moreover, memory returned by vmalloc() always comes from the HIGH_MEM zone. Addresses returned are purely virtual (not logical) and cannot be translated into physical ones or bus addresses because there is no guarantee that the backing memory is physically contiguous. It means that memory returned by vmalloc() can't be used outside of the microprocessor (you cannot easily use it for a DMA purpose). It is correct to use vmalloc() to allocate memory for a large sequence of pages (it does not make sense to use it to allocate one page, for example) that exists only in software, such as a network buffer. It is important to note that vmalloc() is slower than kmalloc() and page allocator functions because it must both retrieve the memory and build the page tables, or even remap into a virtually contiguous range, whereas kmalloc() never does that.
Before using the vmalloc() API, you should include this header:
#include <linux/vmalloc.h>
The following are the vmalloc family prototypes:
void *vmalloc(unsigned long size);
void *vzalloc(unsigned long size);
void vfree(void *addr);
In the preceding prototypes, argument size is the size of memory you need to allocate. Upon successful allocation of memory, it returns the address of the first byte of the allocated memory block. On failure, it returns NULL. vfree() does the reverse and releases the memory allocated by vmalloc(). The vzalloc variant returns zeroed initialized memory.
The following is an example of using vmalloc:
#include<linux/init.h>
#include<linux/module.h>
#include <linux/vmalloc.h>
Static void *ptr;
static int alloc_init(void)
{
unsigned long size = 8192; /* 2 x 4KB */
ptr = vmalloc(size);
if(!ptr)
{
/* handle error */
pr_err("memory allocation failed ");
return -ENOMEM;
} else {
pr_info("Memory allocated successfully ");
}
return 0;
}
static void my_vmalloc_exit(void)
{
vfree(ptr);
pr_info("Memory freed ");
}
module_init(my_vmalloc_init);
module_exit(my_vmalloc_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("john Madieu, [email protected]");
vmalloc() will allocate non-contiguous physical pages and map them to a contiguous virtual address region. These vmalloc virtual addresses are limited in an area of kernel space, delimited by VMALLOC_START and VMALLOC_END, which are architecture-dependent. The kernel exposes /proc/vmallocinfo to display all vmalloc-allocated memory on the system.
vmalloc() prefers the HIGH_MEM zone if it exists, which is suitable for processes, as they require implicit and dynamic mappings. However, because memory is a limited resource, the kernel will report the allocation of frame pages (physical pages) until necessary (when accessed, either by reading or writing). This on-demand allocation is called lazy allocation, eliminating the risk of allocating pages that will never be used.
Whenever a page is requested, only the page table is updated; in most cases, a new entry is created, which means only virtual memory is allocated. An interrupt called page fault is raised only when a user accesses the page. This interrupt has a dedicated handler, called page fault handler, and is called by the MMU in response to an attempt to access virtual memory, which did not immediately succeed.
In fact, a page fault interrupt is raised whatever the access type is (read, write, or execute) to a page whose entry in the page table has not got the appropriate permission bits set to allow that type of access. The response to that interrupt falls in one of the following three ways:
To summarize, memory mappings generally start out with no physical pages attached, only by defining the virtual address ranges without any associated physical memory. The actual physical memory is allocated later in response to a page fault exception when the memory is accessed, since the kernel provides some flags to determine whether the attempted access was legal and specifies the behavior of the page fault handler. Thus, the brk() user space, mmap(), and similar allocate (virtual) space, but physical memory is attached later.
Note
A page fault occurring in interrupt context causes a double fault interrupt, which usually panics the kernel (calling the panic() function). It is the reason why memory allocated in interrupt context is taken from a memory pool, which does not raise a page fault interrupt. If an interrupt occurs when a double fault is being handled, a triple fault exception is generated, causing the CPU to shut down and the OS to immediately reboot. This behavior is architecture-dependent.
Let's consider a memory region or data that needs to be shared by two or more tasks. The Copy on Write (CoW) (heavily used with fork() system call) is a mechanism that allows the operating system not to immediately allocate memory and does not make a copy of it to each task that shares this data, until one of these tasks modifies (writes into) it – in this case, memory is allocated for its private copy (hence the name, CoW). Let's consider a shared memory page to describe in in the following how the page fault handler manages CoW:
We can now close our parenthesis on process memory allocation, with which we are now familiar. We have also learned the lazy allocation mechanism and what CoW is. We can also conclude our learning about in-kernel memory allocation. At this point, we can switch to I/O memory operations to talk with hardware devices.
So far, we have dealt with main memory, and we used to think of memory in terms of RAM. That said, RAM one a peripheral among many others, and its memory range corresponds to its size. RAM is unique in the way it is entirely managed by the kernel, transparently for users. The RAM controller is connected to the CPU data/control/address buses, which it shares with other devices. These devices are referred to as memory-mapped devices because of their locality regarding those buses, and communication (input/output operations) with those devices is called memory-mapped I/O. These devices include controllers for various buses provided by the CPU (USB, UART, SPI, I2C, PCI, and SATA), but also IPs such as VPU, GPU, Image Processing Unit (IPU), and Secure Non-Volatile Store (SNVS, a feature in i.MX chips from NXP).
On a 32-bit system, the CPU has up to 232 choices of memory locations (from 0 to 0xFFFFFFFF). The thing is that not all those addresses address RAM. Some of these are reserved for peripheral access and are called I/O memory. This I/O memory is split into ranges of various sizes and assigned to those peripherals so that whenever the CPU receives a physical memory access request from the kernel, it can route it to the device whose address range contains the specified physical address. The address range assigned to each device (including the RAM controller) is described in the SoC data sheet, in a section called memory map most of the time.
Since the kernel exclusively works with virtual addresses (through page tables), accessing a particular address for any device would require this address to be mapped first (this is even more true if there is an IOMMU, the MMU equivalent for I/O devices). This mapping of memory addresses other than RAM modules causes a classic hole in the system address space (because address space is shared between memory and I/O).
The following diagram describes how I/O memory and main memory are seen by the CPU:
Note
Always keep in mind that the CPU sees main memory (RAM) through the lenses of the MMU and devices through the lenses of IOMMU.
The main advantage with that is that the same instructions are used for transferring data to memory and I/O, which reduces software coding logic. There are some disadvantages, however. The first one is that the entire address bus must be fully decoded for every device, which increases the cost of adding hardware to the machine, leading to complex architecture.
The other inconvenience is that on a 32-bit system, even with 4 GB of RAM installed, the OS will never use the whole size because of the hole caused by memory-mapped devices, which stole part of the address space. x86 architectures adopted another approach called Port Input Output (PIO), with which registers are accessible via a dedicated bus by means of specific instructions (in and out, in an assembler, generally). In this case, device registers are not memory-mapped, and the system can address the whole address range for the RAM.
On a system where PIO is used, I/O devices are mapped into a separate address space. This is usually accomplished by having a different set of signal lines to indicate memory access versus device access. Such systems have two different address spaces, one for system memory, which we already discussed, and the other one for I/O ports, sometimes referred to as port address space and limited to 65,536 ports. This is an old method and very uncommon nowadays.
The kernel exports a few functions (symbols) to handle I/O ports. Prior to accessing any port regions, we must first inform the kernel that we are using a range of ports using the request_region() function, which will return NULL on error. Once done with the region, we must call release_region(). These are both declared in linux/ioport.h as the following:
struct ressource *request_region(unsigned long start,
unsigned long len, char *name);
void release_region(unsigned long start,
unsigned long len);
These are politeness functions that inform the kernel about your intention to make use/release of a region of len ports, starting from start. The name parameter should be set with the name of the device or a meaningful one. Their use is not mandatory, however. It prevents two or more drivers from referencing the same range of ports. You can consult the ports currently in use on the system by reading the content of the /proc/ioports files.
After the region reservation has succeeded, the following APIs can be used to access the ports:
u8 inb(unsigned long addr)
u16 inw(unsigned long addr)
u32 inl(unsigned long addr)
The preceding functions respectively read 8, 16, or 32-bit-sized (wide) data from the addr ports. The write variants are defined as the following:
void outb(u8 b, unsigned long addr)
void outw(u16 b, unsigned long addr)
void outl(u32 b, unsigned long addr)
The preceding functions write b data, which can be 8, 16, or 32-bit-sized, into the addr port.
The fact that PIO uses a different set of instructions to access the I/O ports or MMIO is a disadvantage, as it requires more instructions than normal memory to accomplish the same task. For instance, 1-bit testing has only one instruction in MMIO, whereas PIO requires reading the data into a register before testing the bit, which is more than one instruction. One of the advantages of PIO is that it requires less logic to decode addresses, lowering the cost of adding hardware devices.
Main memory addresses reside in the same address space as MMIO addresses. The kernel maps the device registers to a portion of the address space that would ordinarily be utilized by RAM so that instead of system memory (that is, RAM), I/O device registration takes place. As a result, talking with an I/O device is analogous to reading and writing to memory addresses dedicated to that device.
If we need to access, let's say, the 4 MB of I/O memory assigned to IPU-2 (from 0x02400000 to 0x027fffff), the CPU (by means of the IOMMU) can assign to us the 0x10000000 to 0x103FFFFF addresses, which are virtual of course. This is not consuming physical RAM (except for building and storing page table entry), just address space (do you see now why 32-bit systems run into issues with expansion cards such as high-end GPUs that have GB of RAM?), meaning that the kernel will no longer use this virtual memory range to map RAM. Now, a memory write/read to, say, 0x10000004 will be routed to the IPU-2 device. This is the basic premise of memory-mapped I/O.
Like PIO, there are MMIO functions to inform the kernel about our intention to use a memory region. Remember that this information is a pure reservation only. These are request_mem_region() and release_mem_region(), defined as the following:
struct ressource* request_mem_region(unsigned long start,
unsigned long len, char *name)
void release_mem_region(unsigned long start,
unsigned long len)
These are only politeness functions, though the former builds and returns an appropriate resource structure, corresponding to the start and length of the memory region, while the latter releases it.
For the device driver, however, it is recommended to use the managed variant, as it simplifies the code and takes care of releasing the resource. This managed version is defined as the following:
struct ressource* devm_request_region(
struct device *dev, resource_size_t start,
resource_size_t n, const char *name);
In the preceding, dev is the device owning the memory region, and the other parameters are the same as the non-managed version. Upon successful request, the memory region will be visible in /proc/iomem, which is a file that contains memory regions in use on the system.
Prior to accessing a memory region (and after you successfully request it), the region must be mapped into kernel address space by calling special architecture-dependent functions (which make use of IOMMU to build a page table and thus cannot be called from an interrupt handler). These are ioremap() and iounmap(), which handle cache coherency as well. The followings are their definitions:
void __iomem *ioremap(unsigned long phys_addr,
unsigned long size);
void iounmap(void __iomem *addr);
In the preceding functions, phys_addr corresponds to the device's physical address as specified in the device tree or in the board file. size corresponds to the size of the region to map. ioremap() returns a __iomem void pointer to the start of the mapped region. Once again, it is recommended to use the managed version, which has the following definition:
void __iomem *devm_ioremap(struct device *dev,
resource_size_t offset,
resource_size_t size);
Note
ioremap() builds new page tables, just as vmalloc() does. However, it does not actually allocate any memory but instead returns a special virtual address that can be used to access the specified I/O address. On 32-bit systems, the fact that MMIO steals physical memory address space to create a mapping for memory-mapped I/O devices is a disadvantage, since it prevents the system from using the stolen memory for general RAM purposes.
Because the mapping APIs are architecture-dependent, you should not deference (that is, getting/setting their value by reading/writing to the pointer) such pointers, even though on some architectures you can. The kernel provides portable functions to access memory-mapped regions. These are the following:
unsigned int ioread8(void __iomem *addr);
unsigned int ioread16(void __iomem *addr);
unsigned int ioread32(void __iomem *addr);
void iowrite8(u8 value, void __iomem *addr);
void iowrite16(u16 value, void __iomem *addr);
void iowrite32(u32 value, void __iomem *addr);
The preceding functions respectively read and write 8-, 16-, and 32-bit values.
Note
__iomem is a kernel cookie used by Sparse, a semantic checker used by the kernel to find possible coding faults. It prevents mixing normal pointer use (such as dereference) with I/O memory pointers.
In this section, we have learned how to map memory-mapped device memory into kernel address space to access its registers using dedicated APIs. This will serve in driving in-chip devices.
Kernel memory sometimes needs to be remapped, either from kernel to user space, or from high memory to a low memory region (from kernel to kernel space). The common case is remapping the kernel memory to user space, but there are other cases, such as when we need to access high memory.
The Linux kernel permanently maps 896 MB of its address space to the lower 896 MB of the physical memory (low memory). On a 4 GB system, there is only 128 MB left to the kernel to map the remaining 3.2 GB of physical memory (high memory). However, low memory is directly addressable by the kernel because of the permanent and one-to-one mapping. When it comes to high memory (memory preceding 896 MB), the kernel has to map the requested region of high memory into its address space, and the 128 MB mentioned previously is especially reserved for this. The function used to perform this trick is kmap(). The kmap() function is used to map a given page into the kernel address space.
void *kmap(struct page *page);
page is a pointer to the struct page structure to map. When a high memory page is allocated, it is not directly addressable. kmap() is the function we call to temporarily map high memory into the kernel address space. The mapping will last until kunmap() is called:
void kunmap(struct page *page);
By temporarily, I mean the mapping should be undone as soon as it is no longer needed. A best programming practice is to unmap high memory mapping when it is no longer required.
This function works on both high and low memory. However, if a page structure resides in low memory, then just the virtual address of the page is returned (because low-memory pages already have permanent mappings). If the page belongs to high memory, a permanent mapping is created in the kernel's page tables, and the address is returned:
void *kmap(struct page *page)
{
BUG_ON(in_interrupt());
if (!PageHighMem(page))
return page_address(page);
return kmap_high(page);
}
kmap_high() and kunmap_high(), which are defined in mm/highmem.c, are at the heart of these implementations. However, kmap() maps pages into kernel space using a physically contiguous set of page tables allocated during the boot. Because the page tables are all connected, it's simple to move around without having to consult the page directory all the time. You should note that the kmap page tables correspond to kernel virtual addresses beginning with PKMAP BASE, which differs per architecture, and the reference count for its page table entries is kept in a separate array called pkmap_count.
The page frame of the page to map into kernel space is passed to kmap() as a struct *page argument, and this can be a regular or HIGHMEM page; in the first case, kmap() simply returns the direct-mapped address. For HIGHMEM pages, kmap() searches through the kmap page tables (which were allocated at boot time) for an unused entry – that is, an entry whose pkmap_count value is zero. If there are none, it goes to sleep and waits for another process to kunmap a page. When it finds an unused one, it inserts the physical page address of the page we want to map, incrementing at the same time the pkmap_count reference count corresponding to the page table entry, and returns the virtual address to the caller. The page->virtual for the page struct is also updated to reflect the mapped address.
kunmap() expects a struct page* representing the page to unmap. It finds the pkmap_count entry for the page's virtual address and decrements it.
Mapping physical addresses is one of the most common operations, especially in embedded systems. Sometimes, you may want to share part of the kernel memory with user space. As mentioned earlier, the CPU runs in unprivileged mode when running in user space. To let a process access a kernel memory region, we need to remap that region into the process address space.
remap_pfn_range() maps physical contiguous memory into a process address space by means of a VMA. It is particularly useful for implementing the mmap file operation, which is the backend of the mmap() system call.
After invoking the mmap() system call on a file descriptor (a device-backed file or not) given a region start and length, the CPU will switch to privileged mode. An initial kernel code will create an almost empty VMA as large as the requested mapping region and will run the corresponding file_operations.mmap callback, giving the VMA as a parameter. In turn, this callback should call remap_pfn_range(). This function will update the VMA and will derive the kernel's PTE of the mapped region before adding it to the process's page table, with different protection flags of course. The process's VMA list will be updated with the insertion of the VMA entry (with appropriate attributes), which will use the derived PTE to access the same memory. This way, the kernel and user space will both point to the same physical memory region, each through their own page tables but with different protection flags. Thus, instead of wasting memory and CPU cycles by copying, the kernel just duplicates the PTEs, each with their own attributes.
remap_pfn_range() is defined as the following:
int remap_pfn_range(struct vm_area_struct *vma,
unsigned long addr,
unsigned long pfn,
unsigned long size, pgprot_t flags);
A successful call will return 0, and a negative error code is returned on failure. Most of this function's arguments are provided when the mmap() system call is invoked. The following are their descriptions:
unsigned long pfn =
virt_to_phys((void *)kmalloc_area)>>PAGE_SHIFT;
unsigned long pfn = page_to_pfn(page)
unsigned long pfn = vmalloc_to_pfn(vmalloc_area);
Memory mapping works with memory regions that are multiples of PAGE_SIZE, so, for example, you should allocate an entire page instead of using a kmalloc-allocated buffer. kmalloc() can return (if requesting a non-multiple size of PAGE_SIZE) a pointer that isn't page-aligned, and in that case, it is a terribly bad idea to use such an unaligned address with remap_pfn_range(). Nothing will guarantee that the kmalloc()-returned address will be page-aligned, so you might corrupt slab internal data structures. Instead, you should be using kmalloc(PAGE_SIZE * npages or, even better, a page allocation API (or something similar because these functions always return a pointer that is page-aligned).
If your baking object (a file or device) supports an offset, then the VMA offset (the offset into the object where the mapping must start) should be considered to produce the PFN where mapping must start. vma->vm_pgoff will contain this offset (if specified by user space in the mmap()) value in units of the number of pages. The final PFN computation (or the position from where the mapping must start) will look like the following:
unsigned long pos
unsigned long off = vma->vm_pgoff;
/*compute the initial PFN according to the memory area */
[...]
/* Then compute the final position */
pos = pfn + off
[...]
return remap_pfn_range(vma, vma->vm_start,
pos, vma->vm_end - vma->vm_start,
vma->vm_page_prot);
In the preceding excerpt, the offset (specified in term of number of pages) has been included in the final position computation. This offset can, however, be ignored if the driver implementation does need its support.
Note
The offset can be used differently, by left-shifting PAGE_SIZE to obtain the offset by the number of bytes (offset = vma->vm_pgoff << PAGE_SHIFT), and then adding this offset to the memory start address before computing the final PFN (pfn = virt_to_phys(kmalloc_area + offset) >> PAGE_SHIFT).
Note that memory allocated with vmalloc() is not physically contiguous, so if you need to map a memory region allocated with vmalloc(), you must map each page individually and compute the physical address for each page. This can be achieved by looping over all pages in that vmalloc-allocated memory region and calling remap_pfn_range() as follows:
while (length > 0) {
pfn = vmalloc_to_pfn(vmalloc_area_ptr);
if ((ret = remap_pfn_range(vma, start, pfn,
PAGE_SIZE, PAGE_SHARED)) < 0) {
return ret;
}
start += PAGE_SIZE;
vmalloc_area_ptr += PAGE_SIZE;
length -= PAGE_SIZE;
}
In the preceding excerpt, length corresponds to the VMA size (length = vma->vm_end - vma->vm_start). pfn is computed for each page, and the starting address for the next mapping is incremented by PAGE_SIZE to map the next page in the region. The initial value of start is start = vma->vm_start.
That said, from within the kernel, the vmalloc-allocated memory can be used normally. The paginated use is necessary for remapping purposes only.
Remapping the I/O memory requires a device's physical addresses, as specified in the device tree or the board file. In this case, for portability reasons, the appropriate function to use is io_remap_pfn_range(), whose parameters are the same as remap_pfn_range(). The only thing that changes is where the PFN comes from. Its prototype looks like the following:
int io_remap_page_range(struct vm_area_struct *vma,
unsigned long start,
unsigned long phys_pfn,
unsigned long size, pgprot_t flags);
In the preceding function, vma and start have the same meanings as remap_pfn_range(). phys_pfn is different, however, in the way it is obtained; it must correspond to the physical I/O memory address, as it will have been given to ioremap(), right-shifted PAGE_SHIFT times.
There is, however, a simplified io_remap_pfn_range() for common driver use: vm_iomap_memory(). This lite variant is defined as the following:
int vm_iomap_memory(struct vm_area_struct *vma,
phys_addr_t start, unsigned long len)
In the preceding function, vma is the user VMA to map to. start is the start of the I/O memory region to be mapped (as it would have been given to ioremap()), and len is the size of area. With vm_iomap_memory(), the driver just needs to give us the physical memory range to be mapped; the function will figure out the rest from the vma information. As with io_remap_pfn_range(), it returns 0 on success or a negative error code otherwise.
While caching is generally a good idea, it can introduce side effects, especially if, for a memory-mapped device (or even RAM), the values written to the mmap'ed registers must be instantaneously visible to the device.
You should note that, by default, the kernel remaps memory to user space with caching and buffering enabled. To change the default behavior, drivers must disable the cache on the VMA before invoking the remapping API. In order to do so, the kernel provides pgprot_noncached(). In addition to caching, this function also disables the bufferability of the specifier region. This helper takes an initial VMA access protection and returns an updated version with the cache disabled.
It is used as follows:
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
While testing a driver that I've developed for a memory-mapped device, I faced an issue where I had roughly 20 ms of latency (the time between when I updated the device register in user space through the mmap'ed area and the time when it was visible to the device) when caching was used.
After disabling the cache, this latency almost went away, as it fell below 200 µs. Amazing!
From user space, the mmap() system call is used to map physical memory into the address space of the calling process. In order to support this system call in a driver, this driver must implement the file_operations.mmap hook. After the mapping has been done, the user process will be able to write directly into the device memory via the returned address. The kernel will translate any accesses to that mapped region of memory through the usual pointer dereference into file operations.
The mmap() system call is declared as follows:
int mmap (void *addr, size_t len, int prot,
int flags, int fd, ff_t offset);
From the kernel side, the mmap field in the driver's file operation structure (struct file_operations structure) has the following prototype:
int (*mmap)(struct file *filp,
struct vm_area_struct *vma);
In the preceding file operation function, filp is a pointer to the open device file for the driver that results from the translation of the fd parameter (given in the system call). vma is allocated and given as parameter by the kernel. It points to the user process's VMA where the mapping should go. To understand how the kernel creates the new VMA, it uses the parameters given to the mmap() system call, which somehow affect some fields of the VMA as follows:
With all these parameters defined, we can split the mmap file operation implementation into the following steps:
unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
if (offset >= buffer_size)
return -EINVAL;
unsigned long size = vma->vm_end - vma->vm_start;
if (buffer_size < (size + offset))
return -EINVAL;
unsigned long pfn;
pfn = virt_to_phys(buffer + offset) >> PAGE_SHIFT;
if (remap_pfn_range(vma, vma->vm_start, pfn,
size, vma->vm_page_prot)) {
return -EAGAIN;
}
return 0;
static const struct file_operations my_fops = {
.owner = THIS_MODULE,
[...]
.mmap = my_mmap,
[...]
};
This file operation implementation closes our series on memory mappings. In this section, we have learned how mappings work under the hood and all the mechanisms involved, caching considerations included.
This chapter is one of the most important chapters. It demystifies memory management and allocation (how and where) in the Linux kernel. It teaches in detail how mapping and address translation work. Some other aspects, such as talking with hardware devices and remapping memory for user space (on behalf of the mmap() system call), were discussed in detail.
This provides a strong base to introduce and understand the next chapter, which deals with Direct Memory Access (DMA).