The Linux I/O code path in brief

To understand what the issue is, we must first gain a deeper understanding of how the I/O code path actually works; the following diagram captures the points of relevance:

Figure 1: Page cache populated with disk data
The reader should realize that, though this diagram seems quite detailed, we're actually seeing a rather simplified view of the entire Linux I/O code path (or I/O stack), showing only what is relevant to this discussion. For a more detailed overview (and diagram), please see the link provided in the Further reading section on the GitHub repository.

Let's say that a process, P1, intends to read some 12 KB of data from a target file that it has open (via the open(2) system call); we envision that it does so in the usual manner (a minimal code sketch follows this list):

  • Allocate a heap buffer of 12 KB (3 pages = 12,288 bytes) via the malloc(3) API.
  • Issue the read(2) system call to read in the data from the file into the heap buffer.
    • The read(2) system call performs the work within the OS; when the read is done, it returns (hopefully the value 12,288; remember, it's the programmer's job to check this and not assume anything).
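
Here is a minimal sketch of this usual pattern; it is illustrative only (the filename "datafile" is a placeholder, and a 4 KB page size is assumed):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define LEN (3 * 4096)    /* 12 KB = 12,288 bytes = 3 pages (assuming 4 KB pages) */

int main(void)
{
    char *buf = malloc(LEN);              /* step 1: allocate a 12 KB heap buffer */
    if (!buf) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    int fd = open("datafile", O_RDONLY);  /* placeholder target file */
    if (fd < 0) {
        perror("open");
        free(buf);
        exit(EXIT_FAILURE);
    }

    ssize_t n = read(fd, buf, LEN);       /* step 2: read the data into the heap buffer */
    if (n < 0)
        perror("read");
    else                                  /* step 3: check; never assume all 12,288 bytes arrived */
        printf("read(2) returned %zd bytes\n", n);

    close(fd);
    free(buf);
    return 0;
}
```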

This sounds simple, but there's a lot more that happens under the hood, and it is in our interest to dig a little deeper. Here's a more detailed view of what happens (the numbered steps that follow are shown circled in the previous diagram; follow along):

  1. Process P1 allocates a heap buffer of 12 KB via the malloc(3) API (len = 12 KB = 12,288 bytes).
  2. Next, it issues a read(2) system call to read data from the file (specified by fd) into the heap buffer buf just allocated, for length 12 KB.
  3. As read(2) is a system call, the process (or thread) now switches to kernel mode (remember the monolithic design we covered back in Chapter 1, Linux System Architecture); it enters the Linux kernel's generic filesystem layer (the Virtual Filesystem Switch (VFS)), from where it is auto-shunted onto the appropriate underlying filesystem driver (perhaps the ext4 fs). The Linux kernel then first checks: are the required pages of file data already cached in our page cache? If yes, the job is done (we short-circuit to step 7): just copy the pages back to the user space buffer. Let's say we get a cache miss; the required file data pages aren't in the page cache.
  4. Thus, the kernel first allocates sufficient RAM (page frames) for the page cache (in our example, three frames, shown as pink squares within the page cache memory region). It then fires off the appropriate I/O requests to the underlying layers, requesting the file data.
  5. The request ultimately ends up at the block (storage) driver; we assume it knows its job and reads the required data blocks from the underlying storage device controller (a disk or flash controller chip, perhaps). Here's the interesting thing: the destination address the block driver is given to write the file data to is the address of the page frames allocated (in step 4) within the page cache; thus, the block driver always writes the file data into the kernel's page cache and never directly into the user mode process buffers.
  6. The block driver has now successfully copied the data blocks from the storage device into the previously allocated frames within the kernel's page cache. (In reality, these data transfers are highly optimized via an advanced memory transfer technique called Direct Memory Access (DMA), wherein, essentially, the driver exploits the hardware to transfer data directly between the device and system memory without the CPU's intervention. Obviously, these topics are well beyond the scope of this book.)
  7. The just-populated kernel page cache frames are now copied into the user space heap buffer by the kernel.
  8. The (blocking) read(2) system call now terminates, returning the value 12,288, indicating that all three pages of file data have indeed been transferred (again, you, the app developer, are supposed to check this return value and not assume anything).
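
If you'd like to observe the page cache effect described in step 3 from user space, here is a rough, illustrative sketch (my own, not the book's code; "datafile" is a placeholder) that reads the same 12 KB region twice and times each read(2). The second read is typically much faster, as the data is then very likely served from the page cache; note that the first read may also hit the cache if the file was recently accessed:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define LEN (3 * 4096)

static double elapsed_ms(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1000000.0;
}

int main(void)
{
    char *buf = malloc(LEN);
    if (!buf) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    for (int i = 1; i <= 2; i++) {
        int fd = open("datafile", O_RDONLY);   /* placeholder target file */
        if (fd < 0) {
            perror("open");
            free(buf);
            exit(EXIT_FAILURE);
        }

        struct timespec t1, t2;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        ssize_t n = read(fd, buf, LEN);        /* 1st: likely cache miss; 2nd: cache hit */
        clock_gettime(CLOCK_MONOTONIC, &t2);

        if (n < 0)
            perror("read");
        else
            printf("read #%d: %zd bytes in %.3f ms\n", i, n, elapsed_ms(t1, t2));
        close(fd);
    }
    free(buf);
    return 0;
}
```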

It's all looking great, yes? Well, not really; think carefully about this: though the read(2) (or pread[v][2](2)) API did succeed, the success came at a considerable price. The kernel had to allocate RAM (page frames) to hold the file data within its page cache (step 4) and, once the data transfer was done (step 6), copy that content into the user space heap memory (step 7). Thus, we have used twice the amount of RAM we should have, by keeping an extra copy of the data. This is highly wasteful, and, obviously, copying the data buffers around multiple times (from the block driver into the kernel page cache, and then from the kernel page cache into the user space heap buffer) reduces performance as well (not to mention that the CPU caches get unnecessarily caught up in all of this, thrashing their content). With the previous pattern of code, the issue of not waiting on the slow storage device is taken care of (via the page cache's efficiencies), but everything else is really poor: we have actually doubled the required memory usage, and the CPU caches get overwritten with (unnecessary) file data while the copying takes place.
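
One rough way to see the page cache consuming RAM, as just described, is to sample the Cached: field of /proc/meminfo before and after pulling a large, not-recently-read file through read(2). The sketch below (my own, purely illustrative) does just that; treat the numbers as approximate, since the page cache is shared system-wide and fluctuates constantly:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Return the "Cached:" value from /proc/meminfo, in kB (or -1 on failure) */
static long cached_kb(void)
{
    FILE *fp = fopen("/proc/meminfo", "r");
    char line[256];
    long kb = -1;

    if (!fp)
        return -1;
    while (fgets(line, sizeof(line), fp)) {
        if (sscanf(line, "Cached: %ld kB", &kb) == 1)
            break;
    }
    fclose(fp);
    return kb;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <large-file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    long before = cached_kb();

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    char buf[65536];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;   /* just pull the file's data through the page cache */
    if (n < 0)
        perror("read");
    close(fd);

    long after = cached_kb();
    printf("Cached: %ld kB -> %ld kB (delta: %ld kB)\n",
           before, after, after - before);
    return 0;
}
```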
