As we have seen, in Version 2.4 of Linux, there is no substantial difference between accessing a regular file through the filesystem, accessing it by referencing its blocks on the underlying block device file, or even establishing a file memory mapping. There are, however, some highly sophisticated programs (self-caching applications) that would like to have full control of the whole I/O data transfer mechanism. Consider, for example, high-performance database servers: most of them implement their own caching mechanisms that exploit the peculiar nature of the queries to the database. For these kinds of programs, the kernel page cache doesn’t help; on the contrary, it is detrimental for the following reasons:
- Lots of page frames are wasted to duplicate disk data already in RAM (in the user-level disk cache).
- The read( ) and write( ) system calls are slowed down by the redundant instructions that handle the page cache and the read-ahead; ditto for the paging operations related to the file memory mappings.
- Rather than transferring the data directly between the disk and the user memory, the read( ) and write( ) system calls make two transfers: between the disk and a kernel buffer, and between the kernel buffer and the user memory.
Since block hardware devices must be handled through interrupts and Direct Memory Access (DMA), and this can be done only in Kernel Mode, some sort of kernel support is definitely required to implement self-caching applications.
Version 2.4 of Linux offers a simple way to bypass the page cache: direct I/O transfers. In each direct I/O transfer, the kernel programs the disk controller to transfer the data directly from/to pages belonging to the User Mode address space of a self-caching application.
As we know, any data transfer proceeds asynchronously. While it is in progress, the kernel may switch the current process, the CPU may return to User Mode, the pages of the process that raised the data transfer might be swapped out, and so on. This works just fine for ordinary I/O data transfers because they involve pages of the disk caches. Disk caches are owned by the kernel, cannot be swapped out, and are visible to all processes in Kernel Mode.
On the other hand, direct I/O transfers should move data within pages that belong to the User Mode address space of a given process. The kernel must take care that these pages are accessible by any process in Kernel Mode and that they are not swapped out while the data transfer is in progress. This is achieved thanks to the “direct access buffers.”
A direct access buffer
consists of a set of physical page
frames reserved for direct I/O data transfers, which are mapped both
by the User Mode Page Tables of a self-caching application and by the
kernel Page Tables (the Kernel Mode Page Tables of each process).
Each direct access buffer is described by a kiobuf
data structure, whose fields are shown in Table 15-2.
Table 15-2. The fields of the direct access buffer descriptor

Type | Field | Description
---|---|---
int | nr_pages | Number of pages in the direct access buffer
int | array_len | Number of free elements in the maplist array
int | offset | Offset to valid data inside the first page of the direct access buffer
int | length | Length of valid data inside the direct access buffer
struct page ** | maplist | List of page descriptor pointers referring to pages in the direct access buffer (usually points to the map_array field)
int | locked | Lock flag for all pages in the direct access buffer
struct page * [129] | map_array | Array of 129 page descriptor pointers
struct buffer_head * [1024] | bh | Array of 1,024 preallocated buffer head pointers
unsigned long [1024] | blocks | Array of 1,024 logical block numbers
atomic_t | io_count | Atomic flag that indicates whether I/O is in progress
int | errno | Error number of last I/O operation
void (*)(struct kiobuf *) | end_io | Completion method
wait_queue_head_t | wait_queue | Queue of processes waiting for I/O to complete
Suppose a self-caching application wishes to directly access a file.
As a first step, the application opens the file specifying the
O_DIRECT
flag (see Section 12.6.1). While servicing the open( )
system call, the dentry_open( )
function checks the value of this flag; if it is set, the function
invokes alloc_kiovec( )
, which allocates a new
direct access buffer descriptor and stores its address into the
f_iobuf
field of the file object. Initially the
buffer includes no page frames, so the nr_pages
field of the descriptor stores the value 0. The
alloc_kiovec( ) function, however, preallocates 1,024
buffer heads, whose addresses are stored in the bh
array of the descriptor. These buffer heads ensure that the
self-caching application is not blocked while directly accessing the
file (recall that ordinary data transfers block if no free buffer
heads are available). A drawback of this approach, however, is that
data transfers must be done in chunks of at most 512 KB.
Next, suppose the self-caching application issues a read( )
or write( )
system call on the file
opened with O_DIRECT
. As mentioned earlier in this
chapter, the generic_file_read( )
and
generic_file_write( )
functions check the value of
the flag and handle the case in a special way. For instance, the
generic_file_read( )
function executes a code
fragment essentially equivalent to the following:
    if (filp->f_flags & O_DIRECT) {
        inode = filp->f_dentry->d_inode->i_mapping->host;
        if (count == 0 || *ppos >= inode->i_size)
            return 0;
        if (*ppos + count > inode->i_size)
            count = inode->i_size - *ppos;
        retval = generic_file_direct_IO(READ, filp, buf, count, *ppos);
        if (retval > 0)
            *ppos += retval;
        UPDATE_ATIME(filp->f_dentry->d_inode);
        return retval;
    }
The function checks the current values of the file pointer, the file
size, and the number of requested characters, and then invokes the
generic_file_direct_IO( )
function, passing to it
the READ
operation type, the file object pointer,
the address of the User Mode buffer, the number of requested bytes,
and the file pointer. The generic_file_write( )
function is similar, but of course it passes the
WRITE
operation type to the
generic_file_direct_IO( )
function.
The generic_file_direct_IO( )
function performs
the following steps:
1. Tests and sets the f_iobuf_lock lock in the file object. If it was already set, the direct access buffer descriptor stored in f_iobuf is already in use by a concurrent direct I/O transfer, so the function allocates a new direct access buffer descriptor and uses it in the following steps.

2. Checks that the file pointer offset and the number of requested characters are multiples of the block size of the file; returns -EINVAL if they are not.

3. Checks that the direct_IO method of the address_space object of the file (filp->f_dentry->d_inode->i_mapping) is defined; returns -EINVAL if it isn't.

4. Even if the self-caching application is accessing the file directly, there could be other applications in the system that access the file through the page cache. To avoid data loss, the disk image is synchronized with the page cache before starting the direct I/O transfer. The function flushes the dirty pages belonging to memory mappings of the file to disk by invoking the filemap_fdatasync( ) function (see the previous section).

5. Flushes to disk the dirty pages updated by write( ) system calls by invoking the fsync_inode_data_buffers( ) function, and waits until the I/O transfer terminates.

6. Invokes the filemap_fdatawait( ) function to wait until the I/O operations started in Step 4 complete (see the previous section).

7. Starts a loop, and divides the data to be transferred into chunks of 512 KB. For every chunk, the function performs the following substeps:

   a. Invokes map_user_kiobuf( ) to establish a mapping between the direct access buffer and the portion of the user-level buffer corresponding to the chunk. To achieve this, the function:

      - Invokes expand_kiobuf( ) to allocate a new array of page descriptor addresses in case the array embedded in the direct access buffer descriptor is too small. This is not the case here, however, because the 129 entries in the map_array field suffice to map a chunk of 512 KB (notice that the additional page is required when the buffer is not page-aligned).

      - Accesses all user pages in the chunk (allocating them when necessary by simulating Page Faults) and stores their addresses in the array pointed to by the maplist field of the direct access buffer descriptor.

      - Properly initializes the nr_pages, offset, and length fields, and resets the locked field to 0.

   b. Invokes the direct_IO method of the address_space object of the file (explained next).

   c. If the operation type was READ, invokes mark_dirty_kiobuf( ) to mark the pages mapped by the direct access buffer as dirty.

   d. Invokes unmap_kiobuf( ) to release the mapping between the chunk and the direct access buffer, and then continues with the next chunk.

8. If the function allocated a temporary direct access buffer descriptor in Step 1, it releases it. Otherwise, it releases the f_iobuf_lock lock in the file object.
In almost all cases, the direct_IO
method is a
wrapper for the generic_direct_IO( )
function,
passing it the address of the usual filesystem-dependent function
that computes the position of the physical blocks on the block device
(see the earlier Section 15.1.1). This function executes the
following steps:
1. For each block of the file portion corresponding to the current chunk, invokes the filesystem-dependent function to determine its logical block number, and stores this number in an entry of the blocks array in the direct access buffer descriptor. The 1,024 entries of the array suffice because the minimum block size in Linux is 512 bytes.

2. Invokes the brw_kiovec( ) function, which essentially calls the submit_bh( ) function on each block in the blocks array using the buffer heads stored in the bh array of the direct access buffer descriptor. The direct I/O operation is similar to a buffer or page I/O operation, but the b_end_io method of the buffer heads is set to the special function end_buffer_io_kiobuf( ) rather than to end_buffer_io_sync( ) or end_buffer_io_async( ) (see Section 13.4.8). The method deals with the fields of the kiobuf data structure. brw_kiovec( ) does not return until the I/O data transfers are completed.