Typical block devices like hard disks have very high average access times. Each operation requires several milliseconds to complete, mainly because the hard disk controller must move the heads on the disk surface to reach the exact position where the data is recorded. However, when the heads are correctly placed, data transfer can be sustained at rates of tens of megabytes per second.
To achieve acceptable performance, hard disks and similar devices transfer several adjacent bytes at once. In the following discussion, we say that groups of bytes are adjacent when they are recorded on the disk surface in such a manner that a single seek operation can access them.
The organization of Linux block device handlers is quite involved. We won’t be able to discuss in detail all the functions that are included in the kernel to support the handlers. But we outline the general software architecture and introduce the main data structures. Kernel support for block device handlers includes the following features:
A uniform interface through the VFS
Efficient read-ahead of disk data
Disk caching for the data
When a block device file is being opened, the kernel must determine whether the device file is already open. In fact, if the file is already open, the kernel must not initialize the corresponding block device driver.
This problem is not as easy as it appears at first look. On the one hand, we stated in Section 13.2 that block device files that have the same major number are usually associated with the same block device driver. However, each block device driver that handles more than one minor number can be considered several specialized block device drivers, so this case doesn’t create problems. In the rest of this section, when we use the term “block device driver,” we mean the kernel layer that handles I/O data transfers from/to a hardware device specified by both a major number and a minor number.
A real complication, however, is that block device files that have the same major and minor numbers but different pathnames are regarded by the VFS as different files, but they really refer to the same block device driver. Therefore, the kernel cannot determine whether a block device driver is already in use by simply checking for the existence in the inode cache of an object for the block device file.
To keep track of which block device drivers are currently in use, the kernel uses a hash table indexed by the major and minor numbers. Every time a block device file is opened, the kernel checks whether the corresponding block device driver, identified by the major and minor numbers, is already stored in the hash table. If so, the block device driver is already in use. Notice that the hash function works on the major and minor numbers of the block device file, so it doesn’t matter whether the block device driver was previously activated by accessing a given block device file or another one that has the same major and minor numbers. Conversely, if no descriptor associated with the given major and minor numbers is found, the kernel inserts a new element into the hash table.
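As a rough illustration, the lookup-or-insert step can be sketched in user-space C. Everything below (structure layout, hash function, function names) is a simplified stand-in rather than the kernel's actual code; only the 64-list size comes from the text:

```c
#include <stdlib.h>

#define NR_HASH_LISTS 64               /* bdev_hashtable has 64 lists */

struct bdev {
    unsigned int major, minor;
    int use_count;
    struct bdev *next;                 /* collision chain */
};

static struct bdev *hash_table[NR_HASH_LISTS];

/* Any reasonable mix of major and minor works as a hash key. */
static unsigned int bdev_hash(unsigned int major, unsigned int minor)
{
    return (major * 31 + minor) % NR_HASH_LISTS;
}

/* Return the descriptor for (major, minor), allocating one if the
 * driver is not yet in use.  The lookup succeeds no matter which
 * pathname was used to open the device file, because only the device
 * numbers take part in the hash. */
struct bdev *bdev_get(unsigned int major, unsigned int minor)
{
    unsigned int h = bdev_hash(major, minor);
    struct bdev *b;

    for (b = hash_table[h]; b != NULL; b = b->next)
        if (b->major == major && b->minor == minor) {
            b->use_count++;            /* driver already in use */
            return b;
        }

    b = calloc(1, sizeof(*b));         /* driver not in use: new element */
    b->major = major;
    b->minor = minor;
    b->use_count = 1;
    b->next = hash_table[h];
    hash_table[h] = b;
    return b;
}
```

Two opens of device files with the same major and minor numbers yield the same descriptor, regardless of pathname.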
The hash table array is stored in the bdev_hashtable variable; it includes 64 lists of block device descriptors. Each descriptor is a block_device data structure whose fields are shown in Table 13-4.
Table 13-4. The fields of the block device descriptor
Type | Field | Description
---|---|---
struct list_head | bd_hash | Pointers for the hash table list
atomic_t | bd_count | Usage counter for the block device descriptor
struct inode * | bd_inode | Pointer to the main inode object of the block device driver
dev_t | bd_dev | Major and minor numbers of the block device
int | bd_openers | Counter of how many times the block device driver has been opened
struct block_device_operations * | bd_op | Pointer to the block device driver operation table
struct semaphore | bd_sem | Semaphore protecting the block device driver
struct list_head | bd_inodes | List of inodes of opened block device files for this driver
The bd_inodes
field of the block device descriptor
stores the head (the first dummy element) of a doubly linked circular
list of inodes relative to opened block device files that refer to
the block device driver. The i_devices
field of
the inode object stores the pointers for the previous and next
element in this list.
Each block device descriptor stores in the
bd_inode
field the address of a special
block device inode
object for the driver. This inode
doesn’t correspond to a disk file; rather, it
belongs to the bdev special filesystem (see
Section 12.3.1). Essentially, the block device
inode stores the “master copy” of
the information shared by all inode objects of the block device files
that refer to the same block device.
Let’s
now describe how a block device driver is initialized. We already described in Section 13.2.3 how the kernel customizes the methods of the file object when a block device file is opened. Its f_op
field is set to the address of the def_blk_fops
variable. The contents of this table are shown in Table 13-5. The dentry_open( )
function checks whether the open
method is
defined; this is always true for a block device file, so the
blkdev_open( )
function is executed.
This function checks whether the block device driver is already in use:
    bd_acquire(inode);
    do_open(inode->i_bdev, filp);
The bd_acquire( )
function essentially executes
the following operations:
Checks whether the block device file corresponding to the inode
object is already open (in this case,
inode->i_bdev
field points to the block device
descriptor). If the file is already open, increments the usage
counter of the block device descriptor
(inode->i_bdev->bd_count
) and returns.
Looks up the block device driver in the hash table using the major
and minor numbers stored in inode->i_rdev
. If the
descriptor is not found because the driver is not in use, allocates a
new block_device
and a new inode object for the
block device, and then inserts the new descriptor in the hash table.
Stores the address of the block device driver descriptor in
inode->i_bdev
.
Adds inode
to the list of inodes of the driver
descriptor.
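The steps above can be condensed into a user-space sketch. The types and helper below are simplified stand-ins (with a counter in place of the real bd_inodes list), meant only to show how the cached i_bdev pointer short-circuits the hash lookup on later opens:

```c
#include <stddef.h>

struct block_device_sim {
    int bd_count;                  /* usage counter */
    int bd_inode_count;            /* inodes on the bd_inodes list */
};

struct inode_sim {
    int i_rdev;                    /* major and minor numbers */
    struct block_device_sim *i_bdev;
};

static int hash_lookups;           /* counts trips to the hash table */

/* Stand-in for the hash-table lookup-or-allocate step. */
static struct block_device_sim *hash_find_or_alloc(int rdev)
{
    static struct block_device_sim table[16];  /* toy "hash table" */
    hash_lookups++;
    return &table[rdev % 16];
}

void bd_acquire_sim(struct inode_sim *inode)
{
    if (inode->i_bdev) {           /* file already open */
        inode->i_bdev->bd_count++;
        return;
    }
    inode->i_bdev = hash_find_or_alloc(inode->i_rdev);
    inode->i_bdev->bd_count++;
    inode->i_bdev->bd_inode_count++;  /* add inode to the driver's list */
}
```

A second call with the same inode bumps the usage counter without touching the hash table.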
Next, blkdev_open( )
invokes do_open( )
, which executes the following main steps:
If the bd_op
field of the block device driver
descriptor is NULL
, initializes it from the
blkdevs
table’s element
corresponding to the major number of the block device file
Invokes the open
method of the block device driver
descriptor (bd_op->open
) if it is defined
Increments the bd_openers
counter of the block
device driver descriptor
Sets the i_size
and i_blkbits
fields of the block device inode object (bd_inode
)
The open
method of the block device driver
descriptor can further customize the methods of the block device
driver, allocate resources, and take other measures based on the
minor number of the block device file.
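The do_open( ) steps can be sketched as follows; the tables and types here are hypothetical stand-ins for blkdevs and the block device descriptor, not the kernel's declarations:

```c
#include <stddef.h>

struct bdev_ops_sim {
    int (*open)(void);             /* driver open method, may be NULL */
};

struct bdev_sim {
    struct bdev_ops_sim *bd_op;
    int bd_openers;
};

static int opens_seen;
static int dummy_open(void) { opens_seen++; return 0; }

/* Stand-in for the blkdevs[] table indexed by major number. */
static struct bdev_ops_sim blkdevs_sim[8] = { [3] = { dummy_open } };

int do_open_sim(struct bdev_sim *bdev, int major)
{
    if (bdev->bd_op == NULL)             /* first open: bind the ops */
        bdev->bd_op = &blkdevs_sim[major];
    if (bdev->bd_op->open) {             /* invoke driver open method */
        int err = bdev->bd_op->open();
        if (err)
            return err;
    }
    bdev->bd_openers++;                  /* one more opener */
    return 0;
}
```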
Among other things, the device driver initialization function must
determine the size of the physical block device corresponding to the
device file. This length, represented in 1,024-byte units, is stored
in the blk_size
global array indexed by both the
major and minor number of the device file.
Each data transfer operation for a block device acts on a group of adjacent bytes called a sector . In most disk devices, the size of a sector is 512 bytes, although there are devices that use larger sectors (1,024 and 2,048 bytes). Notice that the sector should be considered the basic unit of data transfer; it is never possible to transfer less than a sector, although most disk devices are capable of transferring several adjacent sectors at once.
The kernel stores the sector size of each hardware block device in a
table named hardsect_size
. Each element in the
table is indexed by the major number and the minor number of the
corresponding block device file. Thus,
hardsect_size[3][2]
represents the sector size of
/dev/hda2
, which is the second primary partition
of the first IDE disk (see Table 13-2). If
hardsect_size[maj] is NULL, all block devices sharing the major number maj have a standard sector size of 512 bytes.
Block device drivers transfer a large number of adjacent bytes called a block in a single operation. A block should not be confused with a sector. The sector is the basic unit of data transfer for the hardware device, while the block is simply a group of adjacent bytes involved in an I/O operation requested by a device driver.
In Linux, the block size must be a power of 2 and cannot be larger than a page frame. Moreover, it must be a multiple of the sector size, since each block must include an integral number of sectors. Therefore, on PC architecture, the permitted block sizes are 512, 1,024, 2,048, and 4,096 bytes. The same block device driver may operate with several block sizes, since it has to handle a set of device files sharing the same major number, while each block device file has its own predefined block size. For instance, a block device driver could handle a hard disk with two partitions containing an Ext2 filesystem and a swap area (see Chapter 16 and Chapter 17). In this case, the device driver uses two different block sizes: 1,024 bytes for the Ext2 partition and 4,096 bytes for the swap partition.
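These constraints are straightforward to check. A small validity test, assuming the PC's 4,096-byte page frames:

```c
#define PAGE_SIZE 4096

/* Returns 1 if blocksize is legal per the rules above: a power of two,
 * no larger than a page frame, and a multiple of the sector size. */
int blocksize_valid(unsigned int blocksize, unsigned int sectsize)
{
    if (blocksize == 0 || blocksize > PAGE_SIZE)
        return 0;                      /* larger than a page frame */
    if (blocksize & (blocksize - 1))
        return 0;                      /* not a power of two */
    if (blocksize % sectsize != 0)
        return 0;                      /* not a whole number of sectors */
    return 1;
}
```

With 512-byte sectors, this accepts exactly the four block sizes listed above.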
The kernel stores the block size in a table named
blksize_size
; each element in the table is indexed
by the major number and the minor number of the corresponding block
device file. If
blksize_size[maj] is NULL, all block devices sharing the major number maj have a standard block size of 1,024 bytes. (You should not confuse blk_size with the blksize_size array: blksize_size stores the block sizes used by the block devices, while blk_size stores the sizes of the block devices themselves.)
Each block requires its own buffer , which is a RAM memory area used by the kernel to store the block’s content. When a device driver reads a block from disk, it fills the corresponding buffer with the values obtained from the hardware device; similarly, when a device driver writes a block on disk, it updates the corresponding group of adjacent bytes on the hardware device with the actual values of the associated buffer. The size of a buffer always matches the size of the corresponding block.
The buffer head
is a descriptor of type buffer_head
associated
with each buffer. It contains all the information needed by the
kernel to know how to handle the buffer; thus, before operating on
each buffer, the kernel checks its buffer head.
The buffer head fields are listed in Table 13-6.
The b_data
field of each buffer head stores the
starting address of the corresponding buffer. Since a page frame may
store several buffers, the b_this_page
field
points to the buffer head of the next buffer in the page. This field
facilitates the storage and retrieval of entire page frames (see
Section 13.4.8.2 later in this
chapter). The b_blocknr
field stores the
logical block number
(i.e., the index of the block
inside the disk partition).
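Traversing such a circular per-page list can be sketched with simplified types (not the kernel's actual declarations):

```c
struct buffer_head_sim {
    char *b_data;                           /* buffer contents */
    struct buffer_head_sim *b_this_page;    /* next buffer in same page */
};

/* Count the buffers stored in one page frame by following the
 * circular b_this_page list starting from any buffer head in it. */
int buffers_per_page(struct buffer_head_sim *first)
{
    int n = 0;
    struct buffer_head_sim *bh = first;

    do {
        n++;
        bh = bh->b_this_page;
    } while (bh != first);                  /* wrapped back to the start */
    return n;
}
```

A 4,096-byte page holding 1,024-byte blocks would yield a list of four buffer heads.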
Table 13-6. The fields of a buffer head
Type | Field | Description
---|---|---
struct buffer_head * | b_next | Next item in collision hash list
unsigned long | b_blocknr | Logical block number
unsigned short | b_size | Block size
unsigned short | b_list | LRU list including the buffer head
kdev_t | b_dev | Virtual device identifier
atomic_t | b_count | Block usage counter
kdev_t | b_rdev | Real device identifier
unsigned long | b_state | Buffer status flags
unsigned long | b_flushtime | Flushing time for buffer
struct buffer_head * | b_next_free | Next item in list of buffer heads
struct buffer_head * | b_prev_free | Previous item in list of buffer heads
struct buffer_head * | b_this_page | Per-page buffer list
struct buffer_head * | b_reqnext | Next item in the request queue
struct buffer_head ** | b_pprev | Previous item in collision hash list
char * | b_data | Pointer to buffer
struct page * | b_page | Pointer to the descriptor of the page that stores the buffer
void (*)( ) | b_end_io | I/O completion method
void * | b_private | Specialized device driver data
unsigned long | b_rsector | Block number on real device
wait_queue_head_t | b_wait | Buffer wait queue
struct inode * | b_inode | Pointer to inode object to which the buffer belongs
struct list_head | b_inode_buffers | Pointers for list of inode buffers
The b_state
field stores the following flags:
BH_Uptodate
Set if the buffer contains valid data. The value of this flag is
returned by the buffer_uptodate( )
macro.
BH_Dirty
Set if the buffer is dirty—that is, if it contains data that
must be written to the block device. The value of this flag is
returned by the buffer_dirty( )
macro.
BH_Lock
Set if the buffer is locked, which happens if the buffer is involved
in a disk transfer. The value of this flag is returned by the
buffer_locked( )
macro.
BH_Req
Set if the corresponding block is requested (see the next section)
and has valid (up-to-date) data. The value of this flag is returned
by the buffer_req( )
macro.
BH_Mapped
Set if the buffer is mapped to disk—that is, if the
b_dev
and b_blocknr
fields of
the corresponding buffer head are significant. The value of this flag
is returned by the buffer_mapped( )
macro.
BH_New
Set if the corresponding file block has just been allocated and has
never been accessed. The value of this flag is returned by the
buffer_new( )
macro.
BH_Async
Set if the buffer is being processed by end_buffer_io_async( )
(described later in Section 13.4.8.2). The value of this flag is
returned by the buffer_async( )
macro.
BH_Wait_IO
Used to delay flushing the buffer to disk when reclaiming memory (see Chapter 16).
BH_launder
Set when the buffer is being flushed to disk when reclaiming memory (see Chapter 16).
BH_JBD
Set if the buffer is used by a journaling filesystem (see Chapter 17).
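Each buffer_xxx( ) macro boils down to testing a single bit of b_state. A sketch with illustrative bit positions (the kernel's actual flag values may differ):

```c
/* Illustrative flag positions within b_state; each test macro simply
 * extracts one bit, mirroring how the kernel's buffer_xxx( ) macros work. */
enum bh_state_bits { BH_Uptodate, BH_Dirty, BH_Lock, BH_Req, BH_Mapped, BH_New };

#define buffer_uptodate(bstate) (((bstate) >> BH_Uptodate) & 1UL)
#define buffer_dirty(bstate)    (((bstate) >> BH_Dirty) & 1UL)
#define buffer_locked(bstate)   (((bstate) >> BH_Lock) & 1UL)
#define buffer_mapped(bstate)   (((bstate) >> BH_Mapped) & 1UL)
```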
The b_dev
field identifies the virtual device
containing the block stored in the buffer, while the
b_rdev
field identifies the real device. This
distinction, which is meaningless for simple hard disks, has been
introduced to model Redundant Array of Independent Disks (RAID)
storage units consisting of several disks operating in parallel. For
reasons of safety and efficiency, files stored in a RAID array are
scattered across several disks that the applications think of as a
single logical disk. Besides the b_blocknr
field
representing the logical block number, it is necessary to specify the
specific disk unit in the b_rdev
field and the
corresponding sector number in the b_rsector
field.
Although block device drivers are able to transfer a single block at a time, the kernel does not perform an individual I/O operation for each block to be accessed on disk; this would lead to poor disk performances, since locating the physical position of a block on the disk surface is quite time-consuming. Instead, the kernel tries, whenever possible, to cluster several blocks and handle them as a whole, thus reducing the average number of head movements.
When a process, the VFS layer, or any other kernel component wishes to read or write a disk block, it actually creates a block device request . That request essentially describes the requested block and the kind of operation to be performed on it (read or write). However, the kernel does not satisfy a request as soon as it is created—the I/O operation is just scheduled and will be performed at a later time. This artificial delay is paradoxically the crucial mechanism for boosting the performance of block devices. When a new block data transfer is requested, the kernel checks whether it can be satisfied by slightly enlarging a previous request that is still waiting (i.e., whether the new request can be satisfied without further seek operations). Since disks tend to be accessed sequentially, this simple mechanism is very effective.
Deferring requests complicates block device handling. For instance, suppose a process opens a regular file and, consequently, a filesystem driver wants to read the corresponding inode from disk. The block device driver puts the request on a queue and the process is suspended until the block storing the inode is transferred. However, the block device driver itself cannot be blocked because any other process trying to access the same disk would be blocked as well.
To keep the block device driver from being suspended, each I/O operation is processed asynchronously. Thus, no kernel control path is forced to wait until a data transfer completes. In particular, block device drivers are interrupt-driven (see Section 13.4.5.2 earlier in this chapter): a high-level driver creates a new block device request or enlarges an already existing block device request and then terminates. A low-level driver , which is activated at a later time, invokes a so-called strategy routine , which takes the request from a queue and satisfies it by issuing suitable commands to the disk controller. When the I/O operation terminates, the disk controller raises an interrupt and the corresponding handler invokes the strategy routine again, if necessary, to process another request in the queue.
Each block device driver maintains its own request queues ; there should be one request queue for each physical block device, so that the requests can be ordered in such a way as to increase disk performance. The strategy routine can thus sequentially scan the queue and service all requests with the minimum number of head movements.
Each block device request is
represented by a request descriptor
, which is stored in the
request
data structure illustrated in Table 13-7. The direction of the data transfer is stored
in the cmd
field; it is either
READ
(from block device to RAM) or
WRITE
(from RAM to block device). The
rq_status
field is used to specify the status of
the request; for most block devices, it is simply set either to
RQ_INACTIVE
(for a request descriptor not in use)
or to RQ_ACTIVE
(for a valid request that is to be
serviced or is already being serviced by the low-level driver).
Table 13-7. The fields of a request descriptor
Type | Field | Description
---|---|---
struct list_head | queue | Pointers for request queue list
int | elevator_sequence | The “age” of the request for the elevator algorithm
volatile int | rq_status | Request status
kdev_t | rq_dev | Device identifier
int | cmd | Requested operation
int | errors | Success or failure code
unsigned long | sector | First sector number on the (virtual) block device
unsigned long | nr_sectors | Number of sectors of the request on the (virtual) block device
unsigned long | hard_sector | First sector number of the (real) block device
unsigned long | hard_nr_sectors | Number of sectors of the request on the (real) block device
unsigned int | nr_segments | Number of segments in the request on the (virtual) block device
unsigned int | nr_hw_segments | Number of segments in the request on the (real) block device
unsigned long | current_nr_sectors | Number of sectors in the block currently transferred
void * | special | Used only by drivers of SCSI devices
char * | buffer | Memory area for I/O transfer
struct completion * | waiting | Wait queue associated with request
struct buffer_head * | bh | First buffer descriptor of the request
struct buffer_head * | bhtail | Last buffer descriptor of the request
request_queue_t * | q | Pointer to request queue descriptor
The request may encompass many adjacent blocks on the same device.
The rq_dev
field identifies the block device,
while the sector
field specifies the number of the
first sector of the first block in the request. The
nr_sectors field specifies the number of sectors in the request yet to be transferred. The current_nr_sectors field stores the number of sectors in the first block of the request. As we’ll see later in Section 13.4.7, the sector, nr_sectors, and current_nr_sectors fields can be dynamically updated while the request is being serviced.
The nr_segments
field stores the number of segments
in the request. Although all blocks in the requests must be adjacent
on the block device, their corresponding buffers are not necessarily
contiguous in RAM. A
segment
is a sequence of adjacent blocks in the request whose corresponding
buffers are also contiguous in memory. Of course, a low-level device
driver could program the DMA controller so as to transfer all blocks
in the same segment in a single operation.
The hard_sector
,
hard_nr_sectors
, and
nr_hw_segments
fields usually have the same value
as the sector
, nr_sectors
, and
nr_segments
fields, respectively. They differ,
however, when the request refers to a driver that handles several
physical block devices at once. A typical example of such a driver is
the Logical Volume Manager (LVM), which is able to handle several
disks and disk partitions as a single virtual disk partition. In this
case, the two series of fields differ because the former refers to
the real physical block device, while the latter refers to the
virtual device. Another example is software RAID, a driver that
duplicates data on several disks to enhance reliability.
All buffer heads of the blocks in the request are collected in a
simply linked list. The b_reqnext
field of each
buffer head points to the next element in the list, while the
bh
and bhtail
fields of the
request descriptor point, respectively, to the first element and the
last element in the list.
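Walking this list from bh to bhtail can be sketched with stand-in types:

```c
#include <stddef.h>

struct bh_sim {
    struct bh_sim *b_reqnext;      /* next buffer head in the request */
};

struct request_sim {
    struct bh_sim *bh;             /* first buffer head */
    struct bh_sim *bhtail;         /* last buffer head */
};

/* Count the blocks in the request by following b_reqnext;
 * the last element's b_reqnext is NULL. */
int count_request_blocks(const struct request_sim *req)
{
    int n = 0;

    for (struct bh_sim *p = req->bh; p != NULL; p = p->b_reqnext)
        n++;
    return n;
}
```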
The buffer
field of the request descriptor points
to the memory area used for the actual data transfer. If the request
involves a single block, buffer
is just a copy of
the b_data
field of the buffer head. However, if
the request encompasses several blocks whose buffers are not
consecutive in memory, the buffers are linked through the
b_reqnext
fields of their buffer heads as shown in
Figure 13-3. On a read, the low-level driver could
choose to allocate a large memory area referred to by
buffer
, read all sectors of the request at once,
and then copy the data into the various buffers. Similarly, for a
write, the low-level device driver could copy the data from many
nonconsecutive buffers into a single memory area referred to by
buffer
and then perform the whole data transfer at
once.
Figure 13-3 illustrates a request descriptor
encompassing three blocks. The buffers of two of them are consecutive
in RAM, while the third buffer is by itself. The corresponding buffer
heads identify the logical blocks on the block device; the blocks
must necessarily be adjacent. Each logical block includes two
sectors. The sector
field of the request
descriptor points to the first sector of the first block on disk, and
the b_reqnext
field of each buffer head points to
the next buffer head.
During the initialization phase, each block device driver usually
allocates a fixed number of request descriptors to handle its
forthcoming I/O requests. The blk_init_queue( )
function sets up two equally sized lists of free request descriptors:
one for the READ
operation and another for the
WRITE
operations. The size of these lists is set
to 64 if the RAM size is greater than 32 MB, or to 32 if the RAM size
is less than or equal to 32 MB. The status of all request descriptors
is set initially to RQ_INACTIVE
.
The fixed number of request descriptors may become, under very heavy
loads and high disk activity, a bottleneck. A dearth of free
descriptors may force processes to wait until an ongoing data
transfer terminates. Thus, a wait queue is used to queue processes
waiting for a free request
element. The
get_request_wait( )
function tries to get a free request
descriptor and puts the current process to sleep in the wait queue if
none is found; the get_request( )
function is
similar but simply returns NULL
if no free request
descriptor is available.
A threshold value known as batch_requests
(set to
32 or to 16, depending on the RAM size) is used to cut down kernel
overhead; when releasing a request descriptor, processes waiting for
free request descriptors are not woken up unless there are at least
batch_requests
free descriptors. Conversely, when
looking for a free request descriptor, get_request_wait( )
relinquishes the CPU if there are fewer than
batch_requests
free descriptors.
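The hysteresis can be sketched with plain counters standing in for the descriptor lists and wait queue; the wake-up fires only once the threshold is reached:

```c
#define BATCH_REQUESTS 32          /* threshold for larger-RAM systems */

static int free_descriptors;       /* free request descriptors */
static int wakeups;                /* times sleepers were woken */

/* Releasing a descriptor wakes waiting processes only when at least
 * batch_requests descriptors are free, cutting down wake-up overhead. */
void release_request_sim(void)
{
    free_descriptors++;
    if (free_descriptors >= BATCH_REQUESTS)
        wakeups++;                 /* wake_up(&wait_for_request) here */
}
```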
Request queues are
represented by means of request queue descriptors; each of them is a
request_queue_t
data structure whose fields are
listed in Table 13-8.
Table 13-8. The fields of a request queue descriptor
Type | Field | Description
---|---|---
struct request_list [2] | rq | Lists of free request descriptors (one for READ, one for WRITE)
struct list_head | queue_head | List of pending requests
elevator_t | elevator | Methods of the elevator algorithm
request_fn_proc * | request_fn | Strategy routine of the driver
merge_request_fn * | back_merge_fn | Method to append a block to the request
merge_request_fn * | front_merge_fn | Method to insert a block in front of the request
merge_requests_fn * | merge_requests_fn | Method to attempt merging an enlarged request with the adjacent ones
make_request_fn * | make_request_fn | Method that passes a request to a driver (usually it inserts the request in the proper queue)
plug_device_fn * | plug_device_fn | Method to plug the driver
void * | queuedata | Private data of the device driver
struct tq_struct | plug_tq | Task queue element for the plugging mechanism
char | plugged | Flag denoting whether the driver is currently plugged
char | head_active | Flag denoting whether the first request in queue is active when the driver is unplugged
spinlock_t | queue_lock | Request queue lock
wait_queue_head_t | wait_for_request | Wait queue for lack of request descriptors
When the kernel initializes a device driver, it creates and fills a request queue descriptor for each request queue handled by the driver.
Essentially, a request queue is a doubly linked list whose elements
are request descriptors (that is, request
data
structures). The queue_head
field of the request
queue descriptor stores the head (the first dummy element) of the
list, while the pointers in the queue
field of the
request descriptor link any request to the previous and next elements
in the list. The ordering of the elements in the queue list is
specific to each block device driver; the Linux kernel offers,
however, two predefined ways of ordering elements, which are
discussed later in Section 13.4.6.2.
Each block device driver may
define one or more request queues. To keep track of the request
queues of each driver, a low-level driver descriptor
is used. The descriptor is a data
structure of type blk_dev_struct
, whose fields are
listed in Table 13-9. The descriptors for all the
block devices are stored in the blk_dev
table,
which is indexed by the major number of the block device.
Table 13-9. The fields of a block device driver descriptor
Type | Field | Description
---|---|---
request_queue_t | request_queue | Common request queue (for drivers that do not define per-device queues)
queue_proc * | queue | Method returning the address of a per-device queue
void * | data | Data (e.g., the minor number) used by the queue method to select the proper queue
If the block device driver has a unique request queue for all
physical block devices, its address is stored in the
request_queue
field. Conversely, if the block
device driver maintains several queues, the queue
field points to a custom driver method that receives the identifier
of the block device file, selects one of the queues according to the
value of the data field, then returns the address of the proper
request queue.
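The queue-selection logic can be sketched like this; the types and the pick_per_device( ) method are hypothetical stand-ins for the kernel's declarations:

```c
#include <stddef.h>

struct request_queue_sim { int id; };

struct blk_dev_sim {
    struct request_queue_sim request_queue;        /* common queue */
    struct request_queue_sim *(*queue)(int dev);   /* per-device method */
    int data;
};

/* Mirror of the selection step: use the driver's queue method when it
 * is defined, otherwise fall back to the shared request_queue. */
struct request_queue_sim *select_queue(struct blk_dev_sim *d, int dev)
{
    if (d->queue)
        return d->queue(dev);
    return &d->request_queue;
}

/* Example per-device queue method a multi-queue driver might install. */
static struct request_queue_sim per_device_q = { 42 };
static struct request_queue_sim *pick_per_device(int dev)
{
    (void)dev;               /* a real driver would consult dev and data */
    return &per_device_q;
}
```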
The ll_rw_block( )
function creates a block device request. It is invoked from several
places in the kernel to trigger the I/O data transfer of one or more
blocks. The function receives the following parameters:
The type of operation, rw
, whose value can be
READ
, WRITE
, or READA
. The last operation type differs from the other two in that
the function does not block when a request descriptor is not
available.
The number, nr
, of blocks to be transferred.
A bhs
array of nr
pointers to
buffer heads describing the blocks (all of them must have the same
block size and must refer to the same block device).
The buffer heads were previously initialized, so each specifies the block number, the block size, and the virtual device identifier (see Section 13.4.4 earlier in this chapter).
The function performs the following actions:
Checks that the block size b_size
matches the
block size of the virtual device b_dev
for each
buffer head in the bhs
array.
If the operation is WRITE
, checks that the block
device is not read-only.
For each buffer head in the bhs
array, performs
the following steps:
Sets the BH_Lock
flag of the buffer head. If it is
already set by some other kernel thread, it skips that buffer.
Increments the b_count
field of the buffer head.
Sets the b_end_io
field of the buffer head to
end_buffer_io_sync( )
; that is, to the function
that updates the buffer head when the data transfer is completed (see
Section 13.4.7 later in
this chapter.)
If the block must be written, tests the BH_Dirty
flag of the buffer head in one of the following ways:
If BH_Dirty
is reset, executes the
b_end_io
method (the end_buffer_io_sync( )
function) and continues with the next buffer because
there is no need to write this block.
If BH_Dirty
is set, resets it and places the
buffer head in the list of locked buffer heads.
As a general rule, the caller of ll_rw_block( )
must set the BH_Dirty
flag for each block that is
going to be written. Therefore, if ll_rw_block( )
finds that the flag is clear, then the block is already involved in a
write operation, so nothing has to be done.
If the block must be read, tests the BH_Uptodate
flag of the buffer head. If it is set, executes the
b_end_io
method (the end_buffer_io_sync( )
function) and continues with the next buffer. The kernel
never rereads a block from disk when its buffer contains valid
(up-to-date) data.
Invokes the submit_bh( )
function, which:
Determines the number of the first block’s sector on
the disk—that is, the value of the b_rsector
field—from the fields b_blocknr
(the logical
block number) and b_size
(the block size). This
field could be later modified by the block device driver if it
handles the Logical Volume Manager (LVM) or a RAID disk.
Sets the BH_Req
flag in b_state
to denote that the block has been requested.
Initializes the b_rdev
field from the
b_dev
field. As before, this field could be
modified later by the block device driver if it handles the LVM or a
RAID disk.
Invokes generic_make_request( ).
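The per-buffer filtering performed in the steps above (skip clean buffers on a write, skip up-to-date buffers on a read) can be condensed into a hypothetical predicate:

```c
/* must_submit() condenses the BH_Dirty/BH_Uptodate checks made by
 * ll_rw_block( ); it is an illustrative helper, not a kernel function.
 * Returns 1 if the block actually needs a data transfer. */
#define READ_OP  0
#define WRITE_OP 1

int must_submit(int rw, int dirty, int uptodate)
{
    if (rw == WRITE_OP)
        return dirty;          /* clean buffer: nothing to write */
    return !uptodate;          /* valid data: never reread from disk */
}
```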
The generic_make_request( )
function posts the
request to the low-level driver. It receives as parameters the buffer
head bh
and the type of operation
rw
(READ
,
WRITE
, or READA
), and performs
the following operations:
Checks that bh->b_rsector
does not exceed the
number of sectors of the block device. If it does, prints a kernel
error message, invokes the b_end_io
method of the
buffer head, and terminates.
Extracts the major number maj
of the block device
driver from bh->b_rdev
.
Gets the descriptor of the device driver request queue from the
low-level driver descriptor blk_dev[maj]
. To do
this, it invokes the blk_dev[maj].queue
method, if
it is defined (the driver makes use of several queues); otherwise, it
reads the blk_dev[maj].request_queue
field (the
driver uses a single queue).
Invokes the make_request_fn
method of the request
queue descriptor identified in the previous step.
In most cases, block device drivers implement the
make_request_fn
method with the __make_request( )
function. It receives as parameters the
queue descriptor, the buffer head bh
, and the type
of operation rw
, and performs the following
operations:
Checks whether the type of the operation rw
is
READA
; in this case, it sets the local flag
rw_ahead
to 1 and sets rw
to
READ
.
Invokes the create_bounce( )
function, which looks
at the PG_highmem
flag in
bh->b_page->flags
and determines whether the
bh->b_data
buffer is stored in high memory or
not (see Section 7.1.6). If the buffer is
in high memory, the low-level driver might not be able to handle it.
Therefore, create_bounce( )
temporarily allocates
a new buffer in low memory and a new buffer head pointing to it. The
new buffer head is almost identical to bh
, except
for the b_data
field, which points to the new
buffer, the b_private
field, which points to the
original buffer head bh
, and the
b_end_io
method, which points to a custom method
that releases the low-memory buffer when the I/O operation
terminates. If rw
is WRITE
,
then the low-memory buffer is filled with the high memory buffer
contents by create_bounce( )
; otherwise, if
rw
is READ
, the low memory
buffer is copied into the high-memory buffer by the
b_end_io
custom method.
Checks whether the request queue is empty:
If the request queue is empty, inserts a new request descriptor in it and schedules activation of the strategy routine of the low-level driver at a later time.
If the request queue is not empty, inserts a new request descriptor in it, trying to cluster it with other requests that are already queued. As we’ll see shortly, there is no need to schedule the activation of the strategy routine.
Let’s look closer at these two cases.
As we saw earlier, it’s expedient to delay activation of the strategy routine in order to increase the chances of clustering requests for adjacent blocks. The delay is accomplished through a technique known as device plugging and unplugging. As long as a block device driver is plugged, its strategy routine is not activated even if there are requests to be processed in the driver’s queues.
If the real device’s request queue is empty and the
device isn’t already plugged, _ _make_request( )
carries out a device plugging
. The plug_device_fn
method that performs this task is usually implemented by means of the
generic_plug_device( )
function. This function
sets the plugged
field of the request queue
descriptor to 1 and inserts the plug_tq
task
queue element (statically included in the request queue descriptor)
in the tq_disk
task queue (see Section 4.7.3.1) to cause the device’s
strategy routine to be activated later.
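The plugging and unplugging logic can be condensed into a small user-space sketch. All the sim_* names and the structure layout below are invented for illustration; the real request_queue_t descriptor, plug_tq element, and tq_disk task queue are far richer than this:

```c
#include <assert.h>

/* Minimal stand-in for a request queue descriptor. */
struct sim_queue {
    int plugged;          /* mirrors the plugged field */
    int pending;          /* number of queued requests */
    int serviced;         /* requests handled by the strategy routine */
};

/* Plays the role of generic_plug_device( ): plug only if the queue
 * is empty and the device is not already plugged. */
static void sim_plug(struct sim_queue *q)
{
    if (q->pending == 0 && !q->plugged)
        q->plugged = 1;   /* the real code also queues plug_tq on tq_disk */
}

static void sim_add_request(struct sim_queue *q)
{
    sim_plug(q);          /* plugging happens before the first insertion */
    q->pending++;
}

/* Plays the role of generic_unplug_device( ): clear the flag and run
 * the strategy routine, which drains the queue. */
static void sim_unplug(struct sim_queue *q)
{
    q->plugged = 0;
    while (q->pending > 0) {     /* request_fn services every request */
        q->pending--;
        q->serviced++;
    }
}
```

Note how no request is serviced between the plug and the unplug: this is exactly the window in which later requests get a chance to be clustered with the first one.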
The _ _make_request( )
function then allocates a
new request descriptor by invoking get_request( )
.
If no request descriptor is available, the function checks the value
of the rw_ahead
flag. If it is set, then the
function is handling a relatively unimportant read-ahead operation,
thus it invokes the b_end_io
method and terminates
without performing the I/O data transfer. Otherwise, the function
invokes the get_request_wait( )
function to force
the process to sleep until a request descriptor is freed.
Next, _ _make_request( )
initializes the new
request descriptor with the information read from the buffer head,
inserts it into the proper real device’s request
queue, and terminates.
How is the actual I/O data transfer started? The kernel checks
periodically whether the tq_disk
task queue
contains any elements. This occurs in a kernel thread such as
kswapd, or when the kernel must wait for some
resource related to block device drivers, such as buffers or request
descriptors. During the tq_disk
check, the kernel
removes any element in the queue and executes the corresponding
function.
Usually, the function stored in any plug_tq
task
queue points to the generic_unplug_device( )
function, which resets the plugged
field of the
request queue descriptor and invokes its
request_fn
method, thus executing the low-level
driver’s strategy routine. This activity is referred
to as unplugging
the device. As a result, the
requests included in the queues of the driver are processed, and the
corresponding I/O data transfers take place.
If the request queue is not empty, the driver was already plugged when the kernel inserted the first request in the queue. Therefore, there is no need to schedule the activation of the strategy routine again. Either the low-level driver is already unplugged, or it soon will be.
Notice that if _ _make_request( )
finds that the
queue is not empty, the low-level driver could be actively handling
the requests of the queue. Nevertheless, the function can safely
modify the queue because the low-level driver usually removes the
requests from the queue before processing them.
As a particular case, however, the function never touches the first
request in the queue when the head_active
field of
the request queue descriptor is set. This flag is set when the
low-level driver’s policy is always to process the
first request in the queue and not to remove the request from the
queue until the I/O data transfer completes.
The _ _make_request( )
function must either add a
new element in the queue or include the new block in an existing
request; the second case is known as block clustering
.
Block clustering requires that all the following conditions be satisfied:
The block to be inserted belongs to the same block device as the other blocks in the request and is adjacent to them: it either immediately precedes the first block in the request or immediately follows the last block in the request.
The blocks in the request have the same I/O operation type
(READ
or WRITE
) as the block to
be inserted.
The extended request does not exceed the allowed maximum number of
sectors. This value is stored in the max_sectors
table, which is indexed by the major and minor number of the block
device. The default value is 255 sectors.
The extended request does not exceed the allowed maximum number of segments (see Section 13.4.5.1 earlier in this chapter), which is usually 128.
No process is waiting for the completion of the request; that is, the
waiting
field of the request descriptor is
NULL
.
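The conditions above can be condensed into a single predicate. The following user-space sketch is hypothetical (the sim_* names are invented, and in the real kernel the test is spread across the elevator and merge methods); it assumes the candidate block belongs to the same device as the request and, pessimistically, that clustering it adds one segment:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a request descriptor: the request covers
 * sectors [sector, sector + nr_sectors). */
struct sim_request {
    int rw;              /* READ or WRITE */
    long sector;         /* first sector of the request */
    long nr_sectors;     /* total sectors in the request */
    int nr_segments;
    void *waiting;       /* non-NULL if a process waits on the request */
};

enum { SIM_READ, SIM_WRITE };
#define SIM_MAX_SECTORS  255   /* default max_sectors value */
#define SIM_MAX_SEGMENTS 128

/* Returns 1 if a block of blk_sectors sectors starting at blk_sector
 * may be clustered into req, at the front or at the back. */
static int sim_can_cluster(const struct sim_request *req, int rw,
                           long blk_sector, long blk_sectors)
{
    int adjacent =
        (blk_sector + blk_sectors == req->sector) ||        /* front */
        (req->sector + req->nr_sectors == blk_sector);      /* back  */
    if (!adjacent || rw != req->rw)
        return 0;
    if (req->nr_sectors + blk_sectors > SIM_MAX_SECTORS)
        return 0;
    if (req->nr_segments + 1 > SIM_MAX_SEGMENTS)  /* worst case: +1 segment */
        return 0;
    return req->waiting == NULL;
}
```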
When the _ _make_request( )
function must
determine how to insert the requested block in the queue, it uses a
program that is traditionally called an elevator algorithm
. The elevator algorithm basically
defines the ordering of the elements in the queue; usually, this
ordering is also followed by the low-level driver when it is handling
the requests.
Although each block device driver may define its own elevator algorithm, most block device drivers use either one of the following:
ELEVATOR_NOOP algorithm
New request descriptors are inserted at the end of the queue, so older requests precede younger ones. Notice that block clustering enlarges a request but doesn't make it younger. This algorithm favors fairness of servicing time among the various requests.
ELEVATOR_LINUS algorithm
The queue elements tend to be ordered by the position of the corresponding sectors on the block device. This algorithm tries to minimize the number and extent of seek operations on the physical device. However, it must also rely on an ageing mechanism to prevent requests near the end of the queue from remaining unhandled for long periods of time. When searching for a request that might include the block, the algorithm starts from the bottom of the queue and stops scanning as soon as it finds a very old request.
The elevator algorithm is implemented by three methods included in
the elevator
field of the request queue
descriptor:
elevator_merge_fn
Scans the queue, searching for a candidate request for block clustering. If block clustering is not possible, it returns the position where the new request should be inserted; otherwise, it returns the request that has to be enlarged in order to include the blocks in the new request.
elevator_merge_cleanup_fn
Invoked after a successful block clustering operation. It should
increase the age of all requests in the queue that follow the
enlarged request. (The method does nothing in the
ELEVATOR_NOOP
algorithm).
elevator_merge_req_fn
Invoked when the kernel merges two existing requests of the queue. It
should assign the age of the new enlarged request. (The method does
nothing in the ELEVATOR_NOOP
algorithm).
To add the new block at the back or the front of an existing request, the
_ _make_request( )
function uses the
back_merge_fn
and
front_merge_fn
methods of the request queue
descriptor, respectively. After a successful block clustering
operation, _ _make_request( )
also checks whether
the enlarged request can be merged with the previous or the next
request in the queue by invoking the
merge_requests_fn
method of the request queue
descriptor.
We have now reached the lowest level in Linux’s block device handling. This level is implemented by the strategy routine, which interacts with the physical block device to satisfy the requests collected in the queue.
As mentioned earlier, the strategy routine is usually started after inserting a new request in an empty request queue. Once activated, the low-level driver should handle all requests in the queue and terminate when the queue is empty.
A naïve implementation of the strategy routine could be the following: for each element in the queue, interact with the block device controller to service the request and wait until the data transfer completes. Then remove the serviced request from the queue and proceed with the next one.
Such an implementation is not very efficient. Even assuming that data
can be transferred using DMA, the strategy routine must suspend
itself while waiting for I/O completion; hence, an unrelated user
process would be heavily penalized. (The strategy routine does not
necessarily execute on behalf of the process that has requested the
I/O operation but at a random, later time, since it is activated by
means of the tq_disk
task queue.)
Therefore, many low-level drivers adopt the following strategy:
The strategy routine handles the first request in the queue and sets up the block device controller so that it raises an interrupt when the data transfer completes. Then the strategy routine terminates.
When the block device controller raises the interrupt, the interrupt handler activates a bottom half. The bottom half handler removes the request from the queue and reexecutes the strategy routine to service the next request in the queue.
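The two steps above form an interrupt-driven loop. The following user-space sketch simulates it (all sim_* names are invented; "programming the controller" and "raising an interrupt" are replaced by plain function calls):

```c
#include <assert.h>

struct sim_drv {
    int queue[8];        /* pending requests, identified by number */
    int head, tail;
    int in_flight;       /* request the "controller" is transferring */
    int completed;
};

/* Strategy routine: program the controller for the first request in
 * the queue and return immediately, without waiting for completion. */
static void sim_strategy(struct sim_drv *d)
{
    if (d->head == d->tail)
        return;                       /* queue empty: nothing to do */
    d->in_flight = d->queue[d->head]; /* start the data transfer */
}

/* Bottom half run when the controller raises its completion interrupt:
 * remove the serviced request and reexecute the strategy routine. */
static void sim_interrupt(struct sim_drv *d)
{
    d->head++;
    d->completed++;
    d->in_flight = 0;
    sim_strategy(d);
}
```

The key property is that sim_strategy( ) never blocks: between the call and the matching sim_interrupt( ), the CPU is free to run unrelated processes.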
Basically, low-level drivers can be further classified into the following:
Drivers that service each block in a request separately
Drivers that service several blocks in a request together
Drivers of the second type are much more complicated to design and implement than drivers of the first type. Indeed, although the sectors are adjacent on the physical block devices, the buffers in RAM are not necessarily consecutive. Therefore, any such driver may have to allocate a temporary area for the DMA data transfer, and then perform a memory-to-memory copy of the data between the temporary area and each buffer in the request’s list.
Since the requests handled by both types of drivers consist of adjacent blocks, disk performance is enhanced in either case, because fewer seek commands are issued. However, drivers of the second type do not further reduce the number of seek commands, so transferring several blocks from disk together is not as effective at boosting disk performance as one might expect.
The kernel doesn’t offer any support for the second type of drivers: they must handle the request queues and the buffer head lists on their own. The choice to leave the job up to the driver is not capricious or lazy. Each physical block device is inherently different from all others (for example, a floppy driver groups blocks in disk tracks and transfers a whole track in a single I/O operation), so making general assumptions on how to service each clustered request makes very little sense.
However, the kernel offers a limited degree of support for the low-level drivers in the first class. We’ll spend a little more time describing such drivers.
A typical strategy routine should perform the following actions:
Get the current request from a request queue. If all request queues are empty, terminate the routine.
Check that the current request has consistent information. In
particular, compare the major number of the block device with the
value stored in the rq_rdev
field of the request
descriptor. Moreover, check that the first buffer head in the list is
locked (the BH_Lock
flag should have been set by
ll_rw_block( )
).
Program the block device controller for the data transfer of the
first block. The data transfer direction can be found in the
cmd
field of the request descriptor and the
address of the buffer in the buffer
field, while
the initial sector number and the number of sectors to be transferred
are stored in the sector
and
current_nr_sectors
fields, respectively.[96] Also, set up the block device
controller so that an interrupt is raised when the DMA data transfer
completes.
If the routine is handling a block device file for which
ll_rw_block( )
accomplishes block clustering,
increment the sector
field and decrement the
nr_sectors
field of the request descriptor to keep
track of the blocks to be transferred.
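The bookkeeping of the last step amounts to simple sector arithmetic. A minimal sketch, with invented sim_* names standing in for the sector, nr_sectors, and current_nr_sectors fields of the request descriptor:

```c
#include <assert.h>

struct sim_req {
    long sector;              /* next sector to transfer */
    long nr_sectors;          /* sectors left in the whole request */
    long current_nr_sectors;  /* sectors in the block just transferred */
};

/* Advance the request past the block just transferred; end_request
 * later reloads current_nr_sectors from the next buffer head. */
static void sim_advance(struct sim_req *r)
{
    r->sector     += r->current_nr_sectors;
    r->nr_sectors -= r->current_nr_sectors;
}
```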
The interrupt handler associated with the termination of the DMA data
transfer for the block device should invoke (either directly or via a
bottom half) the end_request
function (or a custom
function of the block device driver that does the same things). The
function, which receives as parameters the value 1 if the data
transfer succeeds or the value 0 if an error occurs, performs the
following operations:
If an error occurred (parameter value is 0), updates the
sector
and nr_sectors
fields so
as to skip the remaining sectors of the block. In Step 3a, the buffer
content is also marked as not up-to-date.
Removes the buffer head of the transferred block from the request’s list.
Invokes the b_end_io
method of the buffer head.
When the ll_rw_block( )
function allocates the
buffer head, it loads this field with the address of the
end_buffer_io_sync( )
function, which essentially
performs two operations:
Sets the BH_Uptodate
flag of the buffer head to 1
or 0, according to the success or failure of the data transfer
Clears the BH_Lock
, BH_Wait_IO
,
and BH_launder
flags of the buffer head and wakes
up all processes sleeping in the wait queue to which the
b_wait
field of the buffer head points
The b_end_io
field could also point to other
functions. For instance, if the create_bounce( )
function created a temporary buffer in low memory, the
b_end_io
field points to a suitable function that
updates the original buffer in high memory and then invokes the
b_end_io
method of the original buffer head.
If there is another buffer head on the request’s
list, sets the current_nr_sectors
field of the
request descriptor to the number of sectors of the new block.
Sets the buffer
field with the address of the new
buffer (from the b_data
field of the new buffer
head).
Otherwise, if the request’s list is empty, all blocks have been processed. Therefore, it performs the following operations:
Removes the request descriptor from the request queue
Wakes up any process waiting for the request to complete
(waiting
field in the request descriptor)
Sets the rq_status
field of the request to
RQ_INACTIVE
Puts the request descriptor in the list of free requests
After invoking end_request
, the low-level driver
checks whether the request queue is empty. If it is not, the strategy
routine is executed again. Notice that end_request
actually performs two nested iterations: the outer one on the
elements of the request queue and the inner one on the elements in
the buffer head list of each request. The strategy routine is thus
invoked once for each block in the request queue.
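The inner iteration of end_request can be sketched as follows. This is a hypothetical user-space reduction (sim_* names invented): each call retires one buffer head, and only when the request's list empties is the request itself retired:

```c
#include <assert.h>

struct sim_req2 {
    int nr_blocks;   /* buffer heads still linked to the request */
    int active;      /* 1 while the request sits in the queue */
};

/* One call per transferred block; returns 1 when the whole request
 * has completed. */
static int sim_end_request(struct sim_req2 *r, int uptodate)
{
    (void)uptodate;  /* on failure, the real code skips the block's
                        sectors and marks the buffer not up-to-date */
    r->nr_blocks--;  /* remove the buffer head, invoke its b_end_io */
    if (r->nr_blocks == 0) {
        r->active = 0;   /* dequeue, wake waiters, mark RQ_INACTIVE,
                            free the request descriptor */
        return 1;
    }
    return 0;
}
```

The outer iteration is supplied by the driver, which reexecutes the strategy routine while the request queue remains non-empty.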
We’ll discuss in the forthcoming chapters how the kernel uses the block device drivers. We’ll see that there are a number of cases in which the kernel activates disk I/O data transfers. However, let’s describe here the two fundamental kinds of I/O data transfer for block devices:
Block I/O operations
Here the I/O operation transfers a single block of data, so the transferred data can be kept in a single RAM buffer. The disk address consists of a device number and a block number. The buffer is associated with a specific disk block, which is identified by the major and minor numbers of the block device and by the logical block number.
Page I/O operations
Here the I/O operation transfers as many blocks of data as needed to fill a single page frame (the exact number depends both on the disk block size and on the page frame size). If the size of a page frame is a multiple of the block size, several disk blocks are transferred in a single I/O operation. Each page frame contains data belonging to a file. Since this data is not necessarily stored in adjacent disk blocks, it is identified by the file's inode and by an offset within the file.
Block I/O operations are most often used when the kernel reads or writes single blocks in a filesystem (for example, a block containing an inode or a superblock). Conversely, page I/O operations are used mainly for reading and writing files (both regular files and block device files), for accessing files through the memory mapping, and for swapping.
Both kinds of I/O operations rely on the same functions to access a block device, but the kernel uses different algorithms and buffering techniques with them.
The bread( )
function reads a single block from a block device and stores it in a
buffer. It receives as parameters the device identifier, the block
number, and the block size, and returns a pointer to the buffer head
of the buffer containing the block.
The function performs the following operations:
Invokes the getblk( )
function to search for the
block in a software cache called the buffer cache (see Section 14.2.4). If the block is not included in the cache,
getblk( )
allocates a new buffer for it.
Invokes mark_page_accessed( )
on the buffer page
containing the data (see Section 16.7.2).
If the buffer already contains valid data, it terminates.
Invokes ll_rw_block( )
to start the
READ
operation (see Section 13.4.6 earlier in this chapter).
Waits until the data transfer completes. This is done by invoking a
function named wait_on_buffer( )
, which inserts
the current
process in the
b_wait
wait queue and suspends the process until
the buffer is unlocked.
Checks whether the buffer contains valid data. If so, it returns the
address of the buffer head; otherwise, it returns a
NULL
pointer.
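The control flow of bread( ) can be illustrated with a toy cache. Everything below is a hypothetical user-space reduction: sim_bread( ) stands in for the getblk( )/ll_rw_block( )/wait_on_buffer( ) sequence, and sim_disk_read( ) replaces the physical medium:

```c
#include <assert.h>

struct sim_buf {
    int uptodate;    /* plays the role of BH_Uptodate */
    int data;
};

#define SIM_CACHE_SIZE 16
static struct sim_buf sim_cache[SIM_CACHE_SIZE];  /* the "buffer cache" */

static int sim_disk_read(int block)
{
    return block * 2;        /* fake on-disk contents */
}

static struct sim_buf *sim_bread(int block)
{
    /* getblk( ): look the block up in the cache. */
    struct sim_buf *bh = &sim_cache[block % SIM_CACHE_SIZE];
    if (bh->uptodate)
        return bh;           /* valid data already cached: no I/O */
    /* ll_rw_block( ) + wait_on_buffer( ): do the READ and wait. */
    bh->data = sim_disk_read(block);
    bh->uptodate = 1;
    return bh;               /* the real bread( ) returns NULL on error */
}
```

The second lookup of the same block returns the cached buffer without touching the "disk," which is the whole point of routing single-block reads through the buffer cache.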
No function exists to directly write a block to disk. Declaring a buffer dirty is sufficient to force its flushing to disk at some later time. In fact, write operations are not considered critical for system performance, so they are deferred whenever possible (see Section 14.2.4).
Block devices transfer information one block at a time, while process address spaces (or to be more precise, memory regions allocated to the process) are defined as sets of pages. This mismatch can be hidden to some extent by using page I/O operations. They may be activated in the following cases:
A process issues a read( )
or write( )
system call on a file (see Chapter 15).
A process reads a location of a page that maps a file in memory (see Chapter 15).
The kernel flushes some dirty pages related to a file memory mapping to disk (see Section 15.2.5).
When swapping in or swapping out, the kernel loads from or saves to disk the contents of whole page frames (see Chapter 16).
Page I/O operations can be activated by several kernel functions. In
this section, we’ll present the brw_page( )
function used to read or write swap pages (see Chapter 16). Other functions that start page I/O
operations are discussed in Chapter 15.
The brw_page( )
function
receives the following parameters:
rw
Type of I/O operation (READ, WRITE, or READA)
page
Address of a page descriptor
dev
Block device number (major and minor numbers)
b
Array of logical block numbers
size
Block size
The page descriptor refers to the page involved in the page I/O
operation. It must already be locked (PG_locked
flag on) before invoking brw_page( )
so that no
other kernel control path can access it. The page is considered to be
split into 4096/size buffers; the ith buffer in the page is
associated with the block b[i] of device dev.
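The split is plain integer arithmetic, sketched below with invented sim_* helpers (assuming the 4 KB page frame size the text uses):

```c
#include <assert.h>

/* Number of buffers a locked page splits into for a given block size. */
static int sim_buffers_per_page(int size)
{
    return 4096 / size;
}

/* Byte offset of the ith buffer within the page; the ith buffer is
 * associated with block b[i] of the device. */
static int sim_buffer_offset(int i, int size)
{
    return i * size;
}
```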
The function performs the following operations:
Checks the page->buffers
field; if it is
NULL
, invokes create_empty_buffers( )
to allocate temporary buffer heads for all buffers
included in the page (such buffer heads are called asynchronous; they
are discussed in Section 14.2.1). The
address of the buffer head for the first buffer in the page is stored
in the page->buffers
field. The
b_this_page
field of each buffer head points to
the buffer head of the next buffer in the page.
Conversely, if the page->buffers
field is not
NULL
, the kernel does not need to allocate
temporary buffer heads. In fact, in this case, the page stores some
buffers already included in the buffer cache, presumably because some
of them were previously involved in block I/O operations (see
Section 14.2.2 for further details).
For each buffer head in the page, performs the following substeps:
Sets the BH_Lock
(locks the buffer for the I/O
data transfer) and the BH_Mapped
(the buffer maps
a file on disk) flags of the buffer head.
Stores in the b_blocknr
field the value of the
corresponding element of the array b
.
Since it is an asynchronous buffer head, sets the
BH_Async
flag, and in the
b_end_io
field, stores a pointer to
end_buffer_io_async( )
(described next).
For each buffer head in the page, invokes submit_bh( )
to request the data transfer of the buffer (see Section 13.4.6 earlier in this chapter).
The submit_bh( )
function activates the device
driver of the block device being accessed. As described in the
earlier section Section 13.4.7, the
device driver performs the actual data transfer and then invokes the
b_end_io
method of all asynchronous buffer heads
that have been transferred. The b_end_io
field
points to the end_buffer_io_async( )
function,
which performs the following operations:
Sets the BH_Uptodate
flag of the asynchronous
buffer head according to the result of the I/O operation.
If the BH_Uptodate
flag is off, sets the
PG_error
flag of the page descriptor because an
error occurred while transferring the block. The function gets the
page descriptor address from the b_page
field of
the buffer head.
Gets the page_update_lock
spin lock.
Clears both the BH_Async
and the
BH_Lock
flags of the buffer head, and awakens each
process waiting for the buffer.
If any of the buffer heads in the page are still locked (i.e., the
I/O data transfer is not yet terminated), releases the
page_update_lock
spin lock and returns.
Otherwise, releases the page_update_lock
spin lock
and checks the PG_error
flag of the page
descriptor. If it is cleared, then all data transfers on the page
have successfully completed, so the function sets the
PG_uptodate
flag of the page descriptor.
Unlocks the page, clears the PG_locked
flag, and
wakes any process sleeping on page->wait
wait
queue.
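The operations of end_buffer_io_async( ) can be condensed into a user-space sketch. The sim_* names are invented, the buffer scan replaces the walk along b_this_page, and the page_update_lock serialization is omitted since this simulation is single-threaded:

```c
#include <assert.h>

struct sim_page {
    int locked[8];   /* BH_Lock of each buffer in the page */
    int nbufs;
    int error;       /* PG_error */
    int uptodate;    /* PG_uptodate */
    int page_locked; /* PG_locked */
};

/* Completion of buffer i; ok reflects the result of its transfer.
 * The page completes only when its last buffer is unlocked. */
static void sim_end_buffer_io_async(struct sim_page *p, int i, int ok)
{
    int j;
    if (!ok)
        p->error = 1;        /* remember the failure at page level */
    p->locked[i] = 0;        /* clear BH_Lock, wake buffer waiters */
    for (j = 0; j < p->nbufs; j++)
        if (p->locked[j])
            return;          /* some transfer still in progress */
    if (!p->error)
        p->uptodate = 1;     /* every transfer succeeded */
    p->page_locked = 0;      /* unlock the page, wake page waiters */
}
```

A single failed buffer is enough to leave PG_uptodate off, even though the page still gets unlocked so that waiters can observe the error.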
Notice that once the page I/O operation terminates, the temporary
buffer heads allocated by create_empty_buffers( )
are not automatically released. As we shall see in Chapter 16, the temporary buffer heads are released only
when the kernel tries to reclaim some memory.
[96] Recall that current_nr_sectors
contains the
number of sectors in the first block of the request, while
nr_sectors
contains the total number of sectors in
the request.