Kernel developers are expected to identify and solve the synchronization problems raised by the interleaving of kernel control paths. However, avoiding race conditions is a hard task because it requires a clear understanding of how the various components of the kernel interact. To give a feeling of what’s really inside the kernel code, let’s mention a few typical usages of the synchronization primitives defined in this chapter.
Reference counters are widely used inside the kernel to avoid race
conditions due to the concurrent allocation and releasing of a
resource. A reference counter
is just an atomic_t
counter associated with a specific resource like a memory page, a
module, or a file. The counter is atomically incremented when a
kernel control path starts using the resource, and it is decremented
when a kernel control path finishes using the resource. When the
reference counter becomes zero, the resource is not being used, and
it can be released if necessary.
In earlier Linux kernel versions, a global kernel lock (also known as big kernel lock, or BKL) was widely used. In Version 2.0, this lock was a relatively crude spin lock, ensuring that only one processor at a time could run in Kernel Mode. The 2.2 kernel was considerably more flexible and no longer relied on a single spin lock; rather, a large number of kernel data structures were protected by specialized spin locks. The global kernel lock, on the other hand, was still present because splitting a big lock into several smaller locks is not trivial — both deadlocks and race conditions must be carefully avoided. Several unrelated parts of the kernel code were still serialized by the global kernel lock.
Linux kernel Version 2.4 further reduces the role of the global kernel lock. In the current stable version, the global kernel lock is mostly used to serialize access to the Virtual File System and to avoid race conditions when loading and unloading kernel modules. The main progress with respect to the earlier stable version is that networking transfers and file access (such as reading or writing a regular file) are no longer serialized by the global kernel lock.
The global kernel lock is a spin lock named
kernel_flag
. Every process descriptor includes a
lock_depth
field, which allows the same process to
acquire the global kernel lock several times. Therefore, two
consecutive requests for it will not hang the processor (as they
would for normal spin locks). If the process does not hold the lock,
the field has the value -1; otherwise, the field value plus 1
specifies how many times the lock has been acquired. The
lock_depth
field is crucial for interrupt
handlers, exception handlers, and bottom halves. Without it, any
asynchronous function that tries to get the global kernel lock could
generate a deadlock if the current process already owns the lock.
The lock_kernel( )
and unlock_kernel( )
functions are used to get and release the global kernel
lock. The former function is equivalent to:
if (++current->lock_depth == 0) spin_lock(&kernel_flag);
while the latter is equivalent to:
if (--current->lock_depth < 0) spin_unlock(&kernel_flag);
Notice that the if
statements of the
lock_kernel( )
and unlock_kernel( )
functions need not be executed atomically because
lock_depth
is not a global variable — each
CPU addresses a field of its own current process descriptor. Local
interrupts inside the if
statements do not induce
race conditions either: even if the interrupting kernel control path
invokes lock_kernel( )
, it must release the global kernel
lock before terminating, thus restoring lock_depth to its previous value.
Each
memory descriptor of type
mm_struct
includes its own semaphore in the
mmap_sem
field (see Section 8.2). The semaphore protects the descriptor
against race conditions that could arise because a memory descriptor
can be shared among several lightweight processes.
For instance, let’s suppose that the kernel must
create or extend a memory region for some process; to do this, it
invokes the do_mmap( )
function, which allocates a
new vm_area_struct
data structure. In doing so,
the current process could be suspended if no free memory is
available, and another process sharing the same memory descriptor
could run. Without the semaphore, any operation of the second process
that requires access to the memory descriptor (for instance, a Page
Fault due to a Copy on Write) could lead to severe data corruption.
The semaphore is implemented as a read/write semaphore because some kernel functions, such as the Page Fault exception handler (see Section 8.4), need only to scan the memory descriptors.
The list of slab cache descriptors (see
Section 7.2.2) is protected by the
cache_chain_sem
semaphore, which grants an
exclusive right to access and modify the list.
A race condition is possible when kmem_cache_create( )
adds a new element to the list while
kmem_cache_shrink( )
and kmem_cache_reap( )
sequentially scan the list. However, these functions are
never invoked while handling an interrupt, and they can never block
while accessing the list. Since the kernel is nonpreemptive, these
functions cannot overlap on a uniprocessor system. However, this
semaphore plays an active role in multiprocessor systems.
As we shall
see in Chapter 12, Linux stores the information on
a disk file in a memory object called an inode.
The corresponding data structure includes its own semaphore in the
i_sem
field.
A huge number of race conditions can occur during filesystem handling. Indeed, each file on disk is a resource held in common for all users, since all processes may (potentially) access the file content, change its name or location, destroy or duplicate it, and so on. For example, let’s suppose that a process lists the files contained in some directory. Each disk operation is potentially blocking, and therefore even in uniprocessor systems, other processes could access the same directory and modify its content while the first process is in the middle of the listing operation. Or, again, two different processes could modify the same directory at the same time. All these race conditions are avoided by protecting the directory file with the inode semaphore.
Whenever a program uses two or more semaphores, the potential for
deadlock is present because two different paths could end up waiting
for each other to release a semaphore. Generally speaking, Linux has
few problems with deadlocks on semaphore requests, since each kernel
control path usually needs to acquire just one semaphore at a time.
However, in a couple of cases, the kernel must get two semaphore
locks. Inode semaphores are prone to this scenario; for instance,
this occurs in the service routines of the rmdir( )
and the rename( )
system calls (notice
that in both cases two different inodes are involved in the
operation, so both semaphores must be taken). To avoid such
deadlocks, semaphore requests are performed in address order: the
semaphore request whose semaphore
data structure
is located at the lowest address is issued first.