When a file can be accessed by more than one process, a synchronization problem occurs. What happens if two processes try to write in the same file location? Or again, what happens if a process reads from a file location while another process is writing into it?
In traditional Unix systems, concurrent accesses to the same file location produce unpredictable results. However, Unix systems provide a mechanism that allows the processes to lock a file region so that concurrent accesses may be easily avoided.
The POSIX standard requires a file-locking mechanism based on the fcntl( ) system call. It is possible to lock an arbitrary region of a file (even a single byte) or to lock the whole file (including data appended in the future). Since a process can choose to lock just a part of a file, it can also hold multiple locks on different parts of the file.
This kind of lock does not keep out another process that is ignorant of locking. Like a critical region in code, the lock is considered “advisory” because it works only if other processes cooperate by checking for the existence of a lock before accessing the file. Therefore, POSIX’s locks are known as advisory locks.
Traditional BSD variants implement advisory locking through the flock( ) system call. This call does not allow a process to lock a file region, just the whole file. Traditional System V variants provide the lockf( ) function, which is just an interface to fcntl( ).
More importantly, System V Release 3 introduced mandatory locking: the kernel checks that every invocation of the open( ), read( ), and write( ) system calls does not violate a mandatory lock on the file being accessed. Therefore, mandatory locks are enforced even between noncooperative processes.[88] A file is marked as a candidate for mandatory locking by setting its set-group bit (SGID) and clearing the group-execute permission bit. Since the set-group bit makes no sense when the group-execute bit is off, the kernel interprets that combination as a hint to use mandatory locks instead of advisory ones.
Whether processes use advisory or mandatory locks, they can use both shared read locks and exclusive write locks. Any number of processes may have read locks on some file region, but only one process can have a write lock on it at the same time. Moreover, it is not possible to get a write lock when another process owns a read lock for the same file region, and vice versa (see Table 12-18).
Table 12-18. Whether a lock is granted

Current locks | Read lock requested? | Write lock requested?
---|---|---
No lock | Yes | Yes
Read lock | Yes | No
Write lock | No | No
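The compatibility rule in Table 12-18 fits in a few lines of C. This is a minimal sketch of the table only; the enum and function names are made up for the illustration and do not appear in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_kind { NO_LOCK, READ_LOCK, WRITE_LOCK };

/* Mirrors Table 12-18: a request on an unlocked region is always granted;
 * on a read-locked region only another read lock is granted; on a
 * write-locked region nothing is granted. */
bool lock_granted(enum lock_kind current, enum lock_kind requested)
{
    /* A request is for a read lock or a write lock, never for "no lock". */
    assert(requested == READ_LOCK || requested == WRITE_LOCK);

    if (current == NO_LOCK)
        return true;
    return current == READ_LOCK && requested == READ_LOCK;
}
```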
Linux supports all fashions of file locking: advisory and mandatory locks, as well as the fcntl( ), flock( ), and lockf( ) interfaces. However, lockf( ) is just a library wrapper routine around fcntl( ), and therefore is not discussed here.
fcntl( )’s mandatory locks can be enabled and disabled on a per-filesystem basis using the MS_MANDLOCK flag (the mand option) of the mount( ) system call. The default is to switch off mandatory locking. In this case, fcntl( ) creates advisory locks. When the flag is set, fcntl( ) produces mandatory locks if the file has the set-group bit on and the group-execute bit off; it produces advisory locks otherwise.
In earlier Linux versions, the flock( ) system call produced only advisory locks, regardless of the MS_MANDLOCK mount flag. This is the expected behavior of the system call in any Unix-like operating system. In Linux 2.4, however, a special kind of flock( ) mandatory lock was added to allow proper support for some proprietary network filesystem implementations. It is the so-called share-mode mandatory lock; when set, no other process may open the file in a way that would conflict with the access mode of the lock. Use of this feature for native Unix applications is discouraged, because the resulting source code will be nonportable.
Another kind of flock( )-based mandatory lock, called a lease, was introduced in Linux 2.4. When a process tries to open a file protected by a lease, it is blocked as usual. However, the process that owns the lock receives a signal. Once informed, it should first update the file so that its content is consistent, and then release the lock. If the owner does not do this within a well-defined time interval (tunable by writing a number of seconds into /proc/sys/fs/lease-break-time, usually 45 seconds), the lease is automatically removed by the kernel and the blocked process is allowed to continue.
Besides the checks in the read( ) and write( ) system calls, the kernel takes into consideration the existence of mandatory locks when servicing all system calls that could modify the contents of a file. For instance, an open( ) system call with the O_TRUNC flag set fails if any mandatory lock exists for the file.
A lock produced by fcntl( ) is of type FL_POSIX, while a lock produced by flock( ) is of type FL_FLOCK, FL_MAND (for share-mode locks), or FL_LEASE (for leases). The types of locks produced by fcntl( ) may safely coexist with those produced by flock( ), but neither one has any effect on the other. Therefore, a file locked through fcntl( ) does not appear locked to flock( ), and vice versa.
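The independence of the two lock families can be observed from user space. The sketch below assumes a Linux system and a scratch file on a local filesystem (network filesystems may behave differently); the function name and the file name template are hypothetical. The parent holds an exclusive flock( ) lock while a child process requests both kinds of lock on the same file.

```c
#define _GNU_SOURCE           /* exposes flock() and mkstemp() in strict modes */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns a bit mask describing what the child was granted:
 * bit 0 = whole-file fcntl() write lock, bit 1 = exclusive flock() lock.
 * Returns -1 if the experiment could not be set up. */
int probe_lock_families(void)
{
    char path[] = "/tmp/lockfam_XXXXXX";     /* hypothetical scratch file */
    int fd = mkstemp(path);
    assert(fd >= 0);                         /* the demo needs its scratch file */
    if (flock(fd, LOCK_EX) != 0) {           /* parent: exclusive FL_FLOCK lock */
        close(fd);
        unlink(path);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        int granted = 0;
        int cfd = open(path, O_RDWR);        /* a new file object in the child */
        if (cfd < 0)
            _exit(100);
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        if (fcntl(cfd, F_SETLK, &fl) == 0)   /* l_start = l_len = 0: whole file */
            granted |= 1;
        if (flock(cfd, LOCK_EX | LOCK_NB) == 0)
            granted |= 2;
        _exit(granted);
    }

    int status = 0;
    int ok = pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status);
    close(fd);
    unlink(path);
    if (!ok || WEXITSTATUS(status) == 100)
        return -1;
    return WEXITSTATUS(status);
}
```

Under those assumptions the function should return 1: the child's fcntl( ) request is granted although the file is locked through flock( ), while its own flock( ) request is refused.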
The following section describes the main data structure used by the kernel to handle file locks. The next two sections examine the differences between the two most common lock types: FL_POSIX and FL_FLOCK.
The file_lock data structure represents file locks; its fields are shown in Table 12-19. All file_lock data structures are included in a doubly linked list. The address of the first element is stored in file_lock_list, while the fields fl_nextlink and fl_prevlink store the addresses of the adjacent elements in the list.
Table 12-19. The fields of the file_lock data structure

Type | Field | Description
---|---|---
 | | Next element in inode list
 | | Pointers for global list
 | | Pointers for process list
 | | Owner’s
 | | PID of the process owner
 | | Wait queue of blocked processes
 | | Pointer to file object
 | | Lock flags
 | | Lock type
 | | Starting offset of locked region
 | | Ending offset of locked region
 | | Function to call when lock is unblocked
 | | Function to call when lock is inserted
 | | Function to call when lock is removed
 | | Used for lease break notifications
 | | Filesystem-specific information
All file_lock structures that refer to the same file on disk are collected in a singly linked list, whose first element is pointed to by the i_flock field of the inode object. The fl_next field of the file_lock structure specifies the next element in the list.
When a process tries to get an advisory or mandatory lock, it may be suspended until the previously allocated lock on the same file region is released. All processes sleeping on some lock are inserted into a wait queue, whose head is stored in the fl_wait field of the file_lock structure. Moreover, all processes sleeping on any file locks are inserted into a circular doubly linked list, whose head (first dummy element) is stored in the blocked_list variable; the fl_block field of the file_lock data structure stores the pointer to adjacent elements in the list.
An FL_FLOCK lock is always associated with a file object and is thus maintained by a particular process (or clone processes sharing the same opened file). When a lock is requested and granted, the kernel replaces any other lock that the process is holding on the same file object. This happens only when a process wants to change an already owned read lock into a write one, or vice versa. Moreover, when a file object is being freed by the fput( ) function, all FL_FLOCK locks that refer to the file object are destroyed. However, there could be other FL_FLOCK read locks set by other processes for the same file (inode), and they still remain active.
The flock( ) system call acts on two parameters: the fd file descriptor of the file to be acted upon and a cmd parameter that specifies the lock operation. A cmd parameter of LOCK_SH requests a shared lock for reading, LOCK_EX requests an exclusive lock for writing, and LOCK_UN releases the lock. If the LOCK_NB value is ORed to the LOCK_SH or LOCK_EX operation, the system call does not block; in other words, if the lock cannot be immediately obtained, the system call returns an error code. Note that it is not possible to specify a region inside the file; the lock always applies to the whole file.
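A short user-space walk-through of these operations may help. The sketch below assumes Linux and a scratch file under /tmp (the name template and the function name are hypothetical). It opens the same file twice, so that two distinct file objects compete for the lock inside a single process.

```c
#define _GNU_SOURCE           /* exposes flock() and mkstemp() in strict modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Returns 0 if every step behaves as described in the text, or the number
 * of the first step that did not. */
int flock_demo(void)
{
    char path[] = "/tmp/flock_demo_XXXXXX";  /* hypothetical scratch file */
    int fd1 = mkstemp(path);                 /* first file object */
    int fd2 = open(path, O_RDWR);            /* second, independent file object */
    assert(fd1 >= 0 && fd2 >= 0);            /* the demo needs both descriptors */
    unlink(path);                            /* file persists until both close */

    int step = 0;
    if (flock(fd1, LOCK_EX) != 0)
        step = 1;                            /* 1: exclusive lock via fd1 */
    else if (flock(fd2, LOCK_SH | LOCK_NB) == 0 || errno != EWOULDBLOCK)
        step = 2;                            /* 2: nonblocking request refused */
    else if (flock(fd1, LOCK_SH) != 0)
        step = 3;                            /* 3: fd1 changes its own lock type */
    else if (flock(fd2, LOCK_SH | LOCK_NB) != 0)
        step = 4;                            /* 4: shared locks coexist */
    else if (flock(fd1, LOCK_UN) != 0 || flock(fd2, LOCK_UN) != 0)
        step = 5;                            /* 5: both locks released */

    close(fd1);
    close(fd2);
    return step;
}
```

The conflict in step 2 arises even though both descriptors belong to the same process, because each open( ) creates its own file object.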
When the sys_flock( ) service routine is invoked, it performs the following steps:
1. Checks whether fd is a valid file descriptor; if not, returns an error code. Gets the address of the corresponding file object.
2. If the process has to acquire an advisory lock, checks that the file is open for reading or writing (or both); if not, returns an error code.
3. Invokes flock_lock_file( ), passing as parameters the file object pointer filp, the type type of lock operation required, and a flag wait. This last parameter is set if the system call should block (LOCK_NB clear) and cleared otherwise (LOCK_NB set). This function performs, in turn, the following actions:
   a. If the lock must be acquired, gets a new file_lock object and fills it with the appropriate lock operation.
   b. Searches the list that filp->f_dentry->d_inode->i_flock points to. If an FL_FLOCK lock for the same file object is found and an unlock operation is required, removes the file_lock element from the inode list and the global list, wakes up all processes sleeping in the lock’s wait queue, frees the file_lock structure, and returns.
   c. Otherwise, searches the inode list again to verify that no existing FL_FLOCK lock conflicts with the requested one. There must be no FL_FLOCK write lock in the inode list, and moreover, there must be no FL_FLOCK lock at all if the process is requesting a write lock. However, a process may want to change the type of lock it already owns; this is done by issuing a second flock( ) system call. Therefore, the kernel always allows the process to change locks that refer to the same file object. If a conflicting lock is found and the LOCK_NB flag was specified, the function returns an error code; otherwise, it inserts the current process in the circular list of blocked processes and suspends it.
   d. If no incompatibility exists, inserts the file_lock structure into the global lock list and the inode list, and then returns 0 (success).
4. Returns the return code of flock_lock_file( ).
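The list handling performed by flock_lock_file( ) can be modeled in user space with a toy inode list. This is a deliberately simplified sketch, not kernel code: the toy_ types and names are invented for the illustration, the global list and the wait queues are left out, and a conflicting request simply returns -1 instead of sleeping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { T_READ, T_WRITE, T_UNLOCK };

struct toy_lock {               /* stand-in for file_lock */
    struct toy_lock *next;      /* plays the role of fl_next */
    const void *file;           /* identity of the owning file object */
    int type;                   /* T_READ or T_WRITE */
};

struct toy_inode { struct toy_lock *i_flock; };

/* Returns 0 on success, -1 on a conflict or allocation failure. */
int toy_flock(struct toy_inode *inode, const void *file, int type)
{
    assert(inode != NULL);
    struct toy_lock **pp, *own = NULL;

    /* Look for a lock already held through this file object. */
    for (pp = &inode->i_flock; *pp != NULL; pp = &(*pp)->next)
        if ((*pp)->file == file) {
            own = *pp;
            break;
        }

    if (type == T_UNLOCK) {     /* unlock: unlink and free the element */
        if (own != NULL) {
            *pp = own->next;
            free(own);
        }
        return 0;
    }

    /* Conflict check; locks of the same file object never conflict. */
    for (struct toy_lock *l = inode->i_flock; l != NULL; l = l->next)
        if (l->file != file && (l->type == T_WRITE || type == T_WRITE))
            return -1;

    if (own != NULL) {          /* the owner changes the type of its lock */
        own->type = type;
        return 0;
    }

    struct toy_lock *nl = malloc(sizeof *nl);
    if (nl == NULL)
        return -1;
    nl->file = file;
    nl->type = type;
    nl->next = inode->i_flock;  /* insert at the head of the inode list */
    inode->i_flock = nl;
    return 0;
}
```

Any two distinct addresses can serve as file object identities when trying the model out.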
An FL_POSIX lock is always associated with a process and with an inode; the lock is automatically released either when the process dies or when any of the process’s file descriptors referring to that file is closed (even if the process opened the same file twice or duplicated a file descriptor). Moreover, FL_POSIX locks are never inherited by the child across a fork( ).
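The release-on-close rule is easy to overlook, so here is a small experiment, sketched under the assumption of a Linux system with a writable /tmp; the helper names and the file name template are hypothetical. A second process checks with F_GETLK whether the file appears locked, before and after the lock owner closes a descriptor that was not even used to take the lock.

```c
#define _GNU_SOURCE           /* exposes mkstemp() in strict modes */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that asks whether a whole-file write lock on `path` would
 * conflict with a lock held by another process.
 * Returns 1 if so, 0 if not, -1 on error. */
int locked_for_others(const char *path)
{
    assert(path != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);
        if (fd < 0 || fcntl(fd, F_GETLK, &probe) != 0)
            _exit(2);
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
        WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}

/* Locks the file through fd1, then closes the unrelated descriptor fd2.
 * Returns 2 * (locked before the close) + (locked after it), or -1. */
int close_drops_posix_lock(void)
{
    char path[] = "/tmp/posix_close_XXXXXX"; /* hypothetical scratch file */
    int fd1 = mkstemp(path);
    int fd2 = open(path, O_RDONLY);          /* same file, second descriptor */
    assert(fd1 >= 0 && fd2 >= 0);

    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int result = -1;
    if (fcntl(fd1, F_SETLK, &fl) == 0) {     /* l_len = 0: whole file */
        int before = locked_for_others(path);
        close(fd2);                          /* not the descriptor used to lock */
        int after = locked_for_others(path);
        if (before >= 0 && after >= 0)
            result = 2 * before + after;
    } else {
        close(fd2);
    }
    close(fd1);
    unlink(path);
    return result;
}
```

On a system that follows the behavior described above, close_drops_posix_lock( ) should return 2: the lock is visible before the close( ) and gone afterward.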
When used to lock files, the fcntl( ) system call acts on three parameters: the fd file descriptor of the file to be acted upon, a cmd parameter that specifies the lock operation, and an fl pointer to a flock data structure. Version 2.4 of Linux also defines a flock64 structure, which uses 64-bit fields for the file offset and length fields. In the following, we focus on the flock data structure, but the description is valid for flock64 too.
Locks of type FL_POSIX are able to protect an arbitrary file region, even a single byte. The region is specified by three fields of the flock structure. l_start is the initial offset of the region and is relative to the beginning of the file (if field l_whence is set to SEEK_SET), to the current file pointer (if l_whence is set to SEEK_CUR), or to the end of the file (if l_whence is set to SEEK_END). The l_len field specifies the length of the file region (or 0, which means that the region includes all potential writes past the current end of the file).
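As a sketch of how the three fields combine, the hypothetical helper below fills a flock structure and passes it to fcntl( ) with the nonblocking F_SETLK command. The helper names are not part of any library, and the descriptor is assumed to be open for both reading and writing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sets (F_RDLCK, F_WRLCK) or clears (F_UNLCK) a lock on `len` bytes that
 * start `start` bytes from the position named by `whence`; len == 0
 * extends the region past any future end of the file.
 * Returns 0 on success, -1 with errno set otherwise. Never blocks. */
int lock_region(int fd, int type, int whence, off_t start, off_t len)
{
    assert(whence == SEEK_SET || whence == SEEK_CUR || whence == SEEK_END);
    struct flock fl = {
        .l_type = (short)type,
        .l_whence = (short)whence,
        .l_start = start,
        .l_len = len,
    };
    return fcntl(fd, F_SETLK, &fl);
}

/* Example: one byte, a 100-byte range, and everything appended later. */
int lock_three_regions(int fd)
{
    if (lock_region(fd, F_WRLCK, SEEK_SET, 0, 1) != 0)      /* byte 0 only */
        return -1;
    if (lock_region(fd, F_RDLCK, SEEK_SET, 100, 100) != 0)  /* bytes 100..199 */
        return -1;
    return lock_region(fd, F_WRLCK, SEEK_END, 0, 0);        /* from EOF onward */
}
```

A region may lie beyond the current end of the file, so the range locks in the example succeed even on an empty file.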
The sys_fcntl( ) service routine behaves differently, depending on the value of the flag set in the cmd parameter:
F_GETLK
Determines whether the lock described by the flock structure conflicts with some FL_POSIX lock already obtained by another process. In this case, the flock structure is overwritten with the information about the existing lock.
F_SETLK
Sets the lock described by the flock structure. If the lock cannot be acquired, the system call returns an error code.
F_SETLKW
Sets the lock described by the flock structure. If the lock cannot be acquired, the system call blocks; that is, the calling process is put to sleep.
F_GETLK64, F_SETLK64, F_SETLKW64
Identical to the previous ones, but the flock64 data structure is used rather than flock.
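The difference between F_SETLK and F_GETLK shows up when another process already holds a conflicting lock. In the sketch below (Linux assumed; the struct report type, the function name, and the file name template are all hypothetical), a child write-locks bytes 0 to 9, and the parent then requests an overlapping range and asks who is in the way.

```c
#define _GNU_SOURCE           /* exposes mkstemp() in strict modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

struct report {
    int setlk_errno;      /* errno left by the refused F_SETLK */
    struct flock found;   /* description filled in by F_GETLK */
    pid_t holder;         /* PID of the child that owns the lock */
};

/* Returns 0 if the report was filled in, -1 otherwise. */
int holder_report(struct report *out)
{
    assert(out != NULL);
    char path[] = "/tmp/getlk_XXXXXX";       /* hypothetical scratch file */
    int ready[2];
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* the open descriptor suffices */
    if (pipe(ready) != 0) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: write-lock bytes 0..9, report, then hold until killed. */
        struct flock mine = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                              .l_start = 0, .l_len = 10 };
        char ok = (fcntl(fd, F_SETLK, &mine) == 0);
        if (write(ready[1], &ok, 1) != 1)
            _exit(1);
        for (;;)
            pause();
    }

    close(ready[1]);
    int rc = -1;
    char ok = 0;
    if (pid > 0 && read(ready[0], &ok, 1) == 1 && ok) {
        struct flock want = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                              .l_start = 5, .l_len = 10 };
        out->found = want;
        out->holder = pid;
        if (fcntl(fd, F_SETLK, &want) == -1) {   /* refused, does not block */
            out->setlk_errno = errno;
            rc = fcntl(fd, F_GETLK, &out->found);
        }
    }
    if (pid > 0) {
        kill(pid, SIGKILL);                  /* the child's lock dies with it */
        waitpid(pid, NULL, 0);
    }
    close(ready[0]);
    close(fd);
    return rc;
}
```

After a successful call, found describes the child's lock rather than the requested one: type F_WRLCK, start 0, length 10, with l_pid set to the child's PID. POSIX allows the refused F_SETLK to fail with either EAGAIN or EACCES.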
When sys_fcntl( ) acquires a lock, it performs the following:
1. Reads the flock structure from user space.
2. Gets the file object corresponding to fd.
3. Checks whether the lock should be a mandatory one and the file has a shared memory mapping (see Chapter 15). In this case, refuses to create the lock and returns the -EAGAIN error code; the file is already being accessed by another process.
4. Initializes a new file_lock structure according to the contents of the user’s flock structure.
5. Terminates returning an error code if the file does not allow the access mode specified by the type of the requested lock.
6. Invokes the lock method of the file operations, if defined.
7. Invokes the posix_lock_file( ) function, which executes the following actions:
   a. Invokes posix_locks_conflict( ) for each FL_POSIX lock in the inode’s lock list. The function checks whether the lock conflicts with the requested one. Essentially, there must be no FL_POSIX write lock for the same region in the inode list, and there may be no FL_POSIX lock at all for the same region if the process is requesting a write lock. However, locks owned by the same process never conflict; this allows a process to change the characteristics of a lock it already owns.
   b. If a conflicting lock is found and fcntl( ) was invoked with the F_SETLK or F_SETLK64 flag, returns an error code. Otherwise, the current process should be suspended. In this case, invokes posix_locks_deadlock( ) to check that no deadlock condition is being created among processes waiting for FL_POSIX locks, and then inserts the current process in the circular list of blocked processes and suspends it.
   c. As soon as the inode’s lock list includes no conflicting lock, checks all the FL_POSIX locks of the current process that overlap the file region that the current process wants to lock, and combines and splits adjacent areas as required. For example, if the process requested a write lock for a file region that falls inside a read-locked wider region, the previous read lock is split into two parts covering the nonoverlapping areas, while the central region is protected by the new write lock. In case of overlaps, newer locks always replace older ones.
   d. Inserts the new file_lock structure in the global lock list and in the inode list.
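The splitting of an existing lock can be observed from a second process. The following sketch assumes Linux; the helper names and the file name template are hypothetical. The process read-locks bytes 0 to 99, write-locks bytes 40 to 59 inside that region, and then lets a child ask, with F_GETLK, which lock stands in the way at three different offsets.

```c
#define _GNU_SOURCE           /* exposes mkstemp() in strict modes */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* From a child process, asks which lock would block a write lock on the
 * single byte at `offset`; the answer is copied into *found.
 * Returns 0 on success, -1 on error. */
int probe_byte(const char *path, off_t offset, struct flock *found)
{
    assert(path != NULL && found != NULL);
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = offset, .l_len = 1 };
        int fd = open(path, O_RDWR);
        if (fd < 0 || fcntl(fd, F_GETLK, &fl) != 0 ||
            write(fds[1], &fl, sizeof fl) != (ssize_t)sizeof fl)
            _exit(1);
        _exit(0);
    }
    close(fds[1]);
    ssize_t n = read(fds[0], found, sizeof *found);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == (ssize_t)sizeof *found ? 0 : -1;
}

/* Records what another process sees at offsets 10, 50, and 70. */
int split_demo(struct flock seen[3])
{
    char path[] = "/tmp/split_XXXXXX";       /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    struct flock rd = { .l_type = F_RDLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 100 };
    struct flock wr = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 40, .l_len = 20 };
    int rc = -1;
    if (fcntl(fd, F_SETLK, &rd) == 0 && fcntl(fd, F_SETLK, &wr) == 0 &&
        probe_byte(path, 10, &seen[0]) == 0 &&
        probe_byte(path, 50, &seen[1]) == 0 &&
        probe_byte(path, 70, &seen[2]) == 0)
        rc = 0;
    close(fd);
    unlink(path);
    return rc;
}
```

If the kernel splits the read lock as described, the three probes should report a read lock covering bytes 0 to 39, a write lock covering bytes 40 to 59, and a read lock covering bytes 60 to 99.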
[88] Oddly enough, a process may still unlink (delete) a file even if some other process owns a mandatory lock on it! This perplexing situation is possible because when a process deletes a file hard link, it does not modify its contents, but only the contents of its parent directory.