This chapter explains how User Mode processes can synchronize their actions and exchange data. We already covered several synchronization topics in Chapter 5, but the actors there were kernel control paths, not User Mode programs. We are now ready, after having discussed I/O management and filesystems at length, to extend the discussion to User Mode processes. These processes must rely on the kernel to facilitate interprocess synchronization and communication.
As we saw in Section 12.7.1, a form of synchronization among User Mode processes can be achieved by creating a (possibly empty) file and using suitable VFS system calls to lock and unlock it. While processes can similarly share data via temporary files protected by locks, this approach is costly because it requires accesses to the disk filesystem. For this reason, all Unix kernels include a set of system calls that supports process communication without interacting with the filesystem; furthermore, several wrapper functions were developed and inserted in suitable libraries to expedite how processes issue their synchronization requests to the kernel.
As usual, application programmers have a variety of needs that call for different communication mechanisms. Here are the basic mechanisms that Unix systems offer to allow interprocess communication:
Best suited to implement producer/consumer interactions among processes. Some processes fill the pipe with data, while others extract data from the pipe.
Represent, as the name implies, the User Mode version of the kernel semaphores discussed in Section 5.3.6.
Allow processes to exchange messages (short blocks of data) by reading and writing them in predefined message queues.
Allow processes to exchange information via a shared block of memory. In applications that must share large amounts of data, this can be the most efficient form of process communication.
Allow processes on different computers to exchange data through a network, as described in Chapter 18. Sockets can also be used as a communication tool for processes located on the same host computer; the X Window System graphic interface, for instance, uses a socket to allow client programs to exchange data with the X server.
Pipes are an interprocess communication mechanism that is provided in all flavors of Unix. A pipe is a one-way flow of data between processes: all data written by a process to the pipe is routed by the kernel to another process, which can thus read it.
In Unix command shells, pipes can be created by means of the
|
operator. For instance, the following statement
instructs the shell to create two processes connected by a pipe:
$ ls | more
The standard output of the first process, which executes the
ls
program, is redirected to the pipe; the
second process, which executes the more
program,
reads its input from the pipe.
Note that the same results can also be obtained by issuing two commands such as the following:
$ ls > temp $ more < temp
The first command redirects the output of ls
into a regular file; then the second command forces
more
to read its input from the same file. Of
course, using pipes instead of temporary files is usually more
convenient due to the following reasons:
The shell statement is much shorter and simpler.
There is no need to create temporary regular files, which must be deleted later.
Pipes may be considered open files that have no corresponding image
in the mounted filesystems. A process creates a new pipe by means of
the pipe( )
system call, which returns a pair of
file descriptors; the process may then pass these descriptors to its
descendants through fork( )
, thus sharing the pipe
with them. The processes can read from the pipe by using the
read( )
system call with the first file
descriptor; likewise, they can write into the pipe by using the
write( )
system call with the second file
descriptor.
POSIX defines only half-duplex pipes, so even though the
pipe( )
system call returns two file descriptors,
each process must close one before using the other. If a two-way flow
of data is required, the processes must use two different pipes by
invoking pipe( )
twice.
Several Unix systems, such as System V Release 4, implement full-duplex pipes. In a full-duplex pipe , both descriptors can be written into and read from, thus there are two bidirectional channels of information. Linux adopts yet another approach: each pipe’s file descriptors are still one-way, but it is not necessary to close one of them before using the other.
Let’s resume the previous example. When the command
shell interprets the ls|more
statement, it
essentially performs the following actions:
The first child process, which must execute the
ls
program, performs the following operations:
Invokes dup2(4,1)
to copy file descriptor 4 to
file descriptor 1. From now on, file descriptor 1 refers to the
pipe’s write channel.
Invokes the close( )
system call twice to release
file descriptors 3 and 4.
Invokes the execve( )
system call to execute the
ls
program (see Section 20.4). The program writes its output to the file
that has file descriptor 1 (the standard output); i.e., it writes
into the pipe.
The second child process must execute the more
program; therefore, it performs the following operations:
Invokes dup2(3,0)
to copy file descriptor 3 to
file descriptor 0. From now on, file descriptor 0 refers to the
pipe’s read channel.
Invokes the close( )
system call twice to release
file descriptors 3 and 4.
Invokes the execve( )
system call to execute
more
. By default, that program reads its input
from the file that has file descriptor 0 (the standard input); i.e.,
it reads from the pipe.
In this simple example, the pipe is used by exactly two processes. Because of its implementation, though, a pipe can be used by an arbitrary number of processes.[122] Clearly, if two or more processes read or write the same pipe, they must explicitly synchronize their accesses by using file locking (see Section 12.7.1) or IPC semaphores (see Section 19.3.3 later in this chapter).
Many Unix systems provide, besides the pipe( )
system call, two wrapper functions named popen( )
and pclose( )
that handle all the dirty work
usually done when using pipes. Once a pipe has been created by means
of the popen( )
function, it can be used with the
high-level I/O functions included in the C library (fprintf( )
, fscanf( )
, and so on).
In Linux, popen( )
and pclose( )
are included in the C library. The popen( )
function receives two parameters: the
filename
pathname of an executable file and a
type
string specifying the direction of the data
transfer. It returns the pointer to a FILE
data
structure. The popen( )
function essentially
performs the following operations:
Creates a new pipe by using the pipe( )
system
call
Forks a new process, which in turn executes the following operations:
If type
is r
, duplicates the
file descriptor associated with the pipe’s write
channel as file descriptor 1 (standard output); otherwise, if
type
is w
, duplicates the file
descriptor associated with the pipe’s read channel
as file descriptor 0 (standard input)
Closes the file descriptors returned by pipe( )
Invokes the execve( )
system call to execute the
program specified by filename
If type
is r
, closes the file
descriptor associated with the pipe’s write channel;
otherwise, if type
is w
, closes
the file descriptor associated with the pipe’s read
channel
Returns the address of the FILE
file pointer that
refers to whichever file descriptor for the pipe is still open
After the popen( )
invocation, parent and child
can exchange information through the pipe: the parent can read (if
type
is r
) or write (if
type
is w
) data by using the
FILE
pointer returned by the function. The data is
written to the standard output or read from the standard input,
respectively, by the program executed by the child process.
The pclose( )
function (which receives the file
pointer returned by popen( )
as its parameter)
simply invokes the wait4( )
system call and waits
for the termination of the process created by popen( )
.
We now have to start thinking again
on the system call level. Once a pipe is created, a process uses the
read( )
and write( )
VFS system
calls to access it. Therefore, for each pipe, the kernel creates an
inode object plus two file objects—one for reading and the
other for writing. When a process wants to read from or write to the
pipe, it must use the proper file descriptor.
When the inode object refers to a pipe, its i_pipe
field points to a pipe_inode_info
structure shown
in Table 19-1.
Table 19-1. The pipe_inode_info structure
Type |
Field |
Description |
---|---|---|
|
|
Pipe/FIFO wait queue |
|
|
Address of kernel buffer |
|
|
Number of bytes written into the buffer and yet to be read |
|
|
Read position in kernel buffer |
|
|
Flag for (or number of) reading processes |
|
|
Flag for (or number of) writing processes |
|
|
Number of reading processes sleeping in the wait queue |
|
|
Number of writing processes sleeping in the wait queue |
|
|
Like |
|
|
Like |
Besides one inode and two file objects, each pipe has its own
pipe buffer
—a single page frame containing the
data written into the pipe and yet to be read. The address of this
page frame is stored in the base
field of the
pipe_inode_info
structure. The
len
field of the structure stores the number of
bytes written into the pipe buffer that are yet to be read; in the
following, we call that number the current pipe size
.
The pipe buffer is circular and it is accessed both by reading and writing processes, so the kernel must keep track of two current positions in the buffer:
The offset of the next byte to be read, which is stored in the
start
field of the
pipe_inode_info
structure
The offset of the next byte to be written, which is derived from
start
and the pipe size (the
len
field of the structure)
To avoid race conditions on the pipe’s data
structures, the kernel prevents concurrent accesses to the pipe
buffer through the use of the i_sem
semaphore
included in the inode object.
A pipe is implemented as a set of VFS objects, which have no corresponding disk image. In Linux 2.4, these VFS objects are organized into the pipefs special filesystem to expedite their handling (see Section 12.3.1). Since this filesystem has no mount point in the system directory tree, users never see it. However, thanks to pipefs, the pipes are fully integrated in the VFS layer, and the kernel can handle them in the same way as named pipes or FIFOs, which truly exist as files recognizable to end users (see the later section Section 19.2).
The init_pipe_fs( )
function, typically executed
during kernel initialization, registers the
pipefs filesystem and mounts it (refer to the
discussion in Section 12.4.1):
struct file_system_type pipe_fs_type; root_fs_type.name = "pipefs"; root_fs_type.read_super = pipefs_read_super; root_fs_type.fs_flags = FS_NOMOUNT; register_filesystem(&pipe_fs_type); pipe_mnt = do_kern_mount("pipefs", 0, "pipefs", NULL);
The mounted filesystem object that represents the root directory of
pipefs is stored in the
pipe_mnt
variable.
The pipe( )
system call is serviced by the sys_pipe( )
function, which in turn invokes the do_pipe( )
function. To create a new pipe, do_pipe( )
performs the following operations:
Invokes the get_pipe_inode( )
function, which
allocates and initializes an inode object for the pipe in the
pipefs filesystem. In particular, this function
executes the following actions:
Allocates a pipe_inode_info
data structure and
stores its address in the i_pipe
field of the
inode.
Allocates a page frame for the pipe buffer and stores its starting
address in the base
field of the
pipe_inode_info
structure.
Initializes the start
, len
,
waiting_readers
, and
waiting_writers
fields of the
pipe_inode_info
structure to 0.
Initializes the r_counter
and
w_counter
fields of the
pipe_inode_info
structure to 1.
Sets the readers
and writers
fields of the pipe_inode_info
structure to 1.
Allocates a file object and a file descriptor for the read channel of
the pipe, sets the flag
field of the file object
to O_RDONLY
, and initializes the
f_op
field with the address of the read_pipe_fops
table.
Allocates a file object and a file descriptor for the write channel
of the pipe, sets the flag
field of the file
object to O_WRONLY
, and initializes the
f_op
field with the address of the write_pipe_fops
table.
Allocates a dentry object and uses it to link the two file objects and the inode object (see Section 12.1.1); then inserts the new inode in the pipefs special filesystem.
Returns the two file descriptors to the User Mode process.
The process that issues a pipe( )
system call is
initially the only process that can access the new pipe, both for
reading and writing. To represent that the pipe has both a reader and
a writer, the readers
and
writers
fields of the
pipe_inode_info
data structure are initialized to
1. In general, each of these two fields is set to 1 only if the
corresponding pipe’s file object is still opened by
a process; the field is set to 0 if the corresponding file object has
been released, since it is no longer accessed by any process.
Forking a new process does not increase the value of the
readers
and writers
fields, so
they never rise above 1;[123] however, it does increase the value of the
usage counters of all file objects still used by the parent process
(see Section 3.4.1). Thus, the objects are
not released even when the parent dies, and the pipe stays open for
use by the children.
Whenever a process invokes the close( )
system
call on a file descriptor associated with a pipe, the kernel executes
the fput( )
function on the corresponding file
object, which decrements the usage counter. If the counter becomes 0,
the function invokes the release
method of the
file operations (see Section 12.6.3 and Section 12.2.6).
According to whether the file is associated with the read or write
channel, the release
method is implemented by
either pipe_read_release( )
or
pipe_write_release( )
; both functions invoke
pipe_release( )
, which sets either the
readers
field or the writers
field of the pipe_inode_info
structure to 0. The
function checks whether both the readers
and
writers
fields are equal to 0; in this case, it
releases the page frame containing the pipe buffer. Otherwise, the
function wakes up any processes sleeping in the
pipe’s wait queue so they can recognize the change
in the pipe state.
A process wishing to get data from a pipe
issues a read( )
system call, specifying the file
descriptor associated with the pipe’s reading end.
As described in Section 12.6.2, the kernel
ends up invoking the read
method found in the file
operation table associated with the proper file object. In the case
of a pipe, the entry for the read method in the
read_pipe_fops
table points to the
pipe_read( )
function.
The pipe_read( )
function is quite involved, since
the POSIX standard specifies several requirements for the
pipe’s read operations. Table 19-2
summarizes the expected behavior of a read( )
system call that requests n bytes from a pipe
that has a pipe size (number of bytes in the pipe buffer yet to be
read) equal to p.
The system call might block the current process in two cases:
The pipe buffer is empty when the system call starts.
The pipe buffer does not include all requested bytes, and a writing process was previously put to sleep while waiting for space in the buffer.
Notice that the read operation can be nonblocking: in this case, it completes as soon as all available bytes (even none) are copied into the user address space.[124]
Notice also that the value 0 is returned by the read( )
system call only if the pipe is empty and no process is
currently using the file object associated with the
pipe’s write channel.
Table 19-2. Reading n bytes from a pipe
At least one writing process |
No writing process | |||
---|---|---|---|---|
Blocking read |
Nonblocking read | |||
Pipe Size p |
Sleeping writer |
No sleeping writer | ||
p = 0 |
Copy n bytes and return n, waiting for data when the pipe buffer is empty. |
Wait for some data, copy it, and return its size. |
Return |
Return 0. |
0 < p < n |
Copy p bytes and return p: bytes are left in the pipe buffer. | |||
p ≥ n |
Copy n bytes and return n: p-n bytes are left in the pipe buffer. |
The function performs the following operations:
Acquires the i_sem
semaphore of the inode.
Determines whether the pipe size, which is stored into the
len
field of the
pipe_inode_info
structure, is 0. In this case,
determines whether the function must return or whether the process
must be blocked while waiting until another process writes some data
in the pipe (see Table 19-2). The type of I/O
operation (blocking or nonblocking) is specified by the
O_NONBLOCK
flag in the f_flags
field of the file object. If the current process must be blocked, the
function performs the following actions:
Adds 1 to the waiting_readers
field of the
pipe_inode_info
structure.
Adds current
to the wait queue of the pipe (the
wait
field of the
pipe_inode_info
structure).
Releases the inode semaphore.
Sets the process status to TASK_INTERRUPTIBLE
and
invokes schedule( )
.
Once awake, removes current
from the wait queue,
acquires again the i_sem
inode semaphore,
decrements the waiting_readers
field, and then
jumps back to Step 2.
Copies the requested number of bytes (or the number of available bytes, if the buffer size is too small) from the pipe’s buffer to the user address space.
Updates the start
and len
fields of the pipe_inode_info
structure.
Invokes wake_up_interruptible( )
to wake up all
processes sleeping on the pipe’s wait queue.
If not all requested bytes have been copied, there is at least one
writing process currently sleeping
(waiting_writers
field greater than 0) and the
read operation is nonblocking, so the function jumps back to Step 2.
Releases the i_sem
semaphore of the inode.
Returns the number of bytes copied into the user address space.
A process wishing to put data into a pipe
issues a write( )
system call, specifying the file
descriptor for the writing end of the pipe. The kernel satisfies this
request by invoking the write
method of the proper
file object; the corresponding entry in the
write_pipe_fops
table points to the
pipe_write( )
function.
Table 19-3 summarizes the behavior, specified by
the POSIX standard, of a write( )
system call that
requested to write n bytes into a pipe having
u unused bytes in its buffer. In particular, the
standard requires that write operations involving a small number of
bytes must be atomically executed. More precisely, if two or more
processes are concurrently writing into a pipe, each write operation
involving fewer than 4,096 bytes (the pipe buffer size) must finish
without being interleaved with write operations of other processes to
the same pipe. However, write operations involving more than 4,096
bytes may be nonatomic and may also force the calling process to
sleep.
Table 19-3. Writing n bytes to a pipe
At least one reading process | |||
---|---|---|---|
Available buffer space u |
Blocking write |
Nonblocking write |
No reading process |
u<n≤4096 |
Wait until n-u bytes are freed, copy n bytes, and return n. |
Return |
Send |
n>4096 |
Copy n bytes (waiting when necessary) and return n. |
If u>0, copy u bytes and
return u; else return
| |
u≥n |
Copy n bytes and return n. |
Moreover, each write operation to a pipe must fail if the pipe does
not have a reading process (that is, if the
readers
field of the pipe’s inode
object has the value 0). In this case, the kernel sends a
SIGPIPE
signal to the writing process and
terminates the write( )
system call with the
-EPIPE
error code, which usually leads to the
familiar “Broken pipe” message.
The pipe_write( )
function performs the following
operations:
Acquires the i_sem
semaphore of the inode.
Checks whether the pipe has at least one reading process. If not, it
sends a SIGPIPE
signal to the
current
process, releases the inode semaphore, and
returns an -EPIPE
value.
Checks whether the number of bytes to be written is within the pipe’s buffer size:
If so, the write operation must be atomic. Therefore, checks whether the buffer has enough free space to store all bytes to be written.
Otherwise, if the number of bytes is greater than the buffer size, the operation can start as long as there is any free space at all. Therefore, the function checks for at least one free byte.
If the buffer does not have enough free space and the write operation
is nonblocking, releases the inode semaphore and returns the
-EAGAIN
error code.
If the buffer does not have enough free space and the write operation is blocking, performs the following actions:
Adds 1 to the waiting_writers
field of the
pipe_inode_info
structure.
Adds current
to the wait queue of the pipe (the
wait
field of the
pipe_inode_info
structure).
Releases the inode semaphore.
Sets the process status to TASK_INTERRUPTIBLE
and
invokes schedule( )
.
Once awake, removes current
from the wait queue,
again acquires the inode semaphore, decrements the
waiting_writers
field, and then jumps back to Step
5.
Now the pipe buffer has enough free space to either copy the requested number of bytes (if the write operation must be atomic) or copy at least one byte; notice that other writers cannot steal free space because this writer owns the inode semaphore. Copies the requested number of bytes (or the number of free bytes if the pipe size is too small) from the user address space to the pipe’s buffer.
Wakes up all processes sleeping on the pipe’s wait queue.
If the write operation was blocking and not all requested bytes were written in the pipe buffer, jumps back to Step 5. Notice that this case may occur only when the write operation is nonatomic; hence the current process remains blocked until one or more bytes of the pipe buffer are freed.
Releases the inode semaphore.
[122] Since most shells offer pipes that connect only two processes, applications requiring pipes used by more than two processes must be coded in a programming language such as C.
[123] As we’ll
see, the readers
and writers
fields act as counters instead of flags when associated with
FIFOs.
[124] Nonblocking operations are
usually requested by specifying the O_NONBLOCK
flag in the open( )
system call. This method does
not work for pipes, since they cannot be opened. A process can,
however, require a nonblocking operation on a pipe by issuing a
fcntl( )
system call on the corresponding file
descriptor.