Unix operating systems rely heavily on process creation to satisfy user requests. For example, the shell creates a new process that executes another copy of the shell whenever the user enters a command.
Traditional Unix systems treat all processes in the same way:
resources owned by the parent process are duplicated in the child
process. This approach makes process creation very slow and
inefficient, since it requires copying the entire address space of
the parent process. The child process rarely needs to read or modify
all the resources inherited from the parent; in many cases, it issues
an immediate execve( ) and wipes out the address space that was so carefully copied.
Modern Unix kernels solve this problem by introducing three different mechanisms:
The Copy On Write technique allows both the parent and the child to read the same physical pages. Whenever either one tries to write to a physical page, the kernel copies its contents into a new physical page that is assigned to the writing process. The implementation of this technique in Linux is fully explained in Chapter 8.
Lightweight processes allow both the parent and the child to share many per-process kernel data structures, such as the paging tables (and therefore the entire User Mode address space), the open file tables, and the signal dispositions.
The vfork( ) system call creates a process that shares the memory address space of its parent. To prevent the parent from overwriting data needed by the child, the parent’s execution is blocked until the child exits or executes a new program.
We’ll learn more about the vfork( ) system call in the following section.
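The effect of Copy On Write can be observed from User Mode, even though the mechanism itself is invisible to programs. The following is a minimal sketch, assuming a Linux system with a standard C library; the helper name parent_value_after_child_write( ) is hypothetical. The child writes to a variable it inherited from the parent, and the parent then confirms that its own copy is unchanged.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 1;   /* initially backed by a page shared by parent and child */

/* Hypothetical helper: returns the parent's view of `value` after the
 * child has modified its own copy, or -1 on error. */
int parent_value_after_child_write(void)
{
    value = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the first write triggers a page fault, and the kernel
         * gives the child a private copy of the page. */
        value = 99;
        _exit(value == 99 ? 0 : 1);
    }

    /* Parent: wait for the child, then read our own copy. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return value;   /* unaffected by the child's write */
}
```

Called from a test program, the helper should return 1: the child's write never reaches the parent, because the two processes stop sharing the page as soon as one of them modifies it.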
Lightweight processes are created in Linux by using a function named clone( ), which uses four parameters:
fn
Specifies a function to be executed by the new process; when the function returns, the child terminates. The function returns an integer, which represents the exit code for the child process.
arg
Points to data passed to the fn( ) function.
flags
Miscellaneous information. The low byte specifies the signal number to be sent to the parent process when the child terminates; the SIGCHLD signal is generally selected. The remaining three bytes encode a group of clone flags, which specify the resources to be shared between the parent and the child process as follows:
CLONE_VM
Shares the memory descriptor and all Page Tables (see Chapter 8).
CLONE_FS
Shares the table that identifies the root directory and the current working directory, as well as the value of the bitmask used to mask the initial file permissions of a new file (the so-called file umask ).
CLONE_FILES
Shares the table that identifies the open files (see Chapter 12).
CLONE_PARENT
Sets the parent of the child (p_pptr and p_opptr fields in the process descriptor) to the parent of the calling process.
CLONE_PID
Shares the PID.[22]
CLONE_PTRACE
If a ptrace( ) system call is causing the parent process to be traced, the child will also be traced.
CLONE_SIGHAND
Shares the table that identifies the signal handlers (see Chapter 10).
CLONE_THREAD
Inserts the child into the same thread group as the parent, and the child’s tgid field is set accordingly. If this flag is set, it implicitly enforces CLONE_PARENT.
CLONE_SIGNAL
Equivalent to setting both CLONE_SIGHAND and CLONE_THREAD, so that it is possible to send a signal to all threads of a multithreaded application.
CLONE_VFORK
Used for the vfork( ) system call (see later in this section).
child_stack
Specifies the User Mode stack pointer to be assigned to the esp register of the child process. If it is equal to 0, the kernel assigns the current parent stack pointer to the child. Therefore, the parent and child temporarily share the same User Mode stack. But thanks to the Copy On Write mechanism, they usually get separate copies of the User Mode stack as soon as one tries to change the stack. However, this parameter must have a non-null value if the child process shares the same address space as the parent.
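The layout of the flags parameter (exit signal in the low byte, clone flags in the remaining bytes) can be illustrated with a few lines of C. This is a sketch that assumes the Linux exported header <linux/sched.h> is available; it provides the CLONE_* constants and the CSIGNAL mask (0x000000ff). The three function names are hypothetical helpers, not kernel or library functions.

```c
#include <signal.h>        /* SIGCHLD */
#include <linux/sched.h>   /* CLONE_* flags and the CSIGNAL mask */

/* Signal sent to the parent when the child terminates (low byte). */
unsigned long exit_signal(unsigned long flags)
{
    return flags & CSIGNAL;
}

/* The clone flags proper (the remaining three bytes). */
unsigned long sharing_flags(unsigned long flags)
{
    return flags & ~(unsigned long)CSIGNAL;
}

/* Example flags word for a child that shares its address space,
 * filesystem information, open files, and signal handlers with the
 * parent, and that notifies the parent with SIGCHLD when it dies. */
unsigned long thread_like_flags(void)
{
    return CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
}
```

Because every CLONE_* constant has its low byte clear, the signal number and the sharing flags can be combined with a bitwise OR and separated again with the mask.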
clone( ) is actually a wrapper function defined in the C library (see Section 9.1), which in turn uses a clone( ) system call hidden from the programmer. This system call receives only the flags and child_stack parameters; the new process always starts its execution from the instruction following the system call invocation. When the system call returns to the clone( ) function, the function determines whether it is running in the parent or in the child and forces the child to execute the fn( ) function.
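A short User Mode example may help to show how the wrapper is used. The sketch below assumes the GNU C library, whose clone( ) wrapper takes its arguments in the order fn, child stack, flags, arg (a different order from the listing above). The stack size, the variable shared_counter, and the helper run_shared_vm_child( ) are illustrative choices, not part of any API.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* illustrative size (an assumption) */

static volatile int shared_counter;    /* lives in the shared address space */

/* fn: runs in the child; its return value becomes the child's exit code. */
static int child_fn(void *arg)
{
    shared_counter = *(int *)arg;
    return 0;
}

/* Hypothetical helper: creates a lightweight process sharing our address
 * space and returns the value the parent then sees in shared_counter,
 * or -1 on error. */
int run_shared_vm_child(int value)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    shared_counter = 0;

    /* With CLONE_VM the child needs its own stack. The stack grows
     * downward on x86, so we pass the top of the buffer. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                      CLONE_VM | SIGCHLD, &value);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status;
    pid_t reaped = waitpid(pid, &status, 0);
    free(stack);
    if (reaped != pid || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return shared_counter;   /* written by the child, visible to us */
}
```

Unlike a child created by fork( ), this child writes directly into the parent's memory, so run_shared_vm_child(42) should return 42 once the child has terminated.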
The traditional fork( ) system call is implemented by Linux as a clone( ) system call whose flags parameter specifies a SIGCHLD signal with all the clone flags cleared, and whose child_stack parameter is 0.
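This equivalence can be tried out by invoking the raw system call through the C library's generic syscall( ) function. The following is only a sketch: it assumes the x86 or x86-64 argument order, in which flags comes first and the child stack pointer second, and the helper name fork_via_clone( ) is hypothetical. It says nothing about how a particular C library actually implements fork( ).

```c
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: creates a child with the raw clone system call,
 * using only SIGCHLD as flags and a null child stack, and returns the
 * child's exit code as seen by the parent (-1 on error). */
int fork_via_clone(int child_exit_code)
{
    /* Assumption: x86 / x86-64 argument order (flags, child stack, ...).
     * The remaining arguments are unused here and passed as 0. */
    long pid = syscall(SYS_clone, (unsigned long)SIGCHLD, 0UL, 0UL, 0UL, 0UL);
    if (pid < 0)
        return -1;

    if (pid == 0)
        _exit(child_exit_code);   /* child: the system call returned 0 */

    /* Parent: the system call returned the child's PID. */
    int status;
    if (waitpid((pid_t)pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

With no clone flags set, nothing is shared: the child gets Copy On Write duplicates of the parent's resources, including its User Mode stack, which is why a null stack pointer is acceptable here.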
The vfork( ) system call, described in the previous section, is implemented by Linux as a clone( ) system call whose first parameter specifies both a SIGCHLD signal and the flags CLONE_VM and CLONE_VFORK, and whose second parameter is equal to 0.
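The typical use of vfork( ) is to launch another program. The sketch below assumes a POSIX-like system; the helper name run_with_vfork( ) is hypothetical. Because the child borrows the parent's address space and stack, it does nothing except call execv( ) or, if that fails, _exit( ).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs the program at `path` in a child created by
 * vfork() and returns its exit code, 127 if the program could not be
 * executed, or -1 on error. */
int run_with_vfork(const char *path, char *const argv[])
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: runs in the parent's address space while the parent is
         * suspended, so it only calls execv() or _exit(). */
        execv(path, argv);
        _exit(127);   /* reached only if execv() failed */
    }

    /* Parent: resumes once the child has executed a new program or exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

For instance, passing "/bin/sh" with the arguments sh, -c, "exit 9" should yield 9, assuming that shell exists on the system; a path that cannot be executed yields 127.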
When a clone( ), fork( ), or vfork( ) system call is issued, the kernel invokes the do_fork( ) function, which executes the following steps:
If the CLONE_PID flag is specified, the do_fork( ) function checks whether the PID of the parent process is not 0; if so, it returns an error code. Only the swapper process is allowed to set CLONE_PID; this is required when initializing a multiprocessor system.
The alloc_task_struct( ) function is invoked to get a new 8 KB union task_union memory area to store the process descriptor and the Kernel Mode stack of the new process.
The function follows the current pointer to obtain the parent process descriptor and copies it into the new process descriptor in the memory area just allocated.
A few checks occur to make sure the user has the resources necessary to start a new process. First, the function checks whether current->rlim[RLIMIT_NPROC].rlim_cur is smaller than or equal to the current number of processes owned by the user. If so, an error code is returned, unless the process has root privileges. The function gets the current number of processes owned by the user from a per-user data structure named user_struct. This data structure can be found through a pointer in the user field of the process descriptor.
The function checks that the number of processes is smaller than the value of the max_threads variable. The initial value of this variable depends on the amount of RAM in the system. The general rule is that the space taken by all process descriptors and Kernel Mode stacks cannot exceed 1/8 of the physical memory. However, the system administrator may change this value by writing to the /proc/sys/kernel/threads-max file.
If the parent process uses any kernel modules, the function increments the corresponding reference counters. As we shall see in Appendix B, each kernel module has its own reference counter, which ensures that the module will not be unloaded while it is being used.
The function then updates some of the flags included in the flags field that have been copied from the parent process:
It clears the PF_SUPERPRIV flag, which indicates whether the process has used any of its superuser privileges.
It clears the PF_USEDFPU flag.
It sets the PF_FORKNOEXEC flag, which indicates that the child process has not yet issued an execve( ) system call.
Now the function has taken almost everything that it can use from the parent process; the rest of its activities focus on setting up new resources in the child and letting the kernel know that this new process has been born. First, the function invokes the get_pid( ) function to obtain a new PID, which will be assigned to the child process (unless the CLONE_PID flag is set).
The function then updates all the process descriptor fields that cannot be inherited from the parent process, such as the fields that specify the process parenthood relationships.
Unless specified differently by the flags parameter, it invokes copy_files( ), copy_fs( ), copy_sighand( ), and copy_mm( ) to create new data structures and copy into them the values of the corresponding parent process data structures.
The do_fork( ) function invokes copy_thread( ) to initialize the Kernel Mode stack of the child process with the values contained in the CPU registers when the clone( ) call was issued (these values have been saved in the Kernel Mode stack of the parent, as described in Chapter 9). However, the function forces the value 0 into the field corresponding to the eax register. The thread.esp field in the descriptor of the child process is initialized with the base address of the child’s Kernel Mode stack, and the address of an assembly language function (ret_from_fork( )) is stored in the thread.eip field. The copy_thread( ) function also invokes unlazy_fpu( ) on the parent and duplicates the contents of the thread.i387 field.
If either CLONE_THREAD or CLONE_PARENT is set, the function copies the value of the p_opptr and p_pptr fields of the parent into the corresponding fields of the child. The parent of the child thus appears as the parent of the current process. Otherwise, the function stores the process descriptor address of current into the p_opptr and p_pptr fields of the child.
If the CLONE_PTRACE flag is not set, the function sets the ptrace field in the child process descriptor to 0. This field stores a few flags used when a process is being traced by another process. Thus, even if the current process is being traced, the child will not be.
Conversely, if the CLONE_PTRACE flag is set, the function checks whether the parent process is being traced because in this case, the child should be traced too. Therefore, if PT_PTRACED is set in current->ptrace, the function copies the current->p_pptr field into the corresponding field of the child.
The do_fork( ) function checks the value of CLONE_THREAD. If the flag is set, the function inserts the child in the thread group of the parent and copies the parent’s tgid value into the child’s tgid field; otherwise, the function sets the tgid field to the value of the pid field.
The function uses the SET_LINKS macro to insert the new process descriptor in the process list.
The function invokes hash_pid( ) to insert the new process descriptor in the pidhash hash table.
The function increments the values of nr_threads and current->user->processes.
If the child is being traced, the function sends a SIGSTOP signal to it so that the debugger has a chance to look at it before it starts the execution.
It invokes wake_up_process( ) to set the state field of the child process descriptor to TASK_RUNNING and to insert the child in the runqueue list.
If the CLONE_VFORK flag is specified, the function inserts the parent process in a wait queue and suspends it until the child releases its memory address space (that is, until the child either terminates or executes a new program).
The do_fork( ) function returns the PID of the child, which is eventually read by the parent process in User Mode.
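The two resource checks performed early in do_fork( ) have counterparts that can be inspected from User Mode. The following is a rough sketch rather than kernel code: nproc_limit_reached( ) is a hypothetical function that only models the comparison described in the steps above, while the other two helpers (also hypothetical names) read the real values through getrlimit( ) and the /proc/sys/kernel/threads-max file.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Models the per-user check: process creation is refused when the soft
 * limit is smaller than or equal to the number of processes the user
 * already owns, unless the caller is privileged.
 * Returns 1 if creation would be refused, 0 otherwise. */
int nproc_limit_reached(rlim_t rlim_cur, unsigned long user_processes,
                        int privileged)
{
    return !privileged && rlim_cur <= user_processes;
}

/* Stores the calling process's soft RLIMIT_NPROC in *out.
 * Returns 0 on success, -1 on failure. */
int current_nproc_limit(rlim_t *out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}

/* Reads the system-wide cap from /proc/sys/kernel/threads-max.
 * Returns the value, or -1 if the file cannot be read. */
long read_threads_max(void)
{
    FILE *f = fopen("/proc/sys/kernel/threads-max", "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

The model uses "smaller than or equal to" deliberately: with a limit of 10 and 10 processes already owned, an unprivileged request is refused, while a privileged one goes through.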
Now we have a complete child process in the runnable state. But it isn’t actually running. It is up to the scheduler to decide when to give the CPU to this child. At some future process switch, the scheduler bestows this favor on the child process by loading a few CPU registers with the values of the thread field of the child’s process descriptor. In particular, esp is loaded with thread.esp (that is, with the address of the child’s Kernel Mode stack), and eip is loaded with the address of ret_from_fork( ). This assembly language function, in turn, invokes the ret_from_sys_call( ) function (see Chapter 9), which reloads all other registers with the values stored in the stack and forces the CPU back to User Mode. The new process then starts its execution right at the end of the fork( ), vfork( ), or clone( ) system call. The value returned by the system call is contained in eax: the value is 0 for the child and equal to the child’s PID for the parent.
The child process executes the same code as the parent, except that the fork( ) call returns 0 in the child. The developer of the application can exploit this fact, in a manner familiar to Unix programmers, by inserting a conditional statement in the program based on the PID value that forces the child to behave differently from the parent process.
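The conditional statement in question usually looks like the following sketch, which assumes a POSIX-like system; the helper name fork_return_values_ok( ) is hypothetical. The child recognizes itself by the 0 return value and checks that its parent is the process that called fork( ); the parent checks that the PID it received identifies the child it later reaps.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if parent and child each observed the
 * expected fork() return value, 0 if not, and -1 if fork() failed. */
int fork_return_values_ok(void)
{
    pid_t parent_pid = getpid();

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child branch: fork() returned 0. getppid() should name the
         * process that created us. */
        _exit(getppid() == parent_pid ? 0 : 1);
    }

    /* Parent branch: fork() returned the child's PID. */
    int status;
    pid_t reaped = waitpid(pid, &status, 0);

    return reaped == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Both branches execute the same program text; only the value left in the return register by the kernel tells them apart.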
Traditional Unix systems delegate some critical tasks to intermittently running processes, including flushing disk caches, swapping out unused page frames, servicing network connections, and so on. Indeed, it is not efficient to perform these tasks in strict linear fashion; both their functions and the end user processes get better responses if they are scheduled in the background. Since some of the system processes run only in Kernel Mode, modern operating systems delegate their functions to kernel threads, which are not encumbered with the unnecessary User Mode context. In Linux, kernel threads differ from regular processes in the following ways:
Each kernel thread executes a single specific kernel C function, while regular processes execute kernel functions only through system calls.
Kernel threads run only in Kernel Mode, while regular processes run alternately in Kernel Mode and in User Mode.
Since kernel threads run only in Kernel Mode, they use only linear addresses greater than PAGE_OFFSET. Regular processes, on the other hand, use all four gigabytes of linear addresses, in either User Mode or Kernel Mode.
The kernel_thread( ) function creates a new kernel thread and can be executed only by another kernel thread. The function contains mostly inline assembly language code, but it is roughly equivalent to the following:
    int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
    {
        int p;
        p = clone( 0, flags | CLONE_VM );
        if ( p )        /* parent */
            return p;
        else {          /* child */
            fn(arg);
            exit( );
        }
    }
The ancestor of all processes, called process 0 or, for historical reasons, the swapper process, is a kernel thread created from scratch during the initialization phase of Linux by the start_kernel( ) function (see Appendix A). This ancestor process uses the following data structures:
A process descriptor and a Kernel Mode stack stored in the init_task_union variable. The init_task and init_stack macros yield the addresses of the process descriptor and the stack, respectively.
The following tables, which the process descriptor points to:
init_mm
init_fs
init_files
init_signals
The tables are initialized, respectively, by the following macros:
INIT_MM
INIT_FS
INIT_FILES
INIT_SIGNALS
The master kernel Page Global Directory stored in swapper_pg_dir (see Section 2.5.5).
The start_kernel( ) function initializes all the data structures needed by the kernel, enables interrupts, and creates another kernel thread, named process 1 (more commonly referred to as the init process):
kernel_thread(init, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL);
The newly created kernel thread has PID 1 and shares all per-process kernel data structures with process 0. Moreover, when selected by the scheduler, the init process starts executing the init( ) function.
After creating the init process, process 0 executes the cpu_idle( ) function, which essentially consists of repeatedly executing the hlt assembly language instruction with the interrupts enabled (see Chapter 4). Process 0 is selected by the scheduler only when there are no other processes in the TASK_RUNNING state.
The kernel thread created by process 0 executes the init( ) function, which in turn completes the initialization of the kernel. Then init( ) invokes the execve( ) system call to load the executable program init. As a result, the init kernel thread becomes a regular process having its own per-process kernel data structures (see Chapter 20). The init process stays alive until the system is shut down, since it creates and monitors the activity of all processes that implement the outer layers of the operating system.
Linux uses many other kernel threads. Some of them are created in the initialization phase and run until shutdown; others are created “on demand,” when the kernel must execute a task that is better performed in its own execution context.
The most important kernel threads (besides process 0 and process 1) are:
keventd
Executes the tasks in the tq_context task queue (see Section 4.7.3).
kapmd
Handles the events related to the Advanced Power Management (APM).
kswapd
Performs memory reclaiming, as described in Section 16.7.7.
bdflush
Flushes “dirty” buffers to disk to reclaim memory, as described in Section 14.2.4.
kupdated
Flushes old “dirty” buffers to disk to reduce risks of filesystem inconsistencies, as described in Section 14.2.4.
ksoftirqd
Runs the tasklets (see Section 4.7); there is one kernel thread for each CPU in the system.
[22] As we shall see later, the CLONE_PID flag can be used only by a process having a PID of 0; in a uniprocessor system, no two lightweight processes have the same PID.