To manage processes, the
kernel must have a clear picture of
what each process is doing. It must know, for instance, the
process’s priority, whether it is running on a CPU
or blocked on an event, what address space has been assigned to it,
which files it is allowed to address, and so on. This is the role of
the process descriptor
— a
task_struct
type structure whose fields contain
all the information related to a single process. As the repository of
so much information, the process descriptor is rather complex. In
addition to a large number of fields containing process attributes,
the process descriptor contains several pointers to other data
structures that, in turn, contain pointers to other structures. Figure 3-1 describes the Linux process descriptor
schematically.
The five data structures on the right side of the figure refer to specific resources owned by the process. These resources are covered in future chapters. This chapter focuses on two types of fields that refer to the process state and to process parent/child relationships.
As its name implies, the state
field of the
process descriptor describes what is currently happening to the
process. It consists of an array of flags, each of which describes a
possible process state. In the current Linux version, these states
are mutually exclusive, and hence exactly one flag of
state
is set; the remaining flags are cleared. The
following are the possible process states:
TASK_RUNNING
The process is either executing on a CPU or waiting to be executed.
TASK_INTERRUPTIBLE
The process is suspended (sleeping) until some condition becomes
true. Raising a hardware interrupt, releasing a system resource the
process is waiting for, or delivering a signal are examples of
conditions that might wake up the process (put its state back to
TASK_RUNNING
).
TASK_UNINTERRUPTIBLE
Like the previous state, except that delivering a signal to the sleeping process leaves its state unchanged. This process state is seldom used. It is valuable, however, under certain specific conditions in which a process must wait until a given event occurs without being interrupted. For instance, this state may be used when a process opens a device file and the corresponding device driver starts probing for a corresponding hardware device. The device driver must not be interrupted until the probing is complete, or the hardware device could be left in an unpredictable state.
TASK_STOPPED
Process execution has been stopped; the process enters this state
after receiving a SIGSTOP
,
SIGTSTP
, SIGTTIN
, or
SIGTTOU
signal. When a process is being monitored
by another (such as when a debugger executes a ptrace( )
system call to monitor a test program), each signal may
put the process in the TASK_STOPPED
state.
TASK_ZOMBIE
Process execution is terminated, but the parent process has not yet
issued a wait( )
-like system call —
wait( )
, wait3( )
,
wait4( )
, or waitpid( )
—
to return information about the dead process. Before the
wait( )
-like call is issued, the kernel cannot
discard the data contained in the dead process descriptor because the
parent might need it. (See Section 3.5.2 near the end of this
chapter.)
The value of the state
field is usually set with a
simple assignment. For instance:
procdesc_ptr->state = TASK_RUNNING;
The kernel also uses the set_task_state
and
set_current_state
macros: they set the state of a
specified process and of the process currently executed,
respectively. Moreover, these macros ensure that the assignment
operation is not mixed with other instructions by the compiler or the
CPU control unit. Mixing the instruction order may sometimes lead to
catastrophic results (see Chapter 5).
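The following fragment is only a sketch of how these macros are typically used, not code taken from the kernel sources; wait_for_event( ) is a hypothetical function introduced for illustration:
/* Hypothetical example: put the current process to sleep until it is
 * woken up.  The wait queue bookkeeping is omitted here; it is covered
 * later in this chapter. */
static void wait_for_event(void)
{
    set_current_state(TASK_INTERRUPTIBLE);  /* mark the process as sleeping */
    schedule( );                            /* let the scheduler pick another
                                               process to run */
    /* execution resumes here once the process has been woken up */
    set_current_state(TASK_RUNNING);
}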
As a general rule, each execution context that can be independently
scheduled must have its own process descriptor; therefore, even
lightweight processes, which share a large portion of their kernel
data structures, have their own task_struct
structures.
The strict one-to-one correspondence between the process and process descriptor makes the 32-bit process descriptor address[17] a useful means for the kernel to identify processes. These addresses are referred to as process descriptor pointers . Most of the references to processes that the kernel makes are through process descriptor pointers.
On the other hand, Unix-like operating systems allow users to
identify processes by means of a number called the Process ID
(or PID), which is
stored in the pid
field of the process descriptor.
PIDs are numbered sequentially: the PID of a newly created process is
normally the PID of the previously created process incremented by
one. However, for compatibility with traditional Unix systems
developed for 16-bit hardware platforms, the maximum PID number
allowed on Linux is 32,767. When the kernel creates the 32,768th
process in the system, it must start recycling the lower, unused
PIDs.
Linux associates a different PID with each process or lightweight process in the system. (As we shall see later in this chapter, there is a tiny exception on multiprocessor systems.) This approach allows the maximum flexibility, since every execution context in the system can be uniquely identified.
On the other hand, Unix programmers expect threads in the same group to have a common PID. For instance, it should be possible to send a signal specifying a PID that affects all threads in the group. In fact, the POSIX 1003.1c standard states that all threads of a multithreaded application must have the same PID.
To comply with this standard, Linux 2.4 introduces the notion of
thread group
. A thread group is essentially a
collection of lightweight processes that correspond to the threads of
a multithreaded application. All descriptors of the lightweight
processes in the same thread group are collected in a doubly linked
list implemented through the thread_group
field of
the task_struct
structure. The identifier shared
by the threads is the PID of the first lightweight process in the
group; it is stored in the tgid
field of the
process descriptors. The getpid( )
system call
returns current->tgid
instead of
current->pid
, so all the threads of a
multithreaded application share the same identifier. The
tgid
field has the same value as the
pid
field, both for normal processes and for
lightweight processes not included in a thread group. Therefore, the
getpid( )
system call works as usual for them.
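Under this scheme, the getpid( ) service routine can be as simple as the following sketch (simplified with respect to the actual kernel sources):
/* getpid( ) returns the thread group identifier rather than the PID, so
 * every lightweight process in the same thread group gets the same value. */
asmlinkage long sys_getpid(void)
{
    return current->tgid;
}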
Later, we’ll show you how it is possible to derive a
true process descriptor pointer efficiently from its respective PID.
Efficiency is important because many system calls such as
kill( )
use the PID to denote the affected
process.
Processes are dynamic entities whose lifetimes range from a few milliseconds to months. Thus, the kernel must be able to handle many processes at the same time, and process descriptors are stored in dynamic memory rather than in the memory area permanently assigned to the kernel. Linux stores two different data structures for each process in a single 8 KB memory area: the process descriptor and the Kernel Mode process stack.
In Section 2.3, we learned that a process in Kernel Mode accesses a stack contained in the kernel data segment, which is different from the stack used by the process in User Mode. Since kernel control paths make little use of the stack, only a few thousand bytes of kernel stack are required. Therefore, 8 KB is ample space for the stack and the process descriptor.
Figure 3-2 shows how the two data structures are stored in the 2-page (8 KB) memory area. The process descriptor resides at the beginning of the memory area and the stack grows downward from the end.
The esp
register is the CPU stack pointer,
which is used to address the stack’s top location.
On Intel systems, the stack starts at the end and grows toward the
beginning of the memory area. Right after switching from User Mode to
Kernel Mode, the kernel stack of a process is always empty, and
therefore the esp
register points to the byte
immediately following the memory area.
The value of the esp
is decremented as soon as
data is written into the stack. Since the process descriptor is less
than 1,000 bytes long, the kernel stack can expand up to 7,200 bytes.
The C language allows the process descriptor and the kernel stack of a process to be conveniently represented by means of the following union construct:
union task_union {
    struct task_struct task;
    unsigned long stack[2048];
};
The process descriptor shown in Figure 3-2 is
stored starting at address 0x015fa000
, and the
stack is stored starting at address 0x015fc000
.
The value of the esp
register points to the
current top of the stack at 0x015fa878
.
The kernel uses the alloc_task_struct
and
free_task_struct
macros to allocate and release
the 8 KB memory area storing a process descriptor and a kernel stack.
The close
association between the process descriptor and the Kernel Mode stack
just described offers a key benefit in terms of efficiency: the
kernel can easily obtain the process descriptor pointer of the
process currently running on a CPU from the value of the
esp
register. In fact, since the memory area is 8
KB (2^13 bytes) long, all the kernel has to
do is mask out the 13 least significant bits of
esp
to obtain the base address of the process
descriptor. This is done by the current
macro,
which produces assembly language instructions like the following:
movl $0xffffe000, %ecx
andl %esp, %ecx
movl %ecx, p
After executing these three instructions, p
contains the process descriptor pointer of the process running on the
CPU that executes the instruction.[18]
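In C, the same computation can be expressed as in the following sketch, which is close to (but simplified with respect to) the architecture-dependent definition found in the i386 kernel headers:
/* Mask out the 13 least significant bits of esp to get the base address
 * of the 8 KB memory area, which is also the process descriptor address. */
static inline struct task_struct * get_current(void)
{
    struct task_struct *current;
    __asm__("andl %%esp,%0" : "=r" (current) : "0" (~8191UL));
    return current;
}
#define current get_current()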
The current
macro often appears in kernel code as
a prefix to fields of the process descriptor. For example,
current->pid
returns the process ID of the
process currently running on the CPU.
Another advantage of storing the process descriptor with the stack
emerges on multiprocessor systems: the correct current process for
each hardware processor can be derived just by checking the stack, as
shown previously. Linux 2.0 did not store the kernel stack and the
process descriptor together. Instead, it was forced to introduce a
global static variable called current
to identify
the process descriptor of the running process. On multiprocessor
systems, it was necessary to define current
as an
array—one element for each available CPU.
To allow an efficient search through processes of a given type (for
instance, all processes in a runnable state), the kernel creates
several lists of processes. Each list consists of pointers to process
descriptors. A list pointer (that is, the field that each process
uses to point to the next process) is embedded right in the process
descriptor’s data structure. When you look at the
C-language declaration of the task_struct
structure, the descriptors may seem to turn in on themselves in a
complicated recursive manner. However, the concept is no more
complicated than any list, which is a data structure containing a
pointer to the next instance of itself.
A circular doubly linked list (see Figure 3-3)
links all existing process descriptors; we will call it the
process list
. The prev_task
and
next_task
fields of each process descriptor are
used to implement the list. The head of the list is the
init_task
descriptor; it is the ancestor of
all processes, and is called process 0
or
swapper (see Section 3.4.2 later in this chapter). The
prev_task
field of init_task
points to the process descriptor inserted last in the list.
The SET_LINKS
and REMOVE_LINKS
macros are used to insert and to remove a process descriptor in the
process list, respectively. These macros also take care of the
parenthood relationship of the process (see Section 3.2.3 later in this chapter).
Another useful macro, called for_each_task
, scans
the whole process list. It is defined as:
#define for_each_task(p) \
    for (p = &init_task ; (p = p->next_task) != &init_task ; )
The macro is the loop control statement after which the kernel
programmer supplies the loop. Notice how the
init_task
process descriptor just plays the role
of list header. The macro starts by moving past
init_task
to the next task and continues until it
reaches init_task
again (thanks to the circularity
of the list).
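As an illustration (not kernel code), the following fragment uses for_each_task to count the processes currently in the TASK_RUNNING state; the process list is protected by the tasklist_lock read/write spin lock while it is being scanned:
/* Illustration only: walk the whole process list and count the runnable
 * processes.  (The kernel already keeps this count in nr_running.) */
struct task_struct *p;
int runnable = 0;

read_lock(&tasklist_lock);          /* protect the list while scanning it */
for_each_task(p) {
    if (p->state == TASK_RUNNING)
        runnable++;
}
read_unlock(&tasklist_lock);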
The process list is a special doubly linked list. However, as you may have noticed, the Linux kernel uses hundreds of doubly linked lists that store the various kernel data structures.
For each list, a set of primitive operations must be implemented: initializing the list, inserting and deleting an element, scanning the list, and so on. It would be both a waste of programmers’ efforts and a waste of memory to replicate the primitive operations for each different list.
Therefore, the Linux kernel defines the list_head
data structure, whose fields next
and
prev
represent the forward and back pointers of a
generic doubly linked list element, respectively. It is important to
note, however, that the pointers in a list_head
field store the addresses of other list_head
fields rather than the addresses of the whole data structures in
which the list_head
structure is included (see
Figure 3-4).
A new list is created by using the
LIST_HEAD(list_name)
macro. It declares a new
variable named list_name
of type
list_head
, which is the conventional first element
of the new list (much as init_task
is the
conventional first element of the process list).
Several functions and macros implement the primitives, including those shown in the following list; a brief usage sketch follows the list.
list_add(n,p)
Inserts an element pointed to by n right after the specified element pointed to by p
(to insert
n
at the beginning of the list, set
p
to the address of the conventional first
element)
list_add_tail(n,h)
Inserts an element pointed to by n
at the end of the
list specified by the address h
of its
conventional first element
list_del(p)
Deletes an element pointed to by p (there is no need
(there is no need
to specify the conventional first element of the list)
list_empty(p)
Checks if the list specified by the address of its conventional first element is empty
list_entry(p,t,f)
Returns the address of the data structure of type
t
in which the list_head
field
that has the name f
and the address
p
is included
list_for_each(p,h)
Scans the elements of the list specified by the address
h
of the conventional first element (similar to
for_each_task
for the process list)
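As promised, here is a short sketch of how these primitives fit together. The my_item structure and my_list variable are hypothetical names introduced for illustration, not kernel data structures:
#include <linux/list.h>

/* A hypothetical structure that participates in a doubly linked list. */
struct my_item {
    int value;
    struct list_head list;              /* embedded list element */
};

LIST_HEAD(my_list);                     /* conventional first element */

void insert_and_scan(struct my_item *item)
{
    struct list_head *p;
    struct my_item *entry;

    list_add_tail(&item->list, &my_list);   /* insert at the end of the list */

    list_for_each(p, &my_list) {            /* scan the whole list */
        entry = list_entry(p, struct my_item, list);
        if (entry == item)
            break;                          /* found the element just inserted */
    }
}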
When looking for a new process to run on the CPU, the kernel has to
consider only the runnable processes (that is, the processes in the
TASK_RUNNING
state). Since it is rather
inefficient to scan the whole process list, a doubly linked circular
list of TASK_RUNNING
processes called
runqueue
has been introduced. This list is implemented through the
run_list
field of type
list_head
in the process descriptor. As in the
previous case, the init_task
process descriptor
plays the role of list header. The nr_running
variable stores the total number of runnable processes.
The add_to_runqueue( )
function inserts a process
descriptor at the beginning of the list, while
del_from_runqueue( )
removes a process descriptor
from the list. For scheduling purposes, two functions,
move_first_runqueue( )
and
move_last_runqueue( )
, are provided to move a
process descriptor to the beginning or the end of the runqueue,
respectively. The task_on_runqueue( )
function
checks whether a given process is inserted into the runqueue.
Finally, the wake_up_process( )
function is used
to make a process runnable. It sets the process state to
TASK_RUNNING
and invokes add_to_runqueue( )
to insert the process in the runqueue list. It also
forces the invocation of the scheduler when the process has a dynamic
priority larger than that of the current process or, in SMP systems,
that of a process currently executing on some other CPU (see Chapter 11).
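The listing below is only a plausible rendition of two of these helpers in terms of the list primitives just described; the actual kernel code may differ in its details:
/* Sketch: insert a process descriptor at the beginning of the runqueue
 * (right after the list header) and update the count of runnable processes. */
static inline void add_to_runqueue(struct task_struct *p)
{
    list_add(&p->run_list, &init_task.run_list);
    nr_running++;
}

/* Sketch: a process is considered on the runqueue if its run_list element
 * is linked, assuming the pointer is cleared when the descriptor is removed
 * from the list. */
static inline int task_on_runqueue(struct task_struct *p)
{
    return (p->run_list.next != NULL);
}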
In several circumstances, the kernel must be able to derive the
process descriptor pointer corresponding to a PID. This occurs, for
instance, in servicing the kill( )
system call.
When process P1 wishes to send a signal to another process, P2, it
invokes the kill( )
system call specifying the PID
of P2 as the parameter. The kernel derives the process descriptor
pointer from the PID and then extracts the pointer to the data
structure that records the pending signals from P2’s
process descriptor.
Scanning the process list sequentially and checking the
pid
fields of the process descriptors is feasible
but rather inefficient. To speed up the search, a
pidhash
hash table consisting of
PIDHASH_SZ
elements has been introduced
(PIDHASH_SZ
is usually set to
1,024)
. The table entries contain process
descriptor pointers. The PID is transformed into a table index using
the pid_hashfn
macro:
#define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
As every basic computer science course explains, a hash function does not always ensure a one-to-one correspondence between PIDs and table indexes. Two different PIDs that hash into the same table index are said to be colliding .
Linux uses chaining
to handle colliding PIDs; each table
entry is a doubly linked list of colliding process descriptors. These
lists are implemented by means of the pidhash_next
and pidhash_pprev
fields in the process
descriptor. Figure 3-5 illustrates a
pidhash
table with two lists. The processes having
PIDs 199 and 26,799 hash into the 200th element of the table, while
the process having PID 26,800 hashes into the 217th element of the
table.
Hashing with chaining is preferable to a linear transformation from PIDs to table indexes because at any given instant, the number of processes in the system is usually far below 32,767 (the maximum allowed PID). It is a waste of storage to define a table consisting of 32,768 entries if, at any given instant, most such entries are unused.
The hash_pid( )
and unhash_pid( )
functions are invoked to insert and remove a process in
the pidhash
table, respectively. The
find_task_by_pid( )
function searches the hash
table and returns the process descriptor pointer of the process with
a given PID (or a null pointer if it does not find the process).
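Essentially, find_task_by_pid( ) performs a lookup like the one in the following sketch (simplified with respect to the actual sources):
/* Hash the PID, then walk the chain of colliding descriptors until the
 * requested PID is found (or the chain ends). */
static inline struct task_struct *find_task_by_pid(int pid)
{
    struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];

    for (p = *htable; p && p->pid != pid; p = p->pidhash_next)
        ;
    return p;
}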
Processes created by a program have a parent/child relationship. When a process creates multiple children, these children have sibling relationships. Several fields must be introduced in a process descriptor to represent these relationships. Processes 0 and 1 are created by the kernel; as we shall see later in the chapter, process 1 (init) is the ancestor of all other processes. The descriptor of a process P includes the following fields:
p_opptr
(original parent) Points to the process descriptor of the process that created P or to the descriptor of process 1 (init) if the parent process no longer exists. Therefore, when a shell user starts a background process and exits the shell, the background process becomes the child of init.
p_pptr
(parent) Points to the current parent of P (this is the process that must be
signaled when the child process terminates); its value usually
coincides with that of p_opptr
. It may
occasionally differ, such as when another process issues a
ptrace( )
system call requesting that it be
allowed to monitor P (see Section 20.1.5).
p_cptr
(child) Points to the process descriptor of the youngest child of P — that is, of the process created most recently by it.
p_ysptr
(younger sibling) Points to the process descriptor of the process that has been created immediately after P by P’s current parent.
p_osptr
(older sibling) Points to the process descriptor of the process that has been created immediately before P by P’s current parent.
Figure 3-6 illustrates the parent and sibling
relationships of a group of processes. Process P0 successively
created P1, P2, and P3. Process P3, in turn, created process P4.
Starting with p_cptr
and using the
p_osptr
pointers to siblings, P0 is able to
retrieve all its children.
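In code, the traversal just described boils down to a loop like the following sketch, where P is a hypothetical pointer to the parent’s process descriptor:
/* Visit every child of P, from the youngest to the oldest. */
struct task_struct *child;

for (child = P->p_cptr; child != NULL; child = child->p_osptr) {
    /* child points, in turn, to the descriptor of each child of P */
}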
The runqueue list groups all processes in a
TASK_RUNNING
state. When it comes to grouping
processes in other states, the various states call for different
types of treatment, with Linux opting for one of the choices shown in
the following list.
Processes in a TASK_STOPPED
or in a
TASK_ZOMBIE
state are not linked in specific
lists. There is no need to group processes in either of these two
states, since stopped and zombie processes are accessed only via PID
or via linked lists of the child processes for a particular parent.
Processes in a TASK_INTERRUPTIBLE
or
TASK_UNINTERRUPTIBLE
state are subdivided into
many classes, each of which corresponds to a specific event. In this
case, the process state does not provide enough information to
retrieve the process quickly, so it is necessary to introduce
additional lists of processes. These are called wait queues
.
Wait queues have several uses in the kernel, particularly for interrupt handling, process synchronization, and timing. Because these topics are discussed in later chapters, we’ll just say here that a process must often wait for some event to occur, such as for a disk operation to terminate, a system resource to be released, or a fixed interval of time to elapse. Wait queues implement conditional waits on events: a process wishing to wait for a specific event places itself in the proper wait queue and relinquishes control. Therefore, a wait queue represents a set of sleeping processes, which are woken up by the kernel when some condition becomes true.
Wait queues are implemented as doubly
linked lists whose elements include pointers to process descriptors.
Each wait queue is identified by a wait queue head
, a data structure of type
wait_queue_head_t
:
struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;
Since wait queues are modified by interrupt handlers as well by major
kernel functions, the doubly linked lists must be protected from
concurrent accesses, which could induce unpredictable results (see
Chapter 5). Synchronization is achieved by the
lock
spin lock in the wait queue head.
Elements of a wait queue list are of type
wait_queue_t
:
struct __wait_queue {
    unsigned int flags;
    struct task_struct * task;
    struct list_head task_list;
};
typedef struct __wait_queue wait_queue_t;
Each element in the wait queue list
represents a sleeping process, which is waiting for some event to
occur; its descriptor address is stored in the
task
field. However, it is not always convenient
to wake up all sleeping processes in a wait
queue.
For instance, if two or more processes are waiting for exclusive access to some resource to be released, it makes sense to wake up just one process in the wait queue. This process takes the resource, while the other processes continue to sleep. (This avoids a problem known as the “thundering herd,” with which multiple processes are awoken only to race for a resource that can be accessed by one of them, and the result is that remaining processes must once more be put back to sleep.)
Thus, there are two kinds of sleeping processes: exclusive processes (denoted by the value 1 in the flags field of the corresponding wait queue element) are selectively woken up by the kernel, while nonexclusive processes (denoted by the value 0 in flags) are always woken up by the kernel when the event occurs. A process waiting for a resource that can be granted to just one process at a time is a typical exclusive process. Processes waiting for an event like the termination of a disk operation are nonexclusive.
The add_wait_queue( )
function inserts a
nonexclusive process in the first position of a wait queue list. The
add_wait_queue_exclusive( )
function inserts an
exclusive process in the last position of a wait queue list. The
remove_wait_queue( )
function removes a process
from a wait queue list. The waitqueue_active( )
function checks whether a given wait queue list is empty.
A new wait queue may be defined by using the
DECLARE_WAIT_QUEUE_HEAD(name)
macro, which
statically declares and initializes a new wait queue head variable
called name
. The init_waitqueue_head( )
function may be used to initialize a wait queue head
variable that was allocated dynamically.
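The following sketch shows both ways of setting up a wait queue head; event_wait and my_device are illustrative names, not kernel data structures:
DECLARE_WAIT_QUEUE_HEAD(event_wait);    /* static declaration and initialization */

struct my_device {
    wait_queue_head_t wait;             /* wait queue head embedded in another
                                           structure */
};

void my_device_init(struct my_device *dev)
{
    init_waitqueue_head(&dev->wait);    /* dynamic initialization */
}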
A process wishing to wait for a specific condition can invoke any of the functions shown in the following list.
The sleep_on( )
function operates on the current
process:
void sleep_on(wait_queue_head_t *q)
{
    unsigned long flags;
    wait_queue_t wait;

    wait.flags = 0;
    wait.task = current;
    current->state = TASK_UNINTERRUPTIBLE;
    add_wait_queue(q, &wait);
    schedule( );
    remove_wait_queue(q, &wait);
}
The function sets the state of the current process to
TASK_UNINTERRUPTIBLE
and inserts it into the
specified wait queue. Then it invokes the scheduler, which resumes
the execution of another process. When the sleeping process is woken,
the scheduler resumes execution of the sleep_on( )
function, which removes the process from the wait queue.
The interruptible_sleep_on( )
is identical to
sleep_on( )
, except that it sets the state of the
current process to TASK_INTERRUPTIBLE
instead of
setting it to TASK_UNINTERRUPTIBLE
so that the
process can also be woken up by receiving a signal.
The sleep_on_timeout( )
and
interruptible_sleep_on_timeout( )
functions are
similar to the previous ones, but they also allow the caller to
define a time interval after which the process will be woken up by
the kernel. To do this, they invoke the schedule_timeout( )
function instead of schedule( )
(see
Section 6.6.2).
The wait_event
and
wait_event_interruptible
macros, introduced in
Linux 2.4, put the calling process to sleep on a wait queue until a
given condition is verified. For instance, the
wait_event_interruptible(wq,condition)
macro
essentially yields the following fragment (we have omitted the code
related to signal handling and return values on purpose):
if (!(condition)) {
    wait_queue_t __wait;

    init_waitqueue_entry(&__wait, current);
    add_wait_queue(&wq, &__wait);
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        schedule();
    }
    current->state = TASK_RUNNING;
    remove_wait_queue(&wq, &__wait);
}
These macros should be used instead of the older sleep_on( )
and interruptible_sleep_on( )
, because
the latter functions cannot test a condition and atomically put the
process to sleep when the condition is not verified and are thus a
well-known source of race conditions.
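A typical pairing of these macros with the wake-up macros described next looks like the following sketch; data_ready and data_wait are illustrative names, and a real driver would also protect data_ready against concurrent accesses:
static DECLARE_WAIT_QUEUE_HEAD(data_wait);
static int data_ready = 0;

/* Sleeper: suspend the current process until data_ready becomes nonzero. */
void wait_for_data(void)
{
    if (wait_event_interruptible(data_wait, data_ready != 0))
        return;                 /* woken up by a signal instead */
}

/* Waker: make the condition true first, then wake up the sleeping processes. */
void data_has_arrived(void)
{
    data_ready = 1;
    wake_up(&data_wait);
}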
Notice that any process put to sleep by one of the above functions or
macros is nonexclusive. Whenever the kernel wants to insert an
exclusive process into a wait queue, it invokes
add_wait_queue_exclusive( )
directly.
Processes inserted in a wait queue enter the
TASK_RUNNING
state by means of one of the
following macros: wake_up
,
wake_up_nr
, wake_up_all
,
wake_up_sync
, wake_up_sync_nr
,
wake_up_interruptible
,
wake_up_interruptible_nr
,
wake_up_interruptible_all
,
wake_up_interruptible_sync
, and
wake_up_interruptible_sync_nr
. We can understand
what each of these ten macros does from its name:
All macros take into consideration sleeping processes in
TASK_INTERRUPTIBLE
state; if the macro name does
not include the string
“interruptible,” sleeping processes
in TASK_UNINTERRUPTIBLE
state are also considered.
All macros wake all nonexclusive processes having the required state (see the previous bullet item).
The macros whose name include the string “nr” wake a given number of exclusive processes having the required state; this number is a parameter of the macro. The macros whose name include the string “all” wake all exclusive processes having the required state. Finally, the macros whose names don’t include “nr” or “all” wake exactly one exclusive process that has the required state.
The macros whose names don’t include the string
“sync” check whether the priority
of the woken processes is higher than that of the processes currently
running in the system and invoke schedule( )
if
necessary. These checks are not made by the macros whose names
include the string “sync.”
For instance, the wake_up
macro is equivalent to
the following code fragment:
void wake_up(wait_queue_head_t *q)
{
    struct list_head *tmp;
    wait_queue_t *curr;

    list_for_each(tmp, &q->task_list) {
        curr = list_entry(tmp, wait_queue_t, task_list);
        wake_up_process(curr->task);
        if (curr->flags)
            break;
    }
}
The list_for_each
macro scans all items in the
doubly linked list of q
. For each item, the
list_entry
macro computes the address of the
correspondent wait_queue_t
variable. The
task
field of this variable stores the pointer to
the process descriptor, which is then passed to the
wake_up_process( )
function. If the woken process
is exclusive, the loop terminates. Since all nonexclusive processes
are always at the beginning of the doubly linked list and all
exclusive processes are at the end, the function always wakes the
nonexclusive processes and then wakes one exclusive process, if any
exists.[19] Notice that awoken processes are not removed from the
wait queue. A process could be awoken while the wait condition is
still false; in this case, the process may suspend itself again in
the same wait queue.
Each process has an associated set of resource limits , which specify the amount of system resources it can use. These limits keep a user from overwhelming the system (its CPU, disk space, and so on). Linux recognizes the following resource limits:
RLIMIT_AS
The maximum size of process address space, in bytes. The kernel
checks this value when the process uses malloc( )
or a related function to enlarge its address space (see
Section 8.1).
RLIMIT_CORE
The maximum core dump file size, in bytes. The kernel checks this
value when a process is aborted, before creating a
core
file in the current directory of the process
(see Section 10.1.1). If the limit is 0,
the kernel won’t create the file.
RLIMIT_CPU
The maximum CPU time for the process, in seconds. If the process
exceeds the limit, the kernel sends it a SIGXCPU
signal, and then, if the process doesn’t terminate,
a SIGKILL
signal (see Chapter 10).
RLIMIT_DATA
The maximum heap size, in bytes. The kernel checks this value before expanding the heap of the process (see Section 8.6).
RLIMIT_FSIZE
The maximum file size allowed, in bytes. If the process tries to
enlarge a file to a size greater than this value, the kernel sends it
a SIGXFSZ
signal.
RLIMIT_LOCKS
The maximum number of file locks. The kernel checks this value when the process enforces a lock on a file (see Section 12.7).
RLIMIT_MEMLOCK
The maximum size of nonswappable memory, in bytes. The kernel checks
this value when the process tries to lock a page frame in memory
using the mlock( )
or mlockall( )
system calls (see Section 8.3.4).
RLIMIT_NOFILE
The maximum number of open file descriptors. The kernel checks this value when opening a new file or duplicating a file descriptor (see Chapter 12).
RLIMIT_NPROC
The maximum number of processes that the user can own (see Section 3.4.1 later in this chapter).
RLIMIT_RSS
The maximum number of page frames owned by the process. The kernel
checks this value when the process uses malloc( )
or a related function to enlarge its address space (see
Section 8.1).
RLIMIT_STACK
The maximum stack size, in bytes. The kernel checks this value before expanding the User Mode stack of the process (see Section 8.4).
The resource limits are stored in the rlim
field
of the process descriptor. The field is an array of elements of type
struct rlimit
, one for each resource limit:
struct rlimit {
    unsigned long rlim_cur;
    unsigned long rlim_max;
};
The rlim_cur
field is the current resource limit
for the resource. For example,
current->rlim[RLIMIT_CPU].rlim_cur
represents
the current limit on the CPU time of the running process.
The rlim_max
field is the maximum allowed value
for the resource limit. By using the getrlimit( )
and setrlimit( )
system calls, a user can always
increase the rlim_cur
limit of some resource up to
rlim_max
. However, only the superuser (or, more
precisely, a user who has the CAP_SYS_RESOURCE
capability) can increase the rlim_max
field or set
the rlim_cur
field to a value greater than the
corresponding rlim_max
field.
Most resource limits contain the value
RLIM_INFINITY
(0xffffffff
),
which means that no user limit is imposed on the corresponding
resource (of course, real limits exist due to kernel design
restrictions, available RAM, available space on disk, etc.). However,
the system administrator may choose to impose stronger limits on some
resources. Whenever a user logs into the system, the kernel creates a
process owned by the superuser, which can invoke setrlimit( )
to decrease the rlim_max
and
rlim_cur
fields for a resource. The same process
later executes a login shell and becomes owned by the user. Each new
process created by the user inherits the content of the
rlim
array from its parent, and therefore the user
cannot override the limits enforced by the system.
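From User Mode, these limits are read and written with the getrlimit( ) and setrlimit( ) system calls; the following minimal user-space program (a sketch of typical usage) raises its soft limit on open file descriptors up to the hard limit:
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) == -1) {
        perror("getrlimit");
        return 1;
    }
    printf("open files: rlim_cur=%lu rlim_max=%lu\n",
           (unsigned long) rl.rlim_cur, (unsigned long) rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;  /* allowed without the CAP_SYS_RESOURCE capability */
    if (setrlimit(RLIMIT_NOFILE, &rl) == -1) {
        perror("setrlimit");
        return 1;
    }
    return 0;
}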
[17] Technically, these 32 bits are only the offset component of a logical address. However, since Linux uses a single kernel data segment, we can consider the offset to be equivalent to a whole logical address. Furthermore, since the base addresses of the code and data segments are set to 0, we can treat the offset as a linear address.
[18] One drawback to the shared-storage approach is that, for efficiency reasons, the kernel stores the 8 KB memory area in two consecutive page frames with the first page frame aligned to a multiple of 2^13. This may turn out to be a problem when little dynamic memory is available.
[19] By the way, it is rather uncommon that a wait queue includes both exclusive and nonexclusive processes.