The Linux scheduling algorithm works by dividing the CPU time into epochs. In a single epoch, every process has a specified time quantum whose duration is computed when the epoch begins. In general, different processes have different time quantum durations. The time quantum value is the maximum CPU time portion assigned to the process in that epoch. When a process has exhausted its time quantum, it is preempted and replaced by another runnable process. Of course, a process can be selected several times by the scheduler in the same epoch, as long as its quantum has not been exhausted—for instance, if it suspends itself to wait for I/O, it preserves some of its time quantum and can be selected again during the same epoch. The epoch ends when all runnable processes have exhausted their quanta; in this case, the scheduler algorithm recomputes the time-quantum durations of all processes and a new epoch begins.
Each process has a base time quantum, which is the time-quantum value
assigned by the scheduler to the process if it has exhausted its
quantum in the previous epoch. The users can change the base time
quantum of their processes by using the nice( )
and setpriority( )
system calls (see
Section 11.3 later in this
chapter). A new process always inherits the base time quantum of its
parent.
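For example, a user can lower the base time quantum of a process by raising its nice value. The following user-space fragment is a minimal sketch of both calls (error handling omitted):

#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
    /* Add 10 to the nice value of the calling process,
       thus shrinking its base time quantum. */
    nice(10);

    /* Equivalent effect: set the nice value of the calling
       process (a "who" argument of 0 means the caller) to 10. */
    setpriority(PRIO_PROCESS, 0, 10);

    return 0;
}

Note that an unprivileged process may only increase its nice value (i.e., lower its own priority); decreasing it requires suitable privileges.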
The INIT_TASK
macro sets the value of the initial
time quantum of process 0 (swapper) to
DEF_COUNTER
; that macro is defined as follows:
#define DEF_COUNTER (10 * HZ / 100)
Since HZ
(which denotes the frequency of timer
interrupts) is set to 100 for IBM compatible PCs (see
Section 6.1.3), the value of
DEF_COUNTER
is 10 ticks—that is, about 100 ms.
To select a process to run, the Linux scheduler must consider the priority of each process. Actually, there are two kinds of priorities:
Static priority
This is assigned by the users to real-time processes and ranges from 1 to 99. It is never changed by the scheduler.
Dynamic priority
This applies only to conventional processes; it is essentially the sum of the base time quantum (which is therefore also called the base priority of the process) and of the number of ticks of CPU time left to the process before its quantum expires in the current epoch.
Of course, the static priority of a real-time process is always
higher than the dynamic priority of a conventional one. The scheduler
starts running conventional processes only when there is no real-time
process in a TASK_RUNNING
state.
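For instance, a suitably privileged process can give itself a real-time policy and a static priority with the sched_setscheduler( ) system call; the following user-space fragment is a minimal sketch (in Linux 2.4, the caller needs the CAP_SYS_NICE capability):

#include <stdio.h>
#include <sched.h>

int main(void)
{
    struct sched_param param;

    param.sched_priority = 50;  /* static priority: ranges from 1 to 99 */
    /* A pid argument of 0 selects the calling process. */
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1)
        perror("sched_setscheduler");
    return 0;
}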
There is always at least one runnable process: the swapper kernel thread, which has PID 0 and executes only when the CPU cannot execute other processes. As mentioned in Chapter 3, every CPU of a multiprocessor system has its own kernel thread with PID equal to 0.
Recall from Section 3.2 that the process list links all process
descriptors, while the runqueue list links the process descriptors of
all runnable processes—that is, of those in a
TASK_RUNNING
state. In both cases, the
init_task
process descriptor plays the role of
list header.
Each process descriptor includes several fields related to scheduling:
need_resched
A flag checked by ret_from_sys_call( )
to decide
whether to invoke the schedule( )
function (see
Section 4.8.3).[74]
policy
The scheduling class. The values permitted are:
SCHED_FIFO
A First-In, First-Out real-time process. When the scheduler assigns the CPU to the process, it leaves the process descriptor in its current position in the runqueue list. If no other higher-priority real-time process is runnable, the process continues to use the CPU as long as it wishes, even if other real-time processes that have the same priority are runnable.
SCHED_RR
A Round Robin real-time process. When the scheduler assigns the CPU
to the process, it puts the process descriptor at the end of the
runqueue list. This policy ensures a fair assignment of CPU time to
all SCHED_RR
real-time processes that have the
same priority.
SCHED_OTHER
A conventional, time-shared process.
The policy
field also encodes a
SCHED_YIELD
binary flag. This flag is set when the
process invokes the sched_yield( )
system call (a
way of voluntarily relinquishing the processor without the need to
start an I/O operation or go to sleep; see Section 11.3 later in this chapter). The kernel also sets the
SCHED_YIELD
flag and invokes the
schedule( )
function whenever it is executing a
long noncritical task and wishes to give other processes a chance to
run.
rt_priority
The static priority of a real-time process; valid priorities range between 1 and 99. The static priority of a conventional process must be set to 0.
counter
The number of ticks of CPU time left to the process before its
quantum expires; when a new epoch begins, this field contains the
time-quantum duration of the process. Recall that the
update_process_times( )
function decrements the
counter
field of the current process by 1 at every
tick.
nice
Determines the length of the process time quantum when a new epoch begins. This field contains values ranging between -20 and +19; negative values correspond to “high priority” processes, positive ones to “low priority” processes. The default value 0 corresponds to normal processes.
cpus_allowed
A bit mask specifying the CPUs on which the process is allowed to run. In the 80 × 86 architecture, the maximum number of processors is set to 32, so the whole mask can be encoded in a single integer field.
cpus_runnable
A bit mask specifying the CPU that is executing the process, if any.
If the process is not executed by any CPU, all bits of the field are
set to 1. Otherwise, all bits of the field are set to 0, except the
bit associated with the executing CPU, which is set to 1. This
encoding allows the kernel to verify whether the process can be
scheduled on a given CPU by simply computing the logical AND between
this field, the cpus_allowed
field, and the bit
mask specifying the CPU.
processor
The index of the CPU that is executing the process, if any; otherwise, the index of the last CPU that executed the process.
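Putting these fields together, the scheduling-related portion of the process descriptor looks roughly like the following excerpt (simplified from the Linux 2.4 declaration of task_struct; unrelated fields are omitted):

struct task_struct {
    volatile long state;          /* TASK_RUNNING, TASK_INTERRUPTIBLE, ... */
    volatile long need_resched;   /* flag for lazy invocation of schedule( ) */
    int lock_depth;               /* nesting depth of the global kernel lock */
    long counter;                 /* ticks left in the current quantum */
    long nice;                    /* -20 (high priority) to +19 (low priority) */
    unsigned long policy;         /* SCHED_OTHER, SCHED_FIFO, or SCHED_RR */
    struct mm_struct *mm;         /* owned address space (NULL for kernel threads) */
    int processor;                /* CPU running, or last to run, the process */
    unsigned long cpus_runnable;  /* ~0 if not running, else 1 << processor */
    unsigned long cpus_allowed;   /* CPUs on which the process may run */
    unsigned long rt_priority;    /* static priority of real-time processes */
    /* ... many other fields ... */
};

The test described under cpus_runnable is encapsulated by the kernel in a macro along these lines:

#define can_schedule(p,cpu) \
    ((p)->cpus_runnable & (p)->cpus_allowed & (1 << (cpu)))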
When a new process is created, do_fork( )
sets the
counter
field of both current
(the parent) and p
(the child) processes in the
following way:
p->counter = (current->counter + 1) >> 1;
current->counter >>= 1;
if (!current->counter)
    current->need_resched = 1;
In other words, the number of ticks left to the parent is split in two halves: one for the parent and one for the child. For instance, if the parent has seven ticks left when it forks, the child gets (7+1)>>1 = 4 of them and the parent keeps 7>>1 = 3; together they never exceed what the parent had. This is done to prevent users from getting an unlimited amount of CPU time by using the following method: the parent process creates a child process that runs the same code and then kills itself; by properly adjusting the creation rate, the child process would always get a fresh quantum before the quantum of its parent expires. This programming trick does not work since the kernel does not reward forks. Similarly, a user cannot hog an unfair share of the processor by starting lots of background processes in a shell or by opening a lot of windows on a graphical desktop. More generally speaking, a process cannot hog resources (unless it has privileges to give itself a real-time policy) by forking multiple descendants.
Besides the fields included in each process descriptor, additional
information is needed to describe what each CPU is doing. To that
end, the scheduler can rely on the aligned_data
array of NR_CPUS
structures of type
schedule_data
. Each such structure consists of two
fields:
curr
A pointer to the process descriptor of the process running on that
CPU. The field is usually accessed by means of the
cpu_curr(n)
macro, where n
is
the CPU logical number.
last_schedule
The value of the 64-bit Time Stamp Counter when the last process
switch was performed on the CPU. The field is usually accessed by
means of the last_schedule(n)
macro, where
n
is the CPU logical number.
Most of the time, any CPU accesses only its own array element; it is
thus convenient to align the entries of the
aligned_data
array so that every element falls in
a different cache line. In this way, the CPUs have a better chance to
find their own element in the hardware cache.
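The array is declared along these lines (simplified from kernel/sched.c in Linux 2.4):

static union {
    struct schedule_data {
        struct task_struct *curr;   /* process running on this CPU */
        cycles_t last_schedule;     /* TSC value at the last process switch */
    } schedule_data;
    char __pad[SMP_CACHE_BYTES];    /* pad each entry to a full cache line */
} aligned_data[NR_CPUS] __cacheline_aligned = { {{ &init_task, 0 }} };

#define cpu_curr(cpu)       aligned_data[(cpu)].schedule_data.curr
#define last_schedule(cpu)  aligned_data[(cpu)].schedule_data.last_schedule

The padding union member is what forces every entry onto its own cache line.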
The schedule( )
function implements the scheduler. Its objective is to find a process
in the runqueue list and then assign the CPU to it. It is invoked,
directly or in a lazy (deferred) way, by several kernel routines.
The scheduler is invoked directly
when the current
process must be blocked right
away because the resource it needs is not available. In this case,
the kernel routine that wants to block it proceeds as follows:
1. Inserts current in the proper wait queue.
2. Changes the state of current either to TASK_INTERRUPTIBLE or to TASK_UNINTERRUPTIBLE.
3. Invokes schedule( ).
4. Checks whether the resource is available; if not, goes to Step 2.
5. Once the resource is available, removes current from the wait queue.
As can be seen, the kernel routine checks repeatedly whether the
resource needed by the process is available; if not, it yields the
CPU to some other process by invoking schedule( )
.
Later, when the scheduler once again grants the CPU to the process,
the availability of the resource is rechecked. These steps are
similar to those performed by the sleep_on( )
and
interruptible_sleep_on( )
functions described in
Section 3.2.4.
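A kernel routine following these steps might be sketched as below; resource_available( ) is a hypothetical predicate standing for whatever condition the routine waits on, and wq is the corresponding wait queue:

DECLARE_WAIT_QUEUE_HEAD(wq);

void wait_for_resource(void)
{
    DECLARE_WAITQUEUE(wait, current);

    add_wait_queue(&wq, &wait);                  /* Step 1 */
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE); /* Step 2 */
        if (resource_available())                /* Step 4 */
            break;
        schedule();                              /* Step 3 */
    }
    set_current_state(TASK_RUNNING);
    remove_wait_queue(&wq, &wait);               /* Step 5 */
}

Note that the condition is tested after setting the process state but before each call to schedule( ), so a wakeup that arrives in between is not lost.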
The scheduler is also directly invoked by many device drivers that
execute long iterative tasks. At each iteration cycle, the driver
checks the value of the need_resched
field and, if
necessary, invokes schedule( )
to voluntarily
relinquish the CPU.
The scheduler can also be invoked
in a lazy way by setting the need_resched
field of
current
to 1. Since a check on the value of this
field is always made before resuming the execution of a User Mode
process (see Section 4.8),
schedule( )
will definitely be invoked at some
time in the near future.
For instance, lazy invocation of the scheduler is performed in the following cases:
When current
has used up its quantum of CPU time;
this is done by the update_process_times( )
function.
When a process is woken up and its priority is higher than that of
the current process; this task is performed by the
reschedule_idle( )
function, which is usually
invoked by the wake_up_process( )
function (see
Section 3.2.2).
When a sched_setscheduler( )
or sched_yield( )
system call is issued (see Section 11.3 later in this chapter).
The goal of the schedule( )
function consists of replacing the currently executing
process with another one. Thus, the key outcome of the function is to
set a local variable called next
so that it points
to the descriptor of the process selected to replace
current
. If no runnable process in the system has
priority greater than the priority of current
, at
the end, next
coincides with
current
and no process switch takes place.
For efficiency reasons, the schedule( )
function
starts by initializing a few local variables:
prev = current;
this_cpu = prev->processor;
sched_data = &aligned_data[this_cpu];
As you see, the pointer returned by current
is
saved in prev
, the logical number of the executing
CPU is saved in this_cpu
, and the pointer to the
aligned_data
array element of the CPU is saved in
sched_data
.
Next, schedule( )
makes sure that
prev
doesn’t hold the global
kernel lock or the global interrupt lock (see Section 5.5.2 and Section 5.3.10), and then reenables the local interrupts:
if (prev->lock_depth >= 0)
    spin_unlock(&kernel_flag);
release_irqlock(this_cpu);
__sti();
Generally speaking, a process should never hold a lock across a
process switch; otherwise, the system freezes as soon as another
process tries to acquire the same lock. However, notice that
schedule( )
doesn’t change the
value of the lock_depth
field; when
prev
resumes its execution, it reacquires the
kernel_flag
spin lock if the value of this field
is not negative. Thus, the global kernel lock is automatically
released and reacquired across a process switch. Conversely, the
global interrupt lock is not automatically reacquired.
Before starting to look at the runnable processes, schedule( )
must disable the local interrupts and acquire the spin
lock that protects the runqueue (see Section 3.2.2.5):
spin_lock_irq(&runqueue_lock);
A check is then made to determine whether prev
is
a Round Robin real-time process (policy
field set
to SCHED_RR
) that has exhausted its quantum. If
so, schedule( )
assigns a new quantum to
prev
and puts it at the bottom of the runqueue
list:
if (prev->policy == SCHED_RR && !prev->counter) {
    prev->counter = (20 - prev->nice) / 4 + 1;
    move_last_runqueue(prev);
}
Recall that the nice
field of a process ranges
between -20 and +19; therefore, schedule( )
replenishes the counter
field with a number of
ticks ranging from 11 to 1. The default value of the
nice
field is 0, so usually the process gets a new
quantum of 6 ticks, roughly 60 ms.[75]
Next, schedule( )
examines the state of
prev
. If it has nonblocked pending signals and its
state is TASK_INTERRUPTIBLE
, the function sets the
process state to TASK_RUNNING
. This action is not
the same as assigning the processor to prev
; it
just gives prev
a chance to be selected for
execution:
if (prev->state == TASK_INTERRUPTIBLE && signal_pending(prev))
    prev->state = TASK_RUNNING;
If prev
is not in the
TASK_RUNNING
state, schedule( )
was directly invoked by the process itself because it had to wait on
some external resource; therefore, prev
must be
removed from the runqueue list:
if (prev->state != TASK_RUNNING)
    del_from_runqueue(prev);
The function also resets the need_resched
field of
prev
, just in case the scheduler was activated
in the lazy way:
prev->need_resched = 0;
Now the time has come for schedule( )
to select
the process to be executed in the next time quantum. To that end, the
function scans the runqueue list. The objective is to store in
next
the process descriptor pointer of the highest
priority process:
repeat_schedule:
    next = init_tasks[this_cpu];
    c = -1000;
    list_for_each(tmp, &runqueue_head) {
        p = list_entry(tmp, struct task_struct, run_list);
        if (p->cpus_runnable & p->cpus_allowed & (1 << this_cpu)) {
            int weight = goodness(p, this_cpu, prev->active_mm);
            if (weight > c)
                c = weight, next = p;
        }
    }
The function initializes next
so it points to the
process referenced by
init_tasks[this_cpu]—that is, to the process
—that is, to the process
(swapper) associated with the executing CPU; the
c
local variable is set to -1000. As we shall see
in Section 11.2.2.5 later in this chapter,
the goodness( )
function returns an integer that
denotes the priority of the process passed as parameter.
While scanning processes in the runqueue, schedule( )
considers only those that are both:
Runnable on the executing CPU (cpus_allowed & (1<<this_cpu)
).
Not already running on some other CPU. (cpus_runnable & (1<<this_cpu)
; see the description of
cpus_runnable
in the previous section.)
The loop selects the first process in the runqueue that has the
maximum weight. Thus, at the end of the search,
next
points to the best candidate, and the
c
local variable contains its priority. It is
possible that the runqueue list is empty; in this case, the cycle is
not executed, and next
points to the
swapper kernel thread associated with the
executing CPU. It is also possible that the best candidate turns out
to be the old current process prev
.
A particular case occurs when the local variable c
is set to 0 at the exit of the loop. This happens only when all
processes in the runqueue list that are runnable by the executing CPU
have exhausted their quantum—i.e., all of them have a zero
counter
field. A new epoch must then begin, and
schedule( )
assigns to all existing processes (not
only to the TASK_RUNNING
ones) a fresh quantum,
whose duration is half the counter
value plus an
increment that depends on the value of nice
:
if (!c) {
    struct task_struct *p;
    spin_unlock_irq(&runqueue_lock);
    read_lock(&tasklist_lock);
    for_each_task(p)
        p->counter = (p->counter >> 1) + (20 - p->nice) / 4 + 1;
    read_unlock(&tasklist_lock);
    spin_lock_irq(&runqueue_lock);
    goto repeat_schedule;
}
In this way, suspended or stopped processes have their dynamic
priorities periodically increased. As stated earlier, the rationale
for increasing the counter
value of suspended or
stopped processes is to give preference to I/O-bound processes.
However, no matter how often the quantum is increased, its value can
never become greater than about 220 ms.[76]
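The convergence can be checked with a quick user-space computation of the recurrence (a standalone sketch, not kernel code):

#include <stdio.h>

int main(void)
{
    long counter = 0;
    long nice = -20;   /* the highest-priority conventional process */
    int epoch;

    /* Iterate the per-epoch formula used by schedule( ). */
    for (epoch = 1; epoch <= 8; epoch++) {
        counter = (counter >> 1) + (20 - nice) / 4 + 1;
        printf("epoch %d: %ld ticks\n", epoch, counter);
    }
    /* Prints 11, 16, 19, 20, 21, 21, ...: the quantum never
       exceeds the 22-tick (about 220 ms) theoretical bound. */
    return 0;
}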
Let’s assume now that schedule( )
has selected its best candidate, and that next
points to its process descriptor. The function updates the
aligned_data
array element of the executing CPU
(this element is referenced by the sched_data
local variable), writes the index of the executing CPU in
next
’s process descriptor,
releases the runqueue list spin lock, and reenables local interrupts:
sched_data->curr = next;
next->processor = this_cpu;
next->cpus_runnable = 1UL << this_cpu;
spin_unlock_irq(&runqueue_lock);
Now schedule( )
is ready to proceed with the
actual process switch. But wait a moment! If the
next
best candidate is the previously running
process prev
, schedule( )
can
terminate:
if (prev == next) {
    prev->policy &= ~SCHED_YIELD;
    if (prev->lock_depth >= 0)
        spin_lock(&kernel_flag);
    return;
}
Notice that schedule( )
reacquires the global
kernel lock if the lock_depth
field of the process
is not negative, as we anticipated when we described the first
actions of the function.
If a process other than prev
is selected, a
process switch must take place. The current value of the Time Stamp
Counter, fetched by means of the rdtsc
assembly
language instruction, is stored in the
last_schedule
field of the
aligned_data
array element of the executing CPU:
asm volatile("rdtsc" : "=A" (sched_data->last_schedule));
The context_swtch
field of
kstat
is also increased by 1 to update the
statistics maintained by the kernel:
kstat.context_swtch++;
It is also crucial to set up the address space of
next
properly. We know from Chapter 8 that the active_mm
field of
the process descriptor points to the memory descriptor that is
effectively used by the process, while the mm
field points to the memory descriptor owned by the process. For
normal processes, the two fields hold the same address; however, a
kernel thread does not have its own address space and its
mm
field is always set to NULL
.
The schedule( )
function ensures that if
next
is a kernel thread, then it uses the address
space used by prev
:
if (!next->mm) {
    next->active_mm = prev->active_mm;
    atomic_inc(&prev->active_mm->mm_count);
    cpu_tlbstate[this_cpu].state = TLBSTATE_LAZY;
}
In earlier versions of Linux, kernel threads had their own address
space. That design choice was suboptimal when the scheduler selected
a kernel thread as a new process to run because changing the Page
Tables was useless; since any kernel thread runs in Kernel Mode, it
uses only the fourth gigabyte of the linear address space, whose
mapping is the same for all processes in the system. Even worse,
writing into the cr3
register invalidates all TLB
entries (see Section 2.4.8), which leads to a
significant performance penalty. Linux 2.4 is much more efficient
because Page Tables aren’t touched at all if
next
is a kernel thread. As a further optimization,
if next
is a kernel thread, the schedule( )
function sets the process into lazy TLB mode (see
Section 2.4.8).
Conversely, if next
is a regular process, the
schedule( )
function replaces the address space of
prev
with the one of next
:
if (next->mm)
    switch_mm(prev->active_mm, next->mm, next, this_cpu);
If prev
is a kernel thread, the schedule( )
function releases the address space used by
prev
and resets
prev->active_mm
:
if (!prev->mm) {
    mmdrop(prev->active_mm);
    prev->active_mm = NULL;
}
Recall that mmdrop( )
decrements the usage counter
of the memory descriptor; if the counter reaches 0, it also frees the
descriptor together with the associated Page Tables and virtual
memory regions.
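In Linux 2.4, mmdrop( ) is essentially the following inline function:

static inline void mmdrop(struct mm_struct *mm)
{
    /* Drop one reference; free the memory descriptor, together with
       its Page Tables and virtual memory regions, when none are left. */
    if (atomic_dec_and_test(&mm->mm_count))
        __mmdrop(mm);
}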
Now schedule( )
can finally invoke
switch_to( )
to perform the process switch between
prev
and next
(see
Section 3.3.3):
switch_to(prev, next, prev);
The instructions of the
schedule( )
function following the
switch_to
macro invocation will not be performed
right away by the next
process, but at a later
time by prev
when the scheduler selects it again
for execution. However, at that moment, the prev
local variable does not point to our original process that was to be
replaced when we started the description of schedule( )
, but rather to the process that was replaced by our
original prev
when it was scheduled again. (If you
are confused, go back and read Section 3.3.3.)
The last instructions of the schedule( )
function
are:
__schedule_tail(prev);
if (current->lock_depth >= 0)
    spin_lock(&kernel_flag);
if (current->need_resched)
    goto need_resched_back;
return;
As you see, schedule( )
invokes __schedule_tail( )
(described next), reacquires the global
kernel lock if necessary, and checks whether some other process has
set the need_resched
field of
prev
while it was not running. In this case, the
whole schedule( )
function is reexecuted from the
beginning; otherwise, the function terminates.
In uniprocessor systems, the __schedule_tail( )
function limits itself to clear the SCHED_YIELD
flag of the policy
field of
prev
. Conversely, in multiprocessor systems, the
function executes code that is essentially equivalent to the
following fragment:
policy = prev->policy;
prev->policy = policy & ~SCHED_YIELD;
wmb();
spin_lock(&prev->alloc_lock);
prev->cpus_runnable = ~0UL;
spin_lock_irqsave(&runqueue_lock, flags);
if (prev->state == TASK_RUNNING &&
    prev != init_tasks[smp_processor_id()] &&
    prev->cpus_runnable == ~0UL &&
    !(policy & SCHED_YIELD))
    reschedule_idle(prev);
spin_unlock_irqrestore(&runqueue_lock, flags);
spin_unlock(&prev->alloc_lock);
The wmb( )
memory barrier is used to make sure
that the processor won’t reshuffle the assembly
language instructions that modify the policy
field
with those that acquire the alloc_lock
spin lock
(see Section 5.3.2).
As you may notice, the role of __schedule_tail( )
is far more important in multiprocessor systems because this function
checks whether the process that was replaced can be rescheduled on
some other CPU. This attempt is performed only if the following
conditions are satisfied:
prev
is in TASK_RUNNING
state.
prev
is not the swapper
process of the executing CPU.
The SCHED_YIELD
flag of
prev->policy
was not set.
prev
was not already selected by another CPU in
the time frame elapsed between the assignment to the
cpus_runnable
field and the if
statement (the if
statement itself is protected by
the runqueue_lock
spin lock; see the code shown in
the previous section).
To check whether the priority of prev
is high
enough to replace the current process of some other CPU, __schedule_tail( )
invokes reschedule_idle( )
. This is the same function invoked by
wake_up_process( )
and is described in Section 11.2.2.6 later in this chapter.
The next two sections complete the analysis of the scheduler. They
describe, respectively, the goodness( )
and
reschedule_idle( )
functions.
The heart of the scheduling algorithm
is identifying the best candidate among all processes in the
runqueue list. This is what the goodness( )
function does. It receives as input parameters:
p
, the descriptor pointer of
the candidate process
this_cpu
, the logical number of the executing CPU
this_mm
, the memory descriptor
address of the process being replaced
The integer value weight
returned by
goodness( )
measures the
“goodness” of p
and has the following meanings:
weight
= -1
p
is the prev
process, and its
SCHED_YIELD
flag is set. The process will be
selected only if no other runnable processes (beside the
swapper processes) are included in the runqueue.
weight
= 0
p
is a conventional process that has exhausted its
quantum (p->counter
is zero). Unless all
runnable processes have also exhausted their quantum, it will not be
selected for execution.
weight
<= 77
p
is a conventional process that has not exhausted
its quantum. The weight is computed as follows:
weight = p->counter + 20 - p->nice;
if (p->processor == this_cpu)
    weight += 15;
if (p->mm == this_mm || !p->mm)
    weight += 1;
In multiprocessor systems, a large bonus (+15) is given to the process if it was last running on the CPU that is executing the scheduler. The bonus helps in reducing the number of transfers of processes across several CPUs during their executions, thus yielding a smaller number of hardware cache misses.
The function also gives a small bonus (+1) to the process if it is a
kernel thread or it shares the memory address space with the
previously running process. Again, the process is favored mainly
because the TLBs must not be invalidated by writing into the
cr3
register.
weight
>= 1000
p
is a real-time process. The weight is given by
p->rt_priority + 1000
.
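Putting the four cases together, goodness( ) is essentially equivalent to the following fragment (simplified from kernel/sched.c; in the actual source, the +15 bonus is the PROC_CHANGE_PENALTY macro and is compiled in only on multiprocessor kernels):

static inline int goodness(struct task_struct *p, int this_cpu,
                           struct mm_struct *this_mm)
{
    int weight = -1;

    if (p->policy & SCHED_YIELD)
        goto out;                      /* process has yielded the CPU */

    if (p->policy == SCHED_OTHER) {    /* conventional process */
        weight = p->counter;
        if (!weight)
            goto out;                  /* quantum exhausted */
        if (p->processor == this_cpu)
            weight += 15;              /* hardware cache affinity bonus */
        if (p->mm == this_mm || !p->mm)
            weight += 1;               /* no Page Table switch needed */
        weight += 20 - p->nice;
        goto out;
    }

    weight = 1000 + p->rt_priority;    /* real-time process */
out:
    return weight;
}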
With respect to earlier versions, the Linux 2.4 scheduling algorithm has been improved to enhance its performance on multiprocessor systems. It was also simplified, which is a great improvement by itself.
As we have seen, each processor runs the schedule( )
function on its own to replace the process that is
currently in execution. However, processors are able to exchange
information to boost system performance. In particular, right after a
process switch, any processor usually checks whether the just
replaced process should be executed on some other CPU running a lower
priority process. This check is performed by
reschedule_idle( )
.
The reschedule_idle( )
function looks for some
other CPU to run the process p
passed as parameter
and uses interprocessor interrupts to force other CPUs to perform
scheduling. The function performs a series of tests in a fixed order.
If one of them is successful, the function sends a
RESCHEDULE_VECTOR
interprocessor interrupt to the
selected CPU (see Section 4.6.2) and
returns. If all tests fail, the function returns without forcing a
rescheduling. The tests are performed in the following order:
Is the CPU that was last running p
(i.e., the one
having index p->processor
) currently idle?
best_cpu = p->processor;
if ((p->cpus_allowed & p->cpus_runnable & (1 << best_cpu)) &&
    cpu_curr(best_cpu) == init_tasks[best_cpu]) {
send_now_idle:
    need_resched = init_tasks[best_cpu]->need_resched;
    init_tasks[best_cpu]->need_resched = 1;
    if (best_cpu != smp_processor_id() && !need_resched)
        smp_send_reschedule(best_cpu);
}
This is the best possible case because no process is to be preempted
and the hardware cache of the processor is warm (filled with useful
data). Notice that this case cannot happen when
reschedule_idle( )
is invoked by the scheduler
because schedule( )
never replaces a runnable
process with the swapper kernel thread. This
case may happen, however, when reschedule_idle( )
is invoked by wake_up_process( )
—that is,
when p
has just been woken up.
To force the rescheduling on the target processor, the
need_resched
field of the
swapper kernel thread is set. If the target
processor is different from the one executing the
reschedule_idle( )
function, a
RESCHEDULE_VECTOR
interprocessor interrupt is also
raised. In fact, the idle processor usually executes a
halt
assembly language instruction to save power,
and it can be woken up only by an interrupt. It is also possible,
however, to let the swapper kernel thread actively poll the
need_resched
field, waiting for its value to
change from -1 to +1, in order to speed up the rescheduling and
avoid the interprocessor interrupt. This much more power-consuming
algorithm can be activated by passing the
“idle=poll”
parameter to the kernel in the booting phase.
Does an idle processor exist that can execute p
?
oldest_idle = -1;  /* oldest_idle has an unsigned type, so -1 is the largest value */
for (cpu = 0; cpu < smp_num_cpus; cpu++) {
    if (!(p->cpus_allowed & p->cpus_runnable & (1 << cpu)))
        continue;
    if (cpu_curr(cpu) == init_tasks[cpu] &&
        last_schedule(cpu) < oldest_idle)
        oldest_idle = last_schedule(cpu), target_tsk = cpu_curr(cpu);
}
if (oldest_idle != -1) {
    best_cpu = target_tsk->processor;
    goto send_now_idle;
}
Among the idle processors that can execute p
, the
function selects the least recently active. Recall that the Time
Stamp Counter value of the last process switch of every CPU is stored
in the aligned_data
array (see the earlier section
Section 11.2.1). Once the function
finds the oldest idle CPU, it jumps to the code already
described in the previous case to force a rescheduling. The rationale
behind the “oldest idle rule” is
that this CPU is likely to have the greatest number of invalid
hardware cache lines.
Does there exist a processor that can execute p
and whose current process has lower dynamic priority than
p
?
max_prio = 0;
for (cpu = 0; cpu < smp_num_cpus; cpu++) {
    if (!(p->cpus_allowed & p->cpus_runnable & (1 << cpu)))
        continue;
    prio = goodness(p, cpu, cpu_curr(cpu)->active_mm) -
           goodness(cpu_curr(cpu), cpu, cpu_curr(cpu)->active_mm);
    if (prio > max_prio)
        max_prio = prio, target_tsk = cpu_curr(cpu);
}
if (max_prio > 0) {
    target_tsk->need_resched = 1;
    if (target_tsk->processor != smp_processor_id())
        smp_send_reschedule(target_tsk->processor);
}
reschedule_idle( )
finds the processor for which
the difference between the goodness of replacing the current process
with p
and the goodness of replacing the current
process with the current process itself is maximum. The function
forces a rescheduling on the corresponding processor if the maximum
is a positive value. Notice that the function
doesn’t simply look at the
counter
and nice
fields of the
processes; rather, it uses goodness( )
, which
takes into consideration the cost of replacing the currently running
process with another process that potentially uses a different
address space.
The scheduling algorithm of Linux is both self-contained and relatively easy to follow. For this reason, many kernel hackers love to try to make improvements. However, the scheduler is a rather mysterious component of the kernel. While you can change its performance significantly by modifying just a few key parameters, there is usually no theoretical support to justify the results obtained. Furthermore, you can’t be sure that the positive (or negative) results obtained will continue to hold when the mix of requests submitted by the users (real-time, interactive, I/O-bound, background, etc.) varies significantly. Actually, for almost every proposed scheduling strategy, it is possible to derive an artificial mix of requests that yields poor system performance.
Let’s try to outline some pitfalls of the Linux 2.4 scheduler. As it turns out, some of these limitations become significant on large systems with many users. On a single workstation that is running a few tens of processes at a time, the Linux 2.4 scheduler is quite efficient.
If the number of existing processes is very large, it is inefficient to recompute all dynamic priorities at once.
In old traditional Unix kernels, the dynamic priorities were recomputed every second, so the problem was even worse. Linux tries instead to minimize the overhead of the scheduler. Priorities are recomputed only when all runnable processes have exhausted their time quantum. Therefore, when the number of processes is large, the recomputation phase is more expensive but executed less frequently.
This simple approach has a disadvantage: when the number of runnable processes is very large, I/O-bound processes are seldom boosted, and therefore, interactive applications have a longer response time.
The system responsiveness experienced by users depends heavily on the system load, which is the average number of processes that are runnable and thus waiting for CPU time.[77]
As mentioned before, system responsiveness also depends on the average time-quantum duration of the runnable processes. In Linux, the predefined time quantum appears to be too large for high-end machines that have a very high expected system load.
The preference for I/O-bound processes is a good strategy to ensure a short response time for interactive programs, but it is not perfect. Indeed, some batch programs with almost no user interaction are I/O-bound. For instance, consider a database search engine that must typically read lots of data from the hard disk, or a network application that must collect data from a remote host on a slow link. Even if these kinds of processes do not need a short response time, they are boosted by the scheduling algorithm. On the other hand, interactive programs that are also CPU-bound may appear unresponsive to the users, since the increment of dynamic priority due to I/O blocking operations may not compensate for the decrement due to CPU usage.
As stated in the first chapter, nonpreemptive kernels are not well suited for real-time applications, since processes may spend several milliseconds in Kernel Mode while handling an interrupt or exception. During this time, a real-time process that becomes runnable cannot be resumed. This is unacceptable for real-time applications, which require predictable and low response times.[78]
Future versions of Linux might address this problem by either implementing SVR4’s “fixed preemption points” or by making the kernel fully preemptive. It remains questionable, however, whether these design choices are appropriate for a general-purpose operating system such as Linux.
Kernel preemption, in fact, is just one of several necessary conditions for implementing an effective real-time scheduler. Several other issues must be considered. For instance, real-time processes often must use resources that are also needed by conventional processes. A real-time process may thus end up waiting until a lower-priority process releases some resource. This phenomenon is called priority inversion. Moreover, a real-time process could require a kernel service that is granted on behalf of another lower-priority process (for example, a kernel thread). This phenomenon is called hidden scheduling. An effective real-time scheduler should address and resolve such problems.
Luckily, all these shortcomings have been fixed in the brand new scheduler developed by Ingo Molnar that is included in the Linux 2.5 current development version. This scheduler is so efficient that it has been back-ported to Linux 2.4 and adopted by some commercial distributions of the GNU/Linux system.
[74] Besides the values 0 (false) and 1 (true), the need_resched field of a swapper kernel thread (PID 0) in a multiprocessor system can also assume the value -1; see Section 11.2.2.6 later in this chapter for details.
[75] Recall that in the 80 × 86 architecture, 1 tick corresponds to roughly 10 ms (see Section 6.1.3). In all architectures, however, the formula that computes the number of ticks in a quantum is adapted so the default quantum has an order of magnitude of 50 ms.
[76] For the mathematically inclined, here is a sketch of the proof: when a new epoch starts, the value of counter is bounded by half of the previous value of counter plus P, which is the maximum value that can be added to counter. If nice is set to -20, then P is equal to 11 ticks. Solving the recurrence equation yields as upper bound the geometric series P × (1 + 1/2 + 1/4 + 1/8 + ...), which converges to 2 × P (that is, 22 ticks).
[77] The uptime
program returns the system load for the past 1, 5, and 15
minutes. The same information can be obtained by reading the
/proc/loadavg
file.
[78] The Linux kernel has been modified in several ways so it can handle a few hard real-time jobs if they remain short. Basically, hardware interrupts are trapped and kernel execution is monitored by a kind of “superkernel.” These changes do not make Linux a true real-time system, though.