Linux must carry on several time-related activities. For instance, the kernel periodically:
Updates the time elapsed since system startup.
Updates the time and date.
Determines, for every CPU, how long the current process has been running, and preempts it if it has exceeded the time allocated to it. The allocation of time slots (also called quanta) is discussed in Chapter 11.
Updates resource usage statistics.
Checks whether the interval of time associated with each software timer (see Section 6.6 later in this chapter) has elapsed.
Linux’s timekeeping architecture is the set of kernel data structures and functions related to the flow of time. Actually, Intel-based multiprocessor machines have a timekeeping architecture that is slightly different from the timekeeping architecture of uniprocessor machines:
In a uniprocessor system, all time-keeping activities are triggered by interrupts raised by the Programmable Interval Timer.
In a multiprocessor system, all general activities (like handling of software timers) are triggered by the interrupts raised by the PIT, while CPU-specific activities (like monitoring the execution time of the currently running process) are triggered by the interrupts raised by the local APIC timers.
Unfortunately, the distinction between the two cases is somewhat blurred. For instance, some early SMP systems based on Intel 80486 processors didn’t have local APICs. Even nowadays, there are SMP motherboards so buggy that local timer interrupts are not usable at all. In these cases, the SMP kernel must resort to the UP timekeeping architecture. On the other hand, recent uniprocessor systems have a local APIC and an I/O APIC, so the kernel may use the SMP timekeeping architecture. Another significant case arises when an SMP-enabled kernel is running on a uniprocessor machine. However, to simplify our description, we won’t discuss these hybrid cases and will stick to the two “pure” timekeeping architectures.
Linux’s timekeeping architecture depends also on the availability of the Time Stamp Counter (TSC). The kernel uses two basic timekeeping functions: one to keep the current time up to date and another to count the number of microseconds that have elapsed within the current second. There are two different ways to obtain the latter value: a more precise method is available if the CPU has a Time Stamp Counter, and a less precise method is used otherwise (see Section 6.7.1 later in this chapter).
In a uniprocessor system, all time-related activities are triggered by the interrupts raised by the Programmable Interval Timer on IRQ line 0. As usual, in Linux, some of these activities are executed as soon as possible after the interrupt is raised (in the “top half” of the interrupt handler), while the remaining activities are delayed (in the “bottom half” of the interrupt handler).
The time_init( ) function sets up the interrupt gate corresponding to IRQ 0 during kernel setup. Once this is done, the handler field of IRQ 0’s irqaction descriptor contains the address of the timer_interrupt( ) function. This function starts running with the interrupts disabled, since the status field of IRQ 0’s main descriptor has the SA_INTERRUPT flag set. It performs the following steps:
If the CPU has a TSC register, it performs the following substeps:
Executes an rdtsc assembly language instruction to store the 32 least-significant bits of the TSC register in the last_tsc_low variable.
Reads the state of the 8254 chip device internal oscillator and computes the delay between the timer interrupt occurrence and the execution of the interrupt service routine.[47]
Stores that delay (in microseconds) in the delay_at_last_interrupt variable; as we shall see in Section 6.7.1, this variable is used to provide the correct time to user processes.
It invokes do_timer_interrupt( ).
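The delay computed in the TSC substeps above can be sketched in user space as follows. This is only an illustration, not the kernel's code: the helper name pit_count_to_delay_us is hypothetical, and the constants assume a PIT input clock of 1,193,180 Hz and 100 ticks per second (one tick every 10 ms).

```c
#include <stdint.h>

/* Assumed constants for this sketch (not taken from the text above). */
#define HZ              100u                               /* ticks per second */
#define CLOCK_TICK_RATE 1193180u                           /* assumed PIT input clock, Hz */
#define LATCH           ((CLOCK_TICK_RATE + HZ / 2) / HZ)  /* PIT reload value */
#define TICK_SIZE_US    (1000000u / HZ)                    /* one tick, in microseconds */

/*
 * Hypothetical helper: convert the value read from the PIT down-counter
 * into the number of microseconds elapsed since IRQ 0 was raised.
 * The counter runs from LATCH - 1 down to 0, so the caller must pass
 * a value smaller than LATCH.
 */
static uint32_t pit_count_to_delay_us(uint32_t count)
{
    uint32_t elapsed = (LATCH - 1u) - count;   /* PIT clock periods elapsed */

    /* Scale to microseconds, rounding to the nearest integer. */
    return (elapsed * TICK_SIZE_US + LATCH / 2u) / LATCH;
}
```

With these assumed constants, a counter value just after the reload gives a delay of 0 microseconds, while a value close to 0 gives a delay close to a full tick.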
do_timer_interrupt( ), which may be considered the PIT’s interrupt service routine common to all 80x86 models, essentially executes the following operations:
It invokes the do_timer( ) function, which is fully explained shortly.
If the timer interrupt occurred in Kernel Mode, it invokes the x86_do_profile( ) function (see Section 6.5.3 later in this chapter).
If an adjtimex( ) system call has been issued, it invokes the set_rtc_mmss( ) function once every 660 seconds (every 11 minutes) to adjust the Real Time Clock. This feature helps systems on a network synchronize their clocks (see Section 6.7.2 later in this chapter).
The do_timer( ) function, which runs with the interrupts disabled, must be executed as quickly as possible. For this reason, it simply updates one fundamental value, the time elapsed since system startup, and checks whether the running process has exhausted its time quantum, while delegating all remaining activities to the TIMER_BH bottom half.
The function is equivalent to:
void do_timer(struct pt_regs * regs)
{
    jiffies++;
    update_process_times(user_mode(regs)); /* UP only */
    mark_bh(TIMER_BH);
    if (TQ_ACTIVE(tq_timer))
        mark_bh(TQUEUE_BH);
}
The jiffies global variable stores the number of elapsed ticks since the system was started. It is set to 0 during kernel initialization and incremented by 1 when a timer interrupt occurs, that is, on every tick. Since jiffies is a 32-bit unsigned integer, it returns to 0 about 497 days after the system has been booted. However, the kernel is smart enough to handle the overflow without getting confused.
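The 497-day figure and the idea of a wraparound-safe comparison can be checked with a small user-space sketch. It assumes 100 ticks per second (one tick every 10 ms); the helper names days_until_wrap and tick_after are hypothetical, and the comparison is written in the spirit of the kernel's time_after( ) macro rather than copied from it.

```c
#include <stdint.h>

#define HZ 100u   /* assumed tick frequency: one tick every 10 ms */

/* Days until a 32-bit tick counter wraps around: 2^32 ticks / HZ / 86400. */
static double days_until_wrap(void)
{
    return 4294967296.0 / HZ / 86400.0;
}

/*
 * Hypothetical helper: nonzero if tick value a is later than tick value b,
 * even when the counter wrapped between the two readings. The result is
 * meaningful only if the two values are less than 2^31 ticks apart.
 */
static int tick_after(uint32_t a, uint32_t b)
{
    /* The unsigned difference is reinterpreted as a signed distance. */
    return (int32_t)(b - a) < 0;
}
```

With these assumptions the counter wraps after roughly 497.1 days, and a reading of 5 taken shortly after the wrap still compares as later than a reading of 0xFFFFFFF0 taken shortly before it.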
The update_process_times( ) function essentially checks how long the current process has been running; it is described in Section 6.3 later in this chapter.
Finally, do_timer( ) activates the TIMER_BH bottom half; if the tq_timer task queue is not empty (see Section 4.7), the function also activates the TQUEUE_BH bottom half.
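The division of labor in do_timer( ) can be imitated in user space. The sketch below is a simplification under stated assumptions: bottom halves are modeled as bits in a single pending mask, the tq_timer_pending flag stands in for the test on the tq_timer task queue, the per-process accounting step is left as a comment, and all names are local to the sketch.

```c
#include <stdint.h>

/* Hypothetical bottom-half indices for this sketch. */
enum { TIMER_BH = 0, TQUEUE_BH = 1 };

static uint32_t jiffies;          /* ticks since (simulated) startup */
static uint32_t bh_active;        /* one pending bit per bottom half */
static int tq_timer_pending;      /* stand-in for "tq_timer is not empty" */

/* Flag a bottom half as pending; it runs later, outside the top half. */
static void mark_bh(int nr)
{
    bh_active |= 1u << nr;
}

/* Simulated top half: do the minimum now and defer everything else. */
static void do_timer_sketch(void)
{
    jiffies++;
    /* A uniprocessor kernel would call update_process_times() here. */
    mark_bh(TIMER_BH);
    if (tq_timer_pending)
        mark_bh(TQUEUE_BH);
}
```

Each simulated tick increments the counter and sets the TIMER_BH bit; the TQUEUE_BH bit is set only when the stand-in flag reports queued work.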
Each invocation of the “top half” of the PIT’s timer interrupt handler marks the TIMER_BH bottom half as active. As soon as the kernel leaves interrupt mode, the timer_bh( ) function, which is associated with TIMER_BH, starts:
void timer_bh(void)
{
    update_times();
    run_timer_list();
}
The update_times( ) function updates the system date and time and computes the current system load; these activities are discussed later in Section 6.4 and Section 6.5. The run_timer_list( ) function takes care of software timer handling; it is discussed in Section 6.6 later in this chapter.
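How a marked bottom half eventually gets to run can be sketched with a small dispatcher. This is a user-space illustration only: the handler table, the pending mask, and run_pending_bh are hypothetical, and update_times and run_timer_list are reduced to counters so that the order of events can be observed.

```c
#include <stdint.h>

enum { TIMER_BH = 0, NR_BH = 2 };             /* hypothetical indices */

static void (*bh_base[NR_BH])(void);          /* handler of each bottom half */
static uint32_t bh_active;                    /* pending bits set by the top half */
static int update_times_calls;
static int run_timer_list_calls;

static void update_times(void)   { update_times_calls++; }    /* stub */
static void run_timer_list(void) { run_timer_list_calls++; }  /* stub */

/* Same shape as the function shown in the text. */
static void timer_bh(void)
{
    update_times();
    run_timer_list();
}

/* Run each pending bottom half once, clearing its pending bit first. */
static void run_pending_bh(void)
{
    for (int nr = 0; nr < NR_BH; nr++) {
        if ((bh_active & (1u << nr)) && bh_base[nr]) {
            bh_active &= ~(1u << nr);
            bh_base[nr]();
        }
    }
}
```

A caller would register timer_bh in slot TIMER_BH, set the corresponding pending bit from the simulated interrupt handler, and invoke run_pending_bh once interrupt handling is over; nothing runs while no bit is pending.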
In multiprocessor systems, timer interrupts raised by the Programmable Interval Timer still play an important role. Indeed, the corresponding interrupt handler takes care of activities not related to a specific CPU, such as the handling of software timers and keeping the system time up to date. As in the uniprocessor case, the most urgent activities are performed by the “top half” of the interrupt handler (see Section 6.2.1.1 earlier in this chapter), while the remaining activities are delayed until the execution of the TIMER_BH bottom half (see Section 6.2.1.2 earlier in this chapter).
However, the SMP version of the PIT’s interrupt service routine differs from the UP version in a few points:
The timer_interrupt( ) function acquires the xtime_lock read/write spin lock for writing. Although local interrupts are disabled, the kernel must protect the xtime, last_tsc_low, and delay_at_last_interrupt global variables from concurrent read and write accesses performed by other CPUs (see Section 6.4 later in this chapter).
The do_timer_interrupt( ) function does not invoke the x86_do_profile( ) function because this function performs actions related to a specific CPU.
The do_timer( ) function does not invoke update_process_times( ) because this function also performs actions related to a specific CPU.
There are two timekeeping activities related to every specific CPU in the system:
Monitoring how much time the current process has been running on the CPU
Updating the resource usage statistics of the CPU
To simplify the overall timekeeping architecture, in Linux 2.4, every CPU takes care of these activities in the handler of the local timer interrupt raised by the APIC device embedded in the CPU. In this way, the number of accessed spin locks is minimized, since every CPU tends to access only its own “private” data structures.
During kernel initialization, each APIC has to be told how often to generate a local timer interrupt. The setup_APIC_clocks( ) function programs the local APICs of all CPUs to generate interrupts as follows:
void setup_APIC_clocks(void)
{
    __cli();
    calibration_result = calibrate_APIC_clock();
    setup_APIC_timer((void *)calibration_result);
    __sti();
    smp_call_function(setup_APIC_timer, (void *)calibration_result, 1, 1);
}
The calibrate_APIC_clock( ) function computes how many bus clock signals are received by the local APIC of the booting CPU during a tick (10 ms). This exact value is then used to program the local APICs in such a way as to generate one local timer interrupt every tick. This is done by the setup_APIC_timer( ) function, which is invoked directly on the booting CPU, and through CALL_FUNCTION_VECTOR Interprocessor Interrupts (IPIs) on the other CPUs (see Section 4.6.2).
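As a rough illustration of the calibration arithmetic, the sketch below assumes that the local APIC timer counter is decremented once every divisor bus clocks, so the count to program for one interrupt per tick is the bus frequency divided by the divisor and by the tick rate. The function name and the sample figures are assumptions made for illustration; the real function measures the value against the PIT instead of taking the bus frequency as an input.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of APIC timer counts in one tick, given the
 * bus clock frequency in Hz, the timer divisor, and the ticks per second.
 * Both divisor and hz must be nonzero.
 */
static uint32_t apic_counts_per_tick(uint32_t bus_hz, uint32_t divisor, uint32_t hz)
{
    return bus_hz / divisor / hz;
}
```

For example, with an assumed 100 MHz bus clock, a divisor of 16, and 100 ticks per second, the function returns 62500.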
All local APIC timers are synchronized because they are based on the common bus clock signal. This means that the value computed by calibrate_APIC_clock( ) for the booting CPU is good also for the other CPUs in the system. However, we don’t really want to have all local timer interrupts generated at exactly the same time because this could induce a substantial performance penalty due to waits on spin locks. For the same reason, a local timer interrupt handler should not run on a CPU when a PIT’s timer interrupt handler is being executed on another CPU.
Therefore, the setup_APIC_timer( ) function spreads the local timer interrupts inside the tick interval. Figure 6-1 shows an example. In a multiprocessor system with four CPUs, the beginning of the tick is marked by the PIT’s timer interrupt. Two milliseconds after the PIT’s timer interrupt, the local APIC of CPU 0 raises its local timer interrupt; two milliseconds later, it is the turn of the local APIC of CPU 1, and so on. Two milliseconds after the local timer interrupt of CPU 3, the PIT raises another timer interrupt on the IRQ 0 line and starts a new tick.
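The spacing in this example follows from cutting the tick into one more slice than there are CPUs. A minimal sketch of that arithmetic, with a hypothetical helper name and times expressed in microseconds:

```c
#include <stdint.h>

/*
 * Hypothetical helper: offset, in microseconds from the PIT interrupt that
 * starts a tick, at which the local timer of CPU number cpu fires.
 * The tick is cut into ncpus + 1 equal slices: the PIT interrupt sits at
 * offset 0 and CPU n fires n + 1 slices later.
 */
static uint32_t local_timer_offset_us(uint32_t tick_us, uint32_t ncpus, uint32_t cpu)
{
    uint32_t slice = tick_us / (ncpus + 1u);

    return slice * (cpu + 1u);
}
```

With a 10,000-microsecond tick and four CPUs, the slice is 2,000 microseconds, so CPU 0 fires at 2,000, CPU 1 at 4,000, CPU 2 at 6,000, and CPU 3 at 8,000, which matches the two-millisecond spacing described above.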
setup_APIC_timer( ) programs the local APIC in such a way as to raise timer interrupts that have vector LOCAL_TIMER_VECTOR (usually, 0xef); moreover, the init_IRQ( ) function associates LOCAL_TIMER_VECTOR to the low-level interrupt handler apic_timer_interrupt( ).
The apic_timer_interrupt( ) assembly language function is equivalent to the following code:
apic_timer_interrupt:
    pushl $LOCAL_TIMER_VECTOR-256
    SAVE_ALL
    movl %esp,%eax
    pushl %eax
    call smp_apic_timer_interrupt
    addl $4,%esp
    jmp ret_from_intr
As you can see, the low-level handler is very similar to the other low-level interrupt handlers already described in Chapter 4. The high-level interrupt handler called smp_apic_timer_interrupt( ) executes the following steps:
Gets the CPU logical number (say n)
Increments the nth entry of the apic_timer_irqs array by 1 (see Section 6.5.4 later in this chapter)
Acknowledges the interrupt on the local APIC
Calls the irq_enter( ) function to increment the nth entry of the local_irq_count array and to honor the global_irq_lock spin lock (see Chapter 5)
Invokes the smp_local_timer_interrupt( ) function
Calls the irq_exit( ) function to decrement the nth entry of the local_irq_count array
Invokes do_softirq( ) if some softirqs are pending (see Section 4.7.1)
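The steps above can be walked through in a user-space sketch. Everything here is a stand-in: the arrays are plain counters indexed by a CPU number passed as an argument, the acknowledgment is an empty function, the global_irq_lock handling is omitted, and the helper names are hypothetical.

```c
#define NR_CPUS 4   /* assumed CPU count for this sketch */

/* Per-CPU counters, indexed by logical CPU number. */
static unsigned int apic_timer_irqs[NR_CPUS];    /* local timer interrupts seen */
static unsigned int local_irq_count[NR_CPUS];    /* interrupt nesting depth */
static unsigned int softirq_pending[NR_CPUS];    /* nonzero if softirqs wait */
static unsigned int local_timer_runs[NR_CPUS];   /* per-CPU timekeeping calls */
static unsigned int softirq_runs[NR_CPUS];       /* softirq passes executed */

/* Real code would signal end-of-interrupt to the local APIC here. */
static void ack_local_apic(void) { }

/* Stand-in for the per-CPU timekeeping work. */
static void local_timer_work(int cpu) { local_timer_runs[cpu]++; }

/* Stand-in for running the deferred work of one CPU. */
static void run_softirqs(int cpu)
{
    softirq_pending[cpu] = 0;
    softirq_runs[cpu]++;
}

/* Hypothetical walk through the listed steps for the given CPU. */
static void apic_timer_interrupt_sketch(int cpu)
{
    apic_timer_irqs[cpu]++;        /* count the interrupt */
    ack_local_apic();              /* acknowledge it on the local APIC */
    local_irq_count[cpu]++;        /* entering interrupt context */
    local_timer_work(cpu);         /* per-CPU timekeeping */
    local_irq_count[cpu]--;        /* leaving interrupt context */
    if (softirq_pending[cpu])      /* run deferred work, if any */
        run_softirqs(cpu);
}
```

Because each CPU touches only its own array entries, the sketch also shows why this path needs so little locking.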
The smp_local_timer_interrupt( ) function executes the per-CPU timekeeping activities. Specifically, it performs the following steps:
Invokes the x86_do_profile( ) function if the timer interrupt occurred in Kernel Mode (see Section 6.5.3 later in this chapter)
Invokes the update_process_times( ) function to check how long the current process has been running (see Section 6.6 later in this chapter)[48]
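The two steps above, together with the once-per-tick behavior described in footnote 48, can be sketched with a per-CPU countdown. The structure and function names are hypothetical, and the two kernel functions are replaced by counters; the sketch assumes that the local timer fires multiplier times per tick when the profiler sampling rate has been raised.

```c
/* Hypothetical per-CPU state for this sketch. */
struct local_timer_state {
    unsigned int multiplier;     /* local timer interrupts per tick (>= 1) */
    unsigned int countdown;      /* interrupts left before the tick ends */
    unsigned int profile_hits;   /* stand-in for x86_do_profile() calls */
    unsigned int tick_updates;   /* stand-in for update_process_times() calls */
};

static void local_timer_init(struct local_timer_state *s, unsigned int multiplier)
{
    s->multiplier = multiplier ? multiplier : 1;
    s->countdown = s->multiplier;
    s->profile_hits = 0;
    s->tick_updates = 0;
}

/*
 * Called once per local timer interrupt. kernel_mode is nonzero if the
 * interrupt arrived while the CPU was running in Kernel Mode.
 */
static void local_timer_interrupt_sketch(struct local_timer_state *s, int kernel_mode)
{
    if (kernel_mode)
        s->profile_hits++;       /* the profiler samples every interrupt */

    if (--s->countdown == 0) {   /* a full tick has elapsed */
        s->countdown = s->multiplier;
        s->tick_updates++;       /* process accounting runs once per tick */
    }
}
```

With a multiplier of 4, eight interrupts in Kernel Mode produce eight profiler samples but only two accounting updates, one per tick.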
[47] The 8254 oscillator drives a counter that is continuously decremented. When the counter becomes 0, the chip raises an IRQ 0. Thus, reading the counter indicates how much time has elapsed since the interrupt occurred.
[48] The system administrator can change the sample frequency of the kernel code profiler. To do this, the kernel changes the frequency at which local timer interrupts are generated. However, the smp_local_timer_interrupt( ) function keeps invoking the update_process_times( ) function exactly once every tick. Unfortunately, changing the frequency of a local timer interrupt destroys the elegant spreading of the local timer interrupts inside a tick interval.