Many features of the
Linux kernel are critically dependent on the passage of time. The Linux kernel makes use of different timers provided by hardware to provide time-dependent services such as
busy wait and
sleep wait. The kernel also facilitates scheduling for functions that desire to run after some specified time has elapsed.
In this month’s “Gearheads” column, let’s explore the semantics of kernel variables like
jiffies, HZ, loops_per_jiffy, and
xtime, learn how to measure execution times using the
Pentium Time Stamp Counter (TSC), and use the
Real-Time Clock (RTC) to perform periodic work from an application with microsecond precision.
Of HZ and Jiffies
System timers interrupt the processor, or
pop, at programmable frequencies. This frequency, or the number of timer ticks per second, is contained in the kernel variable
HZ. Choosing a value for
HZ is a trade-off: a large
HZ value results in finer timer granularity and better resolution. However, high values of
HZ also result in larger overhead, since more cycles are spent executing the timer interrupt handler.
The value of
HZ is architecture-dependent. On
x86 systems, the
2.4 kernels set
HZ to 100, but with
2.6, this value was changed to 1000. On ARM-based platforms, 2.6 kernels use a
HZ value of 100.
The
jiffies variable holds the number of times the system timer popped since the system booted. The kernel increments
jiffies
HZ times every second. Thus, on a kernel with a
HZ value of 100, a “jiffy” is a 10-millisecond duration, while on a kernel with
HZ set to 1000, a jiffy is only 1 millisecond.
To better understand
HZ and
jiffies, consider the following code snippet from the
IDE driver (
drivers/ide/ide.c) that polls disk drives for busy status:
unsigned long timeout = jiffies + (3 * HZ);

while (hwgroup->busy) {
    /* … */
    if (time_after (jiffies, timeout)) {
        return -EBUSY;
    }
    /* continue your work */
}
return SUCCESS;
The snippet returns
SUCCESS if the busy condition gets cleared in less than 3 seconds, and
-EBUSY otherwise.
3*HZ is the number of
jiffies present in 3 seconds. The calculated timeout value (
jiffies+3*HZ) is thus the value that will be contained in
jiffies after the 3-second timeout has elapsed. The
time_after() macro compares the current value of jiffies with the requested timeout, taking care to account for wrap-arounds due to overflows. Related functions available for doing similar comparisons are
time_before(),
time_before_eq() and
time_after_eq().
jiffies is defined as
volatile, which asks the compiler not to optimize access to the variable. This ensures that
jiffies (which is updated by the timer interrupt handler during each tick) is re-read during each pass through the loop.
For the reverse conversion from jiffies to seconds, have a look at this snippet from the SCSI driver (
drivers/block/scsi_ioctl.c):
start_time = jiffies;
blk_execute_rq (q, bd_disk, rq);
/* … */
hdr->duration = ((jiffies - start_time) * 1000) / HZ;
Here,
hdr->duration contains the amount of time (in milliseconds) that it took for a
SCSI command to execute.
(jiffies-start_time) is the number of jiffies that elapsed while the command was executing. The division by
HZ converts that value into seconds, while the multiplication by 1000 changes the measure to milliseconds. (The multiplication is performed first so that the integer division by HZ doesn’t discard precision.)
Delaying Execution for Long Durations
In kernel terms, delays in the order of jiffies can be considered long durations. A possible, but non-optimal way to accomplish long delays is by busy waiting. A function that busy waits has a “dog-in-the-manger” attitude. It neither uses the processor for doing useful work, nor does it let others use it.
For example, the following code hogs the processor for one second:
unsigned long timeout = jiffies + HZ;
while (time_before (jiffies, timeout));
A better approach is to sleep wait, because it can yield the processor to others while waiting for the time delay to elapse. This can be done using
schedule_timeout():
set_current_state (TASK_INTERRUPTIBLE);
schedule_timeout (HZ); /* Relative timeout: sleep for about one second */
Note that schedule_timeout() takes the number of jiffies to wait (a relative value, not an absolute jiffies target), and that the task state must be set beforehand; otherwise schedule_timeout() returns immediately.
The delay guarantee is only on the lower bound of the timeout. Whether from kernel space or from user space, it is difficult to get more precise control over timeouts than the granularity of
HZ, since process time slices are updated by the scheduler only during timer ticks. Also, even if your process enters the run queue after the specified timeout, the scheduler can decide to pick another process from the queue based on priorities.
Two other functions that facilitate sleep waiting are
wait_event_timeout() and
msleep(). Both are implemented with the help of the
schedule_timeout() function.
wait_event_timeout() can be used if your code desires to resume execution either if a specified condition becomes true or if a timeout occurs.
msleep() sleeps for the specified number of milliseconds.
These delay techniques are suitable for use only from process context. Busy waiting for long durations is possible from interrupt context, but is considered a mortal sin. Equally taboo is busy waiting with interrupts disabled. Sleep waiting cannot be done from interrupt context since interrupt handlers are not allowed to schedule or sleep.
Doing Work in the Future
The kernel also provides programming interfaces to execute a function at some point in the future. You can populate a
timer_list with the address and parameters of your function and register it using
add_timer(), as shown in
Listing One.
#include <linux/timer.h>
struct timer_list my_timer;
/* … */
init_timer (&my_timer);
my_timer.expires = jiffies + n*HZ; /* n is the number of seconds */
my_timer.data = func_parameter; /* Parameter to be passed to timer_func */
my_timer.function = timer_func; /* Function to execute after n seconds */
add_timer (&my_timer);
/* … */
timer_func() runs
once with parameter
func_parameter. If you’d like to run
timer_func() periodically, you can repeat the code in
Listing One in
timer_func() itself to reschedule the function over and over. The new function is shown in
Listing Two.
static void timer_func (unsigned long func_parameter)
{
    /* Do work to be done periodically */
    /* … */
    init_timer (&my_timer);
    my_timer.expires = jiffies + n*HZ;
    my_timer.data = func_parameter;
    my_timer.function = timer_func;
    add_timer (&my_timer);
}
You can use
mod_timer() to change the expiration of
my_timer, and
del_timer() to cancel
my_timer. If you look at
kernel/timer.c, you can see that the
schedule_timeout() function internally uses these same APIs.
Delaying Execution for Small Durations
In kernel terms, sub-jiffy delays qualify as small durations. Such delays are commonly requested from both process and interrupt contexts.
Since it’s not possible to use jiffy-based methods to implement sub-jiffy delays, the methods discussed in the previous section to sleep wait cannot be used for small timeouts. The only solution is to busy wait.
The kernel APIs to implement short delays are
mdelay(),
udelay(), and
ndelay(), which support millisecond, microsecond and nanosecond delays, respectively. The actual implementations of these functions are architecture-specific and need not be supported on all platforms.
Busy waiting for short durations is accomplished by measuring the time that the processor takes to execute an instruction and looping for the necessary number of iterations. The busy loop calibration is done during system boot and the value is stored in a variable named
loops_per_jiffy. The short-delay APIs use the
loops_per_jiffy value to decide the number of times they need to busy loop. To achieve a one-microsecond delay during a handshake process, the USB host controller driver,
drivers/usb/host/ehci-hcd.c, uses
udelay():
do {
    result = readl (ptr);
    /* … */
    if (result == done) return 0;
    udelay (1); /* Internally uses loops_per_jiffy */
    usec--;
} while (usec > 0);
The Pentium Time Stamp Counter
The Time Stamp Counter (TSC) is a 64-bit timer register in Pentium-compatible processors that counts the number of clock cycles consumed by the processor. Since the TSC gets incremented at the rate of the processor cycle speed, it provides a high-resolution timer and is commonly used for profiling and instrumenting code. It can be accessed using the
rdtsc instruction to measure execution time of intervening code with microsecond precision. TSC ticks can be converted to seconds by dividing by the CPU clock speed, which can be read from the kernel variable,
cpu_khz.
In
Listing Three, the
low_tsc_ticks variables contain the lower 32 bits of the TSC, while the
high_tsc_ticks variables contain the higher 32 bits. The lower 32 bits overflow in a few seconds, depending on your processor speed, but are sufficient for many code instrumentation purposes.
unsigned long low_tsc_ticks0, high_tsc_ticks0;
unsigned long low_tsc_ticks1, high_tsc_ticks1;
unsigned long exec_time;
rdtsc (low_tsc_ticks0, high_tsc_ticks0); /* timestamp before */
printk ("Hello World\n"); /* Code to be profiled */
rdtsc (low_tsc_ticks1, high_tsc_ticks1); /* timestamp after */
exec_time = low_tsc_ticks1 - low_tsc_ticks0;
On a 1.8 GHz Pentium,
exec_time measured 871 (or half a microsecond).
User space functions like
nanosleep(),
clock_settime(), and
clock_gettime() can be used to access kernel timer services from user space. When used in tandem with real time scheduling,
nanosleep() busy waits for requested delays of less than two milliseconds, using
mdelay(). For larger delays,
nanosleep() sleep waits. The
getitimer() and
setitimer() functions deliver a signal to the calling process when a specified timeout expires. The high-resolution timer patch (not part of the mainstream kernel) makes use of the TSC in Pentium-class machines to support high-precision for many of these functions.
The Real Time Clock
The Real Time Clock (RTC) tracks absolute time in nonvolatile memory. On x86 PCs, RTC registers constitute the top few locations of the CMOS. If you enable
CONFIG_RTC during kernel configuration, the RTC can be accessed via the character device,
/dev/rtc. Only one process is allowed to access this device at a time.
The RTC can be used to:
* Read and set the absolute clock and generate interrupts during clock updates.
* Generate periodic interrupts with frequencies ranging from 2 Hz to 8192 Hz.
* Set an alarm for a future time.
Many applications need the concept of absolute time or
wall time.
jiffies does not serve this purpose since it is relative to the time when the system booted. The kernel maintains wall time in a variable called
xtime. During boot up,
xtime is initialized to the current wall time by reading the RTC. When the system halts, the wall time is written back to the RTC.
You can use the
do_gettimeofday() routine to read the wall time with the highest resolution supported by the hardware. This is shown in
Listing Four.
#include <linux/time.h>
static struct timeval curr_time;
do_gettimeofday (&curr_time);
my_timestamp = cpu_to_le32 (curr_time.tv_sec); /* Record timestamp */
There are a number of functions available to user code to access wall time, including:
* time(), which returns the calendar time, or the number of seconds since the “Epoch” (00:00:00 on January 1, 1970).
* localtime(), which returns the calendar time in a decomposed format (see the definition of struct tm in time.h).
* mktime(), which does the reverse of localtime().
* gettimeofday(), which can return the calendar time with microsecond precision, if your platform supports it.
Doing Periodic Work With Microsecond Precision
Many real-time applications must perform some work periodically in a time bound manner. The RTC is a timer source that can generate periodic interrupts with microsecond precision.
Listing Five shows an example that reads
/dev/rtc to get regular interrupt reports. The Pentium TSC is used to measure response times.
The real time nature of data collection can be affected due to preemptive scheduling. To guard against this indeterminism, the program in
Listing Five changes its scheduling policy to
SCHED_FIFO using the
sched_setscheduler() function. (
SCHED_OTHER is the default scheduling policy used by most processes.)
SCHED_FIFO is a real time scheduling policy and always preempts a currently running
SCHED_OTHER process. (The third scheduling policy supported on Linux is called
SCHED_RR, or Round Robin scheduling, which is an enhancement of
SCHED_FIFO.)
SCHED_FIFO and
SCHED_RR cannot prevent latencies due to interrupts.
Next, the
mlockall() routine is invoked to lock all mapped pages in memory to ensure that swapping won’t interfere with deterministic timing. Only superuser processes are allowed to use
mlockall() or change the scheduling policy.
#include <stdio.h>
#include <linux/rtc.h>
#include <sys/ioctl.h>
#include <sys/time.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
/* Read the Pentium Time Stamp Counter
 * using the rdtsc instruction
 */
#define rdtscll(val) __asm__ __volatile__ ("rdtsc" : "=A" (val))
int main ()
{
long long ts0,ts1; /* Store TSC ticks */
struct sched_param sched_p; /* Scheduling Policy related information */
long now, worst, data;
int fd, i=0;
/* Change the scheduling policy to SCHED_FIFO */
sched_getparam (getpid(), &sched_p);
sched_p.sched_priority = 50;
sched_setscheduler (getpid(), SCHED_FIFO, &sched_p);
/* Lock all mapped pages in memory so swapping can't interfere */
mlockall (MCL_CURRENT);
/* Open the RTC device */
fd = open ("/dev/rtc", O_RDONLY);
/* Set the periodic interrupt frequency to 8192 Hz.
 * This gives one interrupt roughly every 122 microseconds.
 */
ioctl (fd, RTC_IRQP_SET, 8192);
/* Turn ON periodic interrupt */
ioctl (fd, RTC_PIE_ON, 0);
rdtscll (ts0);
worst = 0;
while (i++ < 10000) {
    /* Block till the next periodic interrupt occurs */
    read (fd, &data, sizeof (unsigned long));
    /* Use the TSC to precisely measure the time consumed.
     * Reading the lower half of the TSC is sufficient */
    rdtscll (ts1);
    now = (ts1-ts0);
    /* Update the worst case latency */
    if (now > worst) worst = now;
    ts0 = ts1;
    /* Do the work that is to be done periodically */
}
printf ("Worst latency was %8ld TSC ticks\n", worst);
/* Turn OFF periodic interrupts */
ioctl (fd, RTC_PIE_OFF, 0);
}
The code in
Listing Five loops for 10,000 iterations and prints the worst case latency that occurred during execution. On a 1.8 GHz Pentium, the output was 240,899 TSC ticks, which roughly corresponds to 133 microseconds. According to the data sheet of the RTC chipset, a timer frequency of 8,192 Hz should produce a periodic interrupt every 122 microseconds, so that looks close enough. You can run the code using
SCHED_OTHER instead of
SCHED_FIFO and see the resultant drift.
Looking at the Sources
Timer services in the kernel consist of architecture-dependent portions that live in the
arch/…/kernel/ subdirectory, and generic portions defined in
kernel/timer.c. For related definitions, look at the header files,
include/linux/time*.h.
jiffies is defined in
linux/jiffies.h. The value for
HZ is processor-dependent and can be found in
include/asm-…/param.h. Look at
drivers/char/rtc.c for the Real Time Clock driver. To see how the kernel calibrates
loops_per_jiffy, look at
init/calibrate.c.
(http://www.linux-mag.com/id/2272/)