This section covers the internals of interrupt handling in the Linux kernel (all explanations relate to the i386 platform). This section is under development and may be incomplete right now.
I will cover the following topics, explaining both the hardware and the software side: how interrupts are generated, routed, and then handled by the low-level code of the Linux kernel.
Introduction
Interrupt Routing
Details of Programmable Interrupt Controller
Details of Interrupt Descriptor Table
Task Gates
Trap Gates
Interrupt Gates
Hardware Checks for Interrupts and Exceptions
Linux Kernel support for Handling Interrupts - Details of do_IRQ() function, core of Interrupt Handling
This section discusses interrupt handling from the hardware perspective of the CPU, the Linux kernel's interrupt routing subsystem, and the role of device drivers in interrupt handling.
The term interrupt is largely self-explanatory: interrupts are signals sent to the CPU over the INTR line (the connection to the CPU), issued whenever a device wants the CPU's attention. As soon as the interrupt signal arrives, the CPU defers its current activity and services the interrupt by executing the interrupt handler corresponding to that interrupt number (also known as the IRQ number).
Interrupts can be classified as follows:
- Synchronous interrupts (also known as software interrupts)
- Asynchronous interrupts (also known as hardware interrupts)
The basic difference between the two is where they come from. Synchronous interrupts are generated by the CPU itself, either when its control unit detects an abnormal condition or when it executes one of the special instructions such as 'int' or 'int3'; these are known as exceptions in Intel's terminology. Asynchronous interrupts, on the other hand, are generated by the outside world (the devices connected to the CPU). Because they can occur at any point in time, they are called asynchronous interrupts.
It's important to note that both synchronous and asynchronous interrupts are handled by the CPU on completion of the instruction during which the interrupt occurs. A machine instruction does not execute in a single CPU cycle; it takes several cycles to complete. An interrupt that occurs part-way through the execution of an instruction is not handled immediately. Rather, the CPU handles it after the instruction completes.
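A toy model can make this timing concrete. The sketch below is only an illustration, not a description of real CPU internals: the cycle counts and the function name interrupt_recognised_at are assumptions invented for this example.

```c
#include <stddef.h>

/*
 * Toy model (hypothetical): instruction i takes cycles_per_insn[i] cycles.
 * Returns the cycle at which an interrupt raised at cycle `raised_at` is
 * recognised, which is the end of the instruction executing at that moment.
 * Returns -1 if the interrupt is raised after the last instruction ended.
 */
long interrupt_recognised_at(const unsigned *cycles_per_insn, size_t n,
                             long raised_at)
{
    long now = 0; /* cycle at which the current instruction starts */

    for (size_t i = 0; i < n; i++) {
        long end = now + (long)cycles_per_insn[i];

        if (raised_at < end)  /* raised before or during this instruction */
            return end;       /* handled only once the instruction is done */
        now = end;
    }
    return -1;
}
```

With instructions of 1, 4 and 2 cycles, an interrupt raised at cycle 2 falls inside the second instruction and is only recognised when that instruction finishes.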
There are a few things we always expect the CPU to do when an interrupt occurs. Whenever an interrupt occurs, the CPU performs some hardware checks that are required to keep the system secure. Before discussing these checks, we will explain how interrupts are routed to the CPU from the hardware devices.
Details of Programmable Interrupt Controller
On the Intel architecture, system devices (device controllers) are connected to a special device known as the PIC (Programmable Interrupt Controller). The CPU has two lines for receiving interrupt signals: NMI and INTR. The NMI line receives non-maskable interrupts; non-maskable means that the interrupt cannot be blocked. These interrupts have the highest priority and are rarely used. The INTR line is the line on which all interrupts from system devices are received. These interrupts can be masked (blocked).
Since all these interrupt signals need to be multiplexed onto a single CPU line, we need some mechanism through which interrupts from different device controllers can be routed to that line. This routing, or multiplexing, is done by the PIC. The PIC sits between the system devices and the CPU and has multiple input lines, each connected to a different device controller in the system. On the other side, a PIC has only one output line, which is connected to the CPU's INTR line and on which it signals the CPU. There are two PICs joined together: the output of the second PIC is connected to input line 2 (IRQ2) of the first PIC. Because one input of the first PIC is taken up by this cascade, the setup allows a maximum of 15 input lines to which different device controllers can be connected.
PICs have some programmable registers through which the CPU can communicate with them (give commands, mask/unmask interrupt lines, read status). Each PIC has its own copy of the following registers:
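The arithmetic behind the figure of 15 usable lines can be sketched in a few lines of C. This is a model for illustration only, assuming the conventional arrangement of eight inputs per PIC with IRQ 0-7 on the first PIC and IRQ 8-15 on the second; the struct and function names are made up for this example.

```c
#include <stdbool.h>

#define CASCADE_LINE 2 /* input of the first PIC that carries the second PIC's output */

struct pic_input {
    int pic;  /* 0 = first PIC, 1 = second PIC */
    int line; /* input line 0..7 on that PIC */
};

/*
 * Maps an IRQ number (0..15) to the PIC and input line it arrives on.
 * Returns false for out-of-range numbers and for the cascade line,
 * which is not available to a device.
 */
bool irq_to_pic_input(int irq, struct pic_input *out)
{
    if (irq < 0 || irq > 15 || irq == CASCADE_LINE)
        return false;
    out->pic = irq / 8;
    out->line = irq % 8;
    return true;
}

/* Counts the lines devices can use: 8 + 8 inputs minus the cascade input. */
int usable_irq_lines(void)
{
    struct pic_input tmp;
    int count = 0;

    for (int irq = 0; irq < 16; irq++)
        if (irq_to_pic_input(irq, &tmp))
            count++;
    return count;
}
```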
Mask Register
Status Register
The mask register is used to mask or unmask a specific interrupt line. The CPU can ask the PIC to mask (block) a specific interrupt by setting the corresponding bit in the mask register; unmasking is done by clearing that bit. While a particular interrupt is masked, the PIC still receives interrupts on the corresponding input line but does not send the interrupt signal to the CPU, so the CPU keeps doing what it was doing. Interrupts that arrive while a line is masked are not lost: the PIC remembers them and sends the interrupt to the CPU once the CPU unmasks that line.
Masking is different from blocking all interrupts to the CPU. The CPU can ignore all interrupts arriving on the INTR line by clearing the IF (Interrupt Flag) bit in its EFLAGS register. While this bit is clear, interrupts arriving on the INTR line are simply ignored by the CPU; we can think of this as blocking interrupts. So masking is done at the PIC level, where individual interrupt lines can be masked or unmasked, whereas blocking is done at the CPU level and applies to all interrupts arriving at that CPU except NMIs (non-maskable interrupts), which are received on the CPU's NMI line and cannot be blocked or ignored.
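The difference between masking (per line, at the PIC) and blocking (all lines, at the CPU) can be sketched as a small user-space model. Everything here is an assumption made for illustration: the struct pic_model type, its function names, and the rule that the lowest-numbered ready line is delivered first. It does not touch real hardware registers. The only hardware fact relied on is that IF is bit 9 of EFLAGS.

```c
#include <stdint.h>

#define EFLAGS_IF (1u << 9) /* interrupt flag, bit 9 of EFLAGS */

/* Hypothetical model of one PIC: a mask register plus remembered requests. */
struct pic_model {
    uint8_t mask;    /* bit n set: input line n is masked */
    uint8_t pending; /* bit n set: line n raised an interrupt not yet delivered */
};

void pic_mask_line(struct pic_model *p, int line)
{
    p->mask |= (uint8_t)(1u << line);
}

void pic_unmask_line(struct pic_model *p, int line)
{
    p->mask &= (uint8_t)~(1u << line);
}

void pic_raise(struct pic_model *p, int line)
{
    p->pending |= (uint8_t)(1u << line);
}

/*
 * Returns the line delivered to the CPU, or -1 if nothing is delivered.
 * A masked line stays pending (it is not lost); a cleared IF flag blocks
 * every line at once.
 */
int deliver_next(struct pic_model *p, uint32_t eflags)
{
    uint8_t ready = (uint8_t)(p->pending & (uint8_t)~p->mask);

    if (!(eflags & EFLAGS_IF) || ready == 0)
        return -1;
    for (int line = 0; line < 8; line++) {
        if (ready & (1u << line)) {
            p->pending &= (uint8_t)~(1u << line);
            return line;
        }
    }
    return -1;
}
```

In this model an interrupt raised on a masked line is delivered as soon as the line is unmasked and IF is set, which mirrors the "remembered, not lost" behaviour described above.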
Nowadays the interrupt architecture is not as simple as described above: machines use the APIC (Advanced Programmable Interrupt Controller) architecture, which can support up to 256 interrupt vectors. In this architecture every CPU has a built-in local APIC, and devices are connected through an I/O APIC. I won't go into the details of these right now.
Once the interrupt signal is received by the CPU, the CPU performs some hardware checks for which no software machine instructions are executed. Before looking into what these checks are, we need to understand some architecture specific data structures maintained by the kernel.
The kernel needs to maintain one IDT (Interrupt Descriptor Table), which maps each interrupt number to its interrupt handler routine. This table has 256 entries of 8 bytes each. The first 32 entries of this table are used for exceptions and the remaining entries are used for hardware interrupts received from the 'outside world'. The table can contain three different types of entries:
Task Gates, Trap Gates and Interrupt Gates
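The size figures above are easy to check with a small sketch. The names idt_model, idt_size_bytes, idt_entry_offset and is_exception_vector are hypothetical and exist only for this illustration; this is not the kernel's own declaration of the table.

```c
#include <stdint.h>

#define IDT_ENTRIES       256
#define EXCEPTION_VECTORS 32

typedef uint64_t gate_entry_t; /* each IDT entry is 8 bytes wide */

/* Stand-in for the table, used here only to compute sizes. */
static gate_entry_t idt_model[IDT_ENTRIES];

/* Total size of the table in bytes: 256 entries of 8 bytes. */
unsigned long idt_size_bytes(void)
{
    return (unsigned long)sizeof idt_model;
}

/* Byte offset of the entry for a given interrupt number. */
unsigned long idt_entry_offset(unsigned vector)
{
    return (unsigned long)(vector * sizeof(gate_entry_t));
}

/* The first 32 entries are used for exceptions. */
int is_exception_vector(unsigned vector)
{
    return vector < EXCEPTION_VECTORS;
}
```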
Let's see what these gates are and where they are used.
a). Task Gates
Format of task gates is as follows:
0 to 15 bits : reserved (not used)
16 to 31 bits : points to the TSS (Task State Segment) entry of the process to which we need to switch.
32 to 39 bits : these bits are reserved and are not currently used.
40 to 43 bits : specify the type of entry (its value for task gate is 0101)
44th bit : always 0, not used
45 to 46 bits : this specifies the DPL (Descriptor Privilege Level) of the gate entry.
47th bit : specifies if this entry is valid or not (1 - valid, 0 - invalid)
48 to 63 bits : reserved (not used)
Task gates are used in the IDT to allow a user process to make a context switch to another process without asking the kernel to do it. As soon as this gate is hit (an interrupt is received on a line for which there is a task gate in the IDT), the CPU saves the context of the currently running process (the state of the processor's registers) to the TSS of the current process, whose address is held in the TR (Task Register) of the CPU. After saving the context of the current process, the CPU loads its registers with the values stored in the TSS of the new process, which is identified by bits 16-31 of the task gate. Once the registers are loaded with these new values, the processor is running the new process and the context switch is done. Linux does not use task gates for ordinary interrupt handling; it relies on the trap and interrupt gates in the IDT. I will not explain task gates any further.
b). Trap Gates
Format of trap gates is as follows:
0-15 bits : lower 16 bits of a pointer to the kernel function which needs to be invoked when this gate is hit
16-31 bits : indicates the index of the segment descriptor in the GDT (Global Descriptor Table)
32-36 bits : these bits are reserved and are not currently used.
37-39 bits : always 000, not used
40-43 bits : specify the type of entry (its value for a trap gate is 1111)
44th bit : always 0, not used
45-46 bits : this specifies the DPL (Descriptor Privilege Level) of the gate entry.
47th bit : specifies if this entry is valid or not (1 - valid, 0 - invalid)
48-63 bits : upper 16 bits of a pointer to the kernel function which needs to be invoked when this gate is hit
Trap gates are used to handle exceptions generated by the CPU. Bits 0-15 and bits 48-63 together form the pointer (an offset into the segment identified by bits 16-31 of this entry) to a kernel function. The only difference between trap gates and interrupt gates is that, whenever an interrupt gate is hit, the CPU automatically disables interrupts by clearing the IF flag in the CPU's EFLAGS register. In the case of a trap gate this is not done and interrupts remain enabled. As mentioned earlier, trap gates are used for exceptions, so in the Linux kernel the first 32 entries in the IDT are initialized with trap gates. In addition to this, the Linux kernel also uses a trap gate for the system call entry (entry 128 of the IDT).
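The bit layout listed above can be exercised with a short sketch that packs the fields into a 64-bit value and reads them back. The function names are hypothetical, and the handler address and selector used in the usage note are made-up sample values, not values taken from the kernel.

```c
#include <stdint.h>

#define GATE_TYPE_TRAP 0xFu /* 1111 in bits 40-43 */

/* Builds a gate entry from its fields, following the layout described above. */
uint64_t pack_gate(uint32_t handler, uint16_t selector, unsigned type, unsigned dpl)
{
    uint64_t g = 0;

    g |= (uint64_t)(handler & 0xFFFFu);   /* bits 0-15: lower half of pointer */
    g |= (uint64_t)selector << 16;        /* bits 16-31: segment in the GDT */
    g |= (uint64_t)(type & 0xFu) << 40;   /* bits 40-43: type of entry */
    g |= (uint64_t)(dpl & 0x3u) << 45;    /* bits 45-46: DPL */
    g |= (uint64_t)1 << 47;               /* bit 47: entry is valid */
    g |= (uint64_t)(handler >> 16) << 48; /* bits 48-63: upper half of pointer */
    return g;
}

/* Recombines bits 0-15 and bits 48-63 into the handler pointer. */
uint32_t gate_handler(uint64_t g)
{
    return (uint32_t)(g & 0xFFFFu) | ((uint32_t)(g >> 48) << 16);
}

uint16_t gate_selector(uint64_t g) { return (uint16_t)(g >> 16); }
unsigned gate_type(uint64_t g)     { return (unsigned)((g >> 40) & 0xFu); }
unsigned gate_dpl(uint64_t g)      { return (unsigned)((g >> 45) & 0x3u); }
```

For example, pack_gate(0xC0123456, 0x0060, GATE_TYPE_TRAP, 3) places 0x3456 in the lowest 16 bits and 0xC012 in the highest 16 bits, and gate_handler() returns the original 0xC0123456.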
c). Interrupt Gates
Format is as follows:
0-15 bits : lower 16 bits of a pointer to the kernel function which needs to be invoked when this gate is hit
16-31 bits : indicates the index of the segment descriptor in the GDT (Global Descriptor Table)
32-36 bits : these bits are reserved and are not currently used.
37-39 bits : always 000, not used
40-43 bits : specify the type of entry (its value for an interrupt gate is 1110)
44th bit : always 0, not used
45-46 bits : this specifies the DPL (Descriptor Privilege Level) of the gate entry.
47th bit : specifies if this entry is valid or not (1 - valid, 0 - invalid)
48-63 bits : upper 16 bits of a pointer to the kernel function which needs to be invoked when this gate is hit
The format of interrupt gates is the same as that of the trap gates explained above, except for the value of the type field (bits 40-43). For trap gates this field has the value 1111, and for interrupt gates it has the value 1110.
Note: whenever an interrupt gate is hit, interrupts are disabled automatically.
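The one behavioural difference between the two gate types can be written down as a tiny function. This is a sketch under the layout described above; the function names are hypothetical, and the EFLAGS value in the usage note is just a sample.

```c
#include <stdint.h>

#define EFLAGS_IF           (1u << 9) /* interrupt flag, bit 9 of EFLAGS */
#define GATE_TYPE_INTERRUPT 0xEu      /* 1110 */
#define GATE_TYPE_TRAP      0xFu      /* 1111 */

/* Extracts the type field (bits 40-43) from a gate entry. */
unsigned gate_type_of(uint64_t gate)
{
    return (unsigned)((gate >> 40) & 0xFu);
}

/*
 * Returns the EFLAGS value the handler starts with: an interrupt gate
 * clears IF, a trap gate leaves it unchanged.
 */
uint32_t eflags_after_gate(uint64_t gate, uint32_t eflags)
{
    if (gate_type_of(gate) == GATE_TYPE_INTERRUPT)
        return eflags & ~EFLAGS_IF;
    return eflags;
}
```

Entering through an interrupt gate with EFLAGS 0x202 leaves 0x002 (IF cleared), while a trap gate leaves 0x202 untouched.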
Whenever an exception or interrupt occurs, the corresponding trap/interrupt gate is hit and the CPU performs some checks with fields of these gates. Things done by the CPU are as follows:
1). get the ith entry from the IDT (the base address and the size of the IDT are stored in the IDTR register of the CPU); here 'i' means the interrupt number.
2). read the segment descriptor index from bits 16-31 of the IDT entry; let's call this 'n'
3). get the segment descriptor from the 'n'th entry in the GDT (the base address and the size of the GDT are stored in the GDTR register of the CPU)
4). the DPL of the nth entry in the GDT must be less than or equal to the CPL (the Current Privilege Level, held in the read-only lowermost two bits of the CS register). We will discuss later what this check means and why it is done. In short:
DPL (of GDT entry) > CPL : general protection exception
DPL (of GDT entry) < CPL : we are entering a higher privilege level (probably from user to kernel mode). In this case the CPU switches the hardware stack (SS and ESP registers) from the currently running process's user mode stack to its kernel mode stack. We will see later exactly how this stack switch is done. Note: the stack switch is mentioned here, but it actually happens after the 5th step below.
5). for software interrupts (generated by the assembly instruction 'int'), one more check is done. This check is not performed for hardware interrupts (interrupts generated by system devices and forwarded by the PIC). Put simply:
DPL (of IDT entry) >= CPL : ok, we have permission to enter through this gate
DPL (of IDT entry) < CPL : general protection exception
6). switches the stack if DPL (of GDT entry) < CPL. In addition, the mode of the CPU (the least significant two bits of CS) is changed from the CPL to the DPL (of GDT entry)
7). if the stack switch has taken place (SS and ESP registers reset to the kernel stack), pushes the old values of SS and ESP (pointing to the user stack) onto this new stack (the kernel stack)
8). pushes the EFLAGS, CS and EIP registers onto the stack (note: we are now working on the kernel stack). This saves the pointer to the user application instruction to which we need to return after servicing the interrupt or exception
9). in the case of exceptions, if there is a hardware error code, the processor pushes that onto the kernel stack as well
10). loads CS with the segment selector for the GDT entry (bits 16-31 of the IDT entry) and EIP with the offset from the IDT entry (bits 0-15 combined with bits 48-63)
All of the above is done by the CPU hardware without the execution of any software instruction. The checks performed in steps 4 and 5 above are important.
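The two privilege checks and the stack-switch decision from the steps above can be summarised in a short model. This is only a sketch of the rules as described in this section, not processor microcode; struct entry_request and check_entry are hypothetical names chosen for the example.

```c
#include <stdbool.h>

enum entry_result { ENTRY_OK, ENTRY_GP_FAULT };

/* Inputs to the checks, as described in steps 4 and 5. */
struct entry_request {
    unsigned cpl;         /* current privilege level (low two bits of CS) */
    unsigned gate_dpl;    /* DPL stored in the IDT entry */
    unsigned segment_dpl; /* DPL of the GDT entry the gate refers to */
    bool software_int;    /* true when raised by an 'int' instruction */
};

/*
 * Applies the checks in order. On success, *stack_switch says whether the
 * CPU would switch to a more privileged stack (step 6).
 */
enum entry_result check_entry(const struct entry_request *r, bool *stack_switch)
{
    *stack_switch = false;

    /* Step 4: the handler's segment must not be less privileged than we are. */
    if (r->segment_dpl > r->cpl)
        return ENTRY_GP_FAULT;

    /* Step 5: only for software interrupts, the gate itself must allow entry. */
    if (r->software_int && r->gate_dpl < r->cpl)
        return ENTRY_GP_FAULT;

    /* Step 6: entering a more privileged level requires a stack switch. */
    *stack_switch = r->segment_dpl < r->cpl;
    return ENTRY_OK;
}
```

In this model an 'int' instruction issued at CPL 3 passes only if the gate's DPL is 3, whereas a hardware interrupt arriving at CPL 3 passes even when the gate's DPL is 0, because step 5 is skipped for it.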
The 4th check makes sure that the code we are going to execute (the Interrupt Service Routine) does not lie in a segment with lesser privilege. Obviously the ISR cannot be in a less privileged segment than the one we are currently in. The DPL or CPL can have 4 values (0, 1 and 2 for kernel mode and 3 for user mode). Out of these four only two are used: 0 (for kernel mode) and 3 (for user mode).
The 5th check makes sure that an application can enter kernel mode through specific gates only; in Linux, only through gate entry 128, which is used for system call invocation. If we set the DPL field of an IDT entry to 0, 1 or 2, an application program (running with CPL 3) cannot enter through that gate entry. If it tries, the CPU generates a general protection exception. This is the reason that in Linux the DPL fields of all the IDT entries (except entry 128, used for system calls) are initialized to 0, which makes sure that only kernel code can use these gates, not application code. In Linux, entry 128 (used for system calls) is a trap gate and its DPL is initialized to 3, so that application code can enter through this gate with the assembly instruction "int 0x80".
Now let's see how the stack switch happens when DPL (of GDT entry) < CPL. The CPU has a TR (Task Register), which points to the TSS (Task State Segment) of the currently running process. The TSS is an architecture-defined data structure which holds the state of the processor registers whenever a context switch of this process happens. The TSS includes three sets of SS and ESP fields, one for each of the processor levels 0, 1 and 2. These fields specify the stack to be used whenever we enter that processor level.
Let's say the DPL value in the GDT entry is 0. In this case the CPU loads the SS register with the value of the SS field for level 0 in the TSS, and the ESP register with the value of the ESP field for level 0 in the TSS. After loading SS and ESP with these values, the CPU is pointing to the kernel level stack of the current process. The old values of SS and ESP (which the CPU keeps internally during the switch) are now pushed onto this new kernel level stack; this is done because we need to return to the old stack once we have serviced the interrupt, exception or system call.
Careful readers may be wondering why there is no field for a level 3 stack in the TSS. The reason is that we never use the CPU's stack switching mechanism to switch from a higher CPU level (kernel mode - 0, 1 and 2) to a lower CPU level (user mode - 3). This is why the CPU, while entering the higher level (kernel mode), saves the previously used lower level stack (user mode) on the kernel stack.
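The stack switch and the pushes that follow it can be sketched as a user-space model. All type and function names below are hypothetical, the memory is just a small array of 32-bit words indexed by ESP / 4, and the selector and address values in the usage note are sample values. The model leaves out the IF handling and the optional hardware error code.

```c
#include <stdint.h>

/* Simplified TSS: one SS/ESP pair for each of the levels 0, 1 and 2. */
struct tss_model {
    uint16_t ss[3];
    uint32_t esp[3];
};

/* The registers this model cares about. */
struct cpu_model {
    uint16_t cs, ss;
    uint32_t eip, esp, eflags;
};

/* Pushes one 32-bit value: the stack grows downwards, ESP is a byte address. */
static void push32(struct cpu_model *cpu, uint32_t *mem, uint32_t value)
{
    cpu->esp -= 4;
    mem[cpu->esp / 4] = value;
}

/*
 * Models steps 6 to 10: optional stack switch, saving of the old stack,
 * saving of EFLAGS/CS/EIP, then the jump to the handler.
 * target_level is the DPL of the GDT entry the gate refers to.
 */
void enter_handler(struct cpu_model *cpu, const struct tss_model *tss,
                   uint32_t *mem, unsigned target_level,
                   uint16_t new_cs, uint32_t new_eip)
{
    unsigned cpl = cpu->cs & 3u; /* CPL is the low two bits of CS */
    uint16_t old_ss = cpu->ss;
    uint32_t old_esp = cpu->esp;

    if (target_level < cpl) {
        /* Load the stack for the new level from the TSS ... */
        cpu->ss = tss->ss[target_level];
        cpu->esp = tss->esp[target_level];
        /* ... and save the old stack on the new one. */
        push32(cpu, mem, old_ss);
        push32(cpu, mem, old_esp);
    }
    push32(cpu, mem, cpu->eflags);
    push32(cpu, mem, cpu->cs);
    push32(cpu, mem, cpu->eip);

    cpu->cs = new_cs;
    cpu->eip = new_eip;
}
```

Starting in user mode (CS with low bits 3) and entering level 0, the model leaves five words on the kernel stack, from highest to lowest address: old SS, old ESP, EFLAGS, old CS and old EIP. When the CPU is already at level 0, only the last three are pushed, on the stack already in use.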
Once all this CPU action is done, the CPU's CS and EIP registers point to the kernel function written for handling the interrupt or exception. The CPU simply starts executing the instructions at this point (we are now in kernel mode - level 0).
As this is the software part of interrupt handling and may be of interest to a wider audience, I have written it up on a separate page; please find it here.