Some History
Earliest computers were very pure von Neumann machines: all IO had to go through CPU. No notion of DMA, etc.
IBM introduced idea of hardware ``channels'' to manage IO. Switch between CPU, devices, memory. Probably earliest example of parallel processing.
For a long time, the most reasonable way to distinguish between a ``minicomputer'' and a ``mainframe'' was by whether or not there were dedicated IO and memory busses, or if everything plugged into a single bus. Advantage of former system is speed; memory bus doesn't have to worry about arbitration, so memory accesses can be faster. Advantages of latter system are cost and uniformity.
Today, the old minicomputer architecture is pretty much completely obsolete. If we look at a modern desktop computer, we can see that its bus structure looks a lot like the old-time mainframes.
This figure (which comes from an AMD document from 2001) is pretty obsolete today, but I like it because it does a good job showing the interrelationships between all the various busses that are in use today, even though the functions are migrating between the components over time. Let's go over some of the items in the figure...
Notice there are two PCI busses: a 66 MHz, 64-bit PCI between the North- and Southbridge, and a 33 MHz, 32-bit PCI hanging off the Southbridge. The original PCI specification was for a 33 MHz, 32-bit bus; this has been extended by doubling both the speed and the width. This chipset supports both the older, slower PCI and the newer, faster one.
In general, the extended PCIs haven't really taken off. While faster, they weren't compellingly faster enough to warrant switching. At this point, PCI is being replaced by PCI-Express, a scalable serial interconnect.
Putting everything on a single bus leads immediately to a uniform model of accessing memory and IO devices: just put devices in the memory space. Question: are you better off losing opcodes to IO instructions, or space to IO devices? When minicomputers had 16 bit address spaces, this was a valid question! Today, with memory spaces which are huge in comparison to the number of IO ports required to handle devices, taking advantage of the richness of the regular instruction set to deal with device access only makes sense.
I used to argue at this point that using memory-mapped IO made it easier to write device drivers in C, since you could map devices to C structures. Unfortunately, modern C compilers pad structures for performance reasons, and trying to coerce the compiler into producing the memory layout you really want is non-portable and deprecated. So I'm leaving the argument in place, but it's really for historical interest at this point. The comments about trying to create in and out instructions do remain valid, however.
Also, when writing device drivers in C, it's a lot easier to work when devices are in memory space. Let's suppose you have a simple device (for concreteness, let's look at the A/D converter on an HC11). It's controlled by five registers (not counting the OPTION register), located at addresses $1030-$1034. We can define a struct that looks like this:
struct ad {
    unsigned char adctl;
    unsigned char adr1 __attribute__ ((packed));
    unsigned char adr2 __attribute__ ((packed));
    unsigned char adr3 __attribute__ ((packed));
    unsigned char adr4 __attribute__ ((packed));
};
/* volatile: the hardware changes these registers behind the compiler's back */
volatile struct ad *adcon;
and define some macros:
#define CCF 0x80
#define SCAN 0x20
#define MULT 0x10
Now, in our code, we can say
adcon = (volatile struct ad *) 0x1030;
and we can control the device by saying things like
adcon->adctl = SCAN | 3;
and look at the state of the device by saying things like
while (!(adcon->adctl & CCF)); /* spin until the conversion completes */
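Given the struct-layout worries mentioned above, one alternative is to skip the struct and address each register by its offset from a base pointer. What follows is only a minimal sketch under assumptions: the helper names (reg_write, reg_read, ad_start_scan, ad_done) are hypothetical, and a plain array (fake_ad_regs) stands in for the register block at $1030 so the code can run on an ordinary host rather than an HC11.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets from the base of the A/D block (ADCTL at +0,
   ADR1..ADR4 at +1..+4), matching the layout described in the text. */
enum { ADCTL = 0, ADR1 = 1, ADR2 = 2, ADR3 = 3, ADR4 = 4, AD_NREGS = 5 };

#define CCF  0x80
#define SCAN 0x20
#define MULT 0x10

/* ASSUMPTION: on real hardware the base would be (volatile uint8_t *)0x1030.
   Here an array simulates the registers so the sketch runs anywhere. */
static volatile uint8_t fake_ad_regs[AD_NREGS];
static volatile uint8_t *const ad_base = fake_ad_regs;

/* Hypothetical helpers: every access goes through a volatile pointer,
   so the compiler may not cache or reorder the register reads/writes. */
static inline void reg_write(volatile uint8_t *base, unsigned off, uint8_t v)
{
    assert(off < AD_NREGS);
    base[off] = v;
}

static inline uint8_t reg_read(volatile uint8_t *base, unsigned off)
{
    assert(off < AD_NREGS);
    return base[off];
}

/* Start a scan on channel 3, as in the struct-based example. */
static void ad_start_scan(void)
{
    reg_write(ad_base, ADCTL, SCAN | 3);
}

/* Nonzero once the "conversion complete" flag is set. */
static int ad_done(void)
{
    return (reg_read(ad_base, ADCTL) & CCF) != 0;
}
```

The layout now lives in the offsets rather than in whatever the compiler decides to do with a struct, at the cost of slightly noisier code.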
On the other hand, trying to generate IO instructions for Intel in C is... bizarre. The macros doing it for Linux are in a file called /usr/include/asm/io.h; you're welcome to take a look at it if you want to figure this stuff out. Here's a relevant comment from the file:
* This file is not meant to be obfuscating: it's just complicated
* to (a) handle it all in a way that makes gcc able to optimize it
* as well as possible and (b) trying to avoid writing the same thing
* over and over again with slight variations and possibly making a
* mistake somewhere.
And an old comment from an older version of the code:
/*
* Talk about misusing macros..
*/
Just as a sample of what they're talking about, here's a macro definition from the file:
#define __BUILDIO(bwl,bw,type) \
static inline void out##bwl##_quad(unsigned type value, int port, int quad) { \
    if (xquad_portio) \
        write##bwl(value, XQUAD_PORT_ADDR(port, quad)); \
    else \
        out##bwl##_local(value, port); \
} \
static inline void out##bwl(unsigned type value, int port) { \
    out##bwl##_quad(value, port, 0); \
} \
static inline unsigned type in##bwl##_quad(int port, int quad) { \
    if (xquad_portio) \
        return read##bwl(XQUAD_PORT_ADDR(port, quad)); \
    else \
        return in##bwl##_local(port); \
} \
static inline unsigned type in##bwl(int port) { \
    return in##bwl##_quad(port, 0); \
}
Good luck.
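If it helps, the core trick in that macro is token pasting (##): one macro body gets stamped out once per access width, producing a family of functions whose names differ only in a suffix. Here is a stripped-down sketch of just that technique. It is not the kernel's code: BUILD_PORT_IO, the out_b/in_b style names, and the fake_ports array (standing in for the real port space, since a user program can't just issue in and out instructions) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: a simulated 64 KiB port space; real code would execute
   in/out instructions instead of touching an array. */
static uint8_t fake_ports[65536];

/* Hypothetical macro: sfx is pasted onto "out_" and "in_" to name the
   generated functions; type is the access width. Multi-byte accesses are
   stored little-endian across consecutive ports. */
#define BUILD_PORT_IO(sfx, type)                                          \
static inline void out_##sfx(type value, uint16_t port) {                \
    assert(port + sizeof(type) <= sizeof fake_ports);                     \
    for (unsigned i = 0; i < sizeof(type); i++)                           \
        fake_ports[port + i] = (uint8_t)(value >> (8 * i));               \
}                                                                         \
static inline type in_##sfx(uint16_t port) {                              \
    type value = 0;                                                       \
    assert(port + sizeof(type) <= sizeof fake_ports);                     \
    for (unsigned i = 0; i < sizeof(type); i++)                           \
        value = (type)(value | ((type)fake_ports[port + i] << (8 * i)));  \
    return value;                                                         \
}

/* Three expansions give out_b/in_b, out_w/in_w, and out_l/in_l. */
BUILD_PORT_IO(b, uint8_t)
BUILD_PORT_IO(w, uint16_t)
BUILD_PORT_IO(l, uint32_t)
```

After the three expansions, a call like out_w(0xBEEF, 0x3F8) writes two consecutive simulated ports, and in_b(0x3F8) reads the low byte back.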
We can classify IO devices, and IO programming techniques, according to the degree to which we can off-load the IO processing to the device:
Sampling: Always valid data, CPU can read whenever it wants
Polling: CPU must query device to see whether it is ready
Interrupts: Device informs CPU that it is ready
DMA: Device is able to control transfer of data to/from memory; requests interrupt when it's done
IO Controllers: Device performs a series of IO operations without intervention, requests interrupt when it's done
IO Coprocessors: Device is a separate, fully programmable computer
As we move down the list, we have progressively less work for the CPU to do, and more sophistication required by the device (with a correspondingly greater level of difficulty for the programmer). Correct tradeoff varies by device.
With the exception of sampling, each of these forms of IO typically includes the ones before it: an IO controller will also use DMA, a device that does DMA also requests interrupts, and you can poll a device that is capable of doing interrupts.
Examples of sampling, polling, and interrupts are present on HC11.
Sampling: digital input port, motor port
In these simple devices, the device is always ready to accept or to provide data, as appropriate. The interface is extremely simple, consisting of just a data register.
Polling: analog port (though analog port can be programmed to go into a mode such that, once data has gone valid, you can sample)
Polling requires that you not only have a way to read and write data (the data register), but also ways to control the device and to determine its status. These are provided by command and status registers; frequently, they are combined into a single command/status register (CSR): when you read it, you get the status register; when you write it, you write to the command register. Frequently in memory-mapped systems, the CSR is implemented so the bits are compatible; you can do operations such as ``oring'' a bit in with the CSR contents and have the result be something meaningful.
In the case of the HC11 analog port, the CSR is at address 0x1030. Here, the CCF flag is a ``done'' bit. CPU can keep checking CCF; when it goes to 1, CPU knows that valid data is available. No reason to bother with interrupts on this device, since it takes exactly 32 cycles to complete a conversion; starting the process actually starts 4 conversions, so CCF always goes to 1 in 128 cycles. Too fast for interrupts to help us. For that matter, since the time is deterministic, there's no particular reason it should have given us the CCF flag (except that it's easier than counting).
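To make the polling loop concrete, here is a small simulation. Everything about the device model is an assumption for illustration: struct sim_adc, adc_write_ctl, adc_read_ctl, and adc_convert are hypothetical names, the model treats a write to the control register as clearing CCF and starting a 128-cycle sequence, and each status read is charged a caller-supplied number of cycles. It's a sketch of the technique, not a model of real HC11 timing.

```c
#include <assert.h>
#include <stdint.h>

#define CCF  0x80
#define SCAN 0x20

/* Simulated A/D converter (hypothetical model, not real hardware). */
struct sim_adc {
    uint8_t  adctl;        /* combined command/status register */
    unsigned cycles_left;  /* cycles until the conversions finish */
};

/* Writing the control register clears CCF and starts the 4 conversions
   (4 x 32 = 128 cycles, per the figures in the text). */
static void adc_write_ctl(struct sim_adc *d, uint8_t v)
{
    d->adctl = v & (uint8_t)~CCF;
    d->cycles_left = 128;
}

/* Reading the status costs cycles_per_poll cycles of simulated time;
   once the remaining time is used up, the device sets CCF. */
static uint8_t adc_read_ctl(struct sim_adc *d, unsigned cycles_per_poll)
{
    assert(cycles_per_poll > 0);
    if (d->cycles_left > cycles_per_poll) {
        d->cycles_left -= cycles_per_poll;
    } else {
        d->cycles_left = 0;
        d->adctl |= CCF;
    }
    return d->adctl;
}

/* Start a conversion and poll until done; returns how many status
   reads it took, which is the CPU time burned waiting. */
static unsigned adc_convert(struct sim_adc *d, uint8_t channel,
                            unsigned cycles_per_poll)
{
    unsigned reads = 0;
    adc_write_ctl(d, channel);
    do {
        reads++;
    } while (!(adc_read_ctl(d, cycles_per_poll) & CCF));
    return reads;
}
```

With an 8-cycle poll loop this takes 16 status reads, which is the point of the argument above: the wait is so short that the bookkeeping for an interrupt would cost more than just spinning.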
Interrupts: serial port
A flag to tell us when data is available in the input port, and a flag to tell us when the output port can take data. If either of these flags goes true, device signals an interrupt. Notice that input and output are logically separate devices that share an address; which of these devices is responsible for an interrupt is up to us to discover.
The new wrinkle here is that the device can request service from the CPU when it's ready. Extra bits required here are normally some way of globally controlling whether interrupts are enabled (or controlling interrupts for sets of devices determined by priority), and individual control of whether a specific device can request an interrupt.
When an interrupt occurs, the necessary steps are:
1. The device finishes some task, and requires CPU service. If its interrupts are enabled it will go on to the next step; otherwise it will stop here and, normally, not request an interrupt (though some devices will remember they want an interrupt, and request it if their interrupts ever become enabled).
2. The device requests service from the CPU. There is typically some handshaking during which the CPU determines whether interrupts are globally enabled, the device identifies itself, and the CPU determines whether the device is permitted to request an interrupt at the moment. The details of these tests vary widely. If the device is permitted to interrupt, we go on to the next step; if not, we wait here. In that case, if interrupts from the device class are ever enabled, the pending interrupt will be serviced.
3. The CPU saves enough of its prior state to recover the former computation, changes to kernel mode, and branches to a location determined by the interrupt. The interrupt service routine is located at this location.
This is the last step in the interrupt request/service operation. At this point, the problem is turned over to software. Return from the interrupt service routine is normally performed by some sort of ``return from interrupt'' instruction, which restores the previous state of the computation.
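The request/service sequence above can be mimicked in software. The following is only a toy model under stated assumptions: struct sim_cpu, device_raise, cpu_poll_interrupts, and count_isr are hypothetical names, the "saved state" is reduced to a single pc field, and a device whose interrupts are disabled simply drops its request (the common case described in the first step).

```c
#include <assert.h>
#include <stdbool.h>

#define NDEV 4

typedef void (*isr_t)(int dev);

/* Toy CPU: just enough state to walk through the interrupt steps. */
struct sim_cpu {
    bool  global_enable;     /* are interrupts enabled at all? */
    bool  dev_enable[NDEV];  /* may this device request an interrupt? */
    bool  pending[NDEV];     /* request raised but not yet serviced */
    isr_t vector[NDEV];      /* where to branch for each device */
    int   pc;                /* stand-in for the interrupted computation */
};

static int isr_count[NDEV];
static void count_isr(int dev) { isr_count[dev]++; }

/* Step 1: the device wants service; it only raises a request if its
   own interrupt enable is set. */
static void device_raise(struct sim_cpu *c, int dev)
{
    assert(dev >= 0 && dev < NDEV);
    if (c->dev_enable[dev])
        c->pending[dev] = true;
}

/* Steps 2 and 3: if interrupts are globally enabled, save state, branch
   through the vector, then restore state ("return from interrupt").
   Returns the number of interrupts serviced. */
static int cpu_poll_interrupts(struct sim_cpu *c)
{
    int serviced = 0;
    for (int dev = 0; dev < NDEV && c->global_enable; dev++) {
        if (!c->pending[dev])
            continue;
        int saved_pc = c->pc;        /* save the prior state */
        c->global_enable = false;    /* no nested interrupts in this model */
        c->pending[dev] = false;
        if (c->vector[dev])
            c->vector[dev](dev);     /* run the service routine */
        c->pc = saved_pc;            /* restore the prior state */
        c->global_enable = true;
        serviced++;
    }
    return serviced;
}
```

A request raised while interrupts are globally disabled stays pending and is serviced on the first call after global_enable is set, which matches the "we wait here" case in the second step.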
It's important to understand just what's meant by the ``previous state of the computation'' -- it must be possible for the process that was running at the time of the interrupt to resume with no impact on the process. You have to be able to interrupt between the setting of condition codes and the execution of a branch instruction that makes use of them, for instance. Occasionally, processors have instructions that take so long to execute that it's necessary to be able to interrupt the instruction itself, and then resume that same instruction later (block move instructions, which move a large amount of data from one location to another in memory, tend to be in this category). These instructions typically maintain their intermediate states in registers as they proceed.
As you can imagine, this is a particular problem with processors that use out-of-order execution. Intel has devoted a lot of resources in their processors to it; it's the whole reason for the in-order retirement buffer. IBM used a scheme they called ``imprecise interrupts,'' which meant that the saved PC would be ``near'' where the interrupt happened. This was acceptable for device interrupts, but made debugging program exceptions very difficult. CDC's CPU didn't do interrupts (IO was handled by peripheral processors, to be described later), but faced much the same problem in their context swap instruction.
One last thing to notice is that interrupts become a very expensive operation for deeply pipelined and out-of-order processors. It's substantially worse than a branch penalty; fortunately, interrupts are much less common than branches.
The key feature that makes interrupts the desired solution for a device is for an operation performed by the device to take long enough that requiring the software to check on it periodically would result in an unacceptable overhead.
More sophisticated IO mechanisms are present on other systems.
DMA
A system with DMA will normally have, in addition to the command and status registers, an address and a byte-count register. This is appropriate for situations in which relatively large blocks of data must be transferred between the memory and the device. Consider transferring a buffer from the memory to a disk drive: the CPU must inform the disk as to where on the disk to put the data, and then load the DMA controller with the starting address of the data to be sent and the count of the number of bytes. Now the DMA controller can pull the data out of memory without interfering with the CPU; after the transfer is done, it requests an interrupt.
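Here is a rough simulation of that register set and the memory-to-disk transfer. All of it is assumed for illustration: struct sim_dma, the DMA_START and DMA_DONE bits, dma_program, and dma_run are hypothetical names, and two small arrays stand in for main memory and a disk block. On real hardware dma_run would be the controller working on its own while the CPU does something else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DMA_START 0x01   /* hypothetical command bit */
#define DMA_DONE  0x80   /* hypothetical status bit */

/* The registers the text describes: command/status, address, byte count. */
struct sim_dma {
    uint8_t  csr;
    uint32_t addr;    /* next memory address to transfer from */
    uint32_t count;   /* bytes remaining */
    bool     irq;     /* interrupt request line to the CPU */
};

static uint8_t memory[256];     /* simulated main memory */
static uint8_t disk_block[64];  /* simulated destination on the disk */

/* CPU side: load the address and count, then tell the controller to go. */
static void dma_program(struct sim_dma *d, uint32_t addr, uint32_t count)
{
    assert(addr + count <= sizeof memory && count <= sizeof disk_block);
    d->addr  = addr;
    d->count = count;
    d->irq   = false;
    d->csr   = DMA_START;
}

/* Controller side: move the whole block without CPU involvement,
   then set the done bit and request an interrupt. */
static void dma_run(struct sim_dma *d)
{
    if (!(d->csr & DMA_START))
        return;
    uint32_t i = 0;
    while (d->count > 0) {
        disk_block[i++] = memory[d->addr++];
        d->count--;
    }
    d->csr = DMA_DONE;
    d->irq = true;
}
```

The CPU's whole involvement is the four stores in dma_program plus handling one interrupt at the end, no matter how large the block is.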
IO Processors
IBM is the company most commonly associated with IO processors, which they call ``channels.'' A channel is a very simple computer, capable of executing a small set of general-purpose instructions along with many special-purpose IO instructions such as ``read a track from the disk'' or ``skip forward two blocks on the tape.'' The CPU would construct a sequence of instructions for the channel to perform (a channel control program) in main memory, and would send the channel a start signal. The channel would execute the entire channel control program before interrupting the CPU.
IBM mainframe CPUs would frequently show utilization that would seem completely unacceptable to us today -- half of their time, or more, in the OS. But that was really OK, because what they were doing was more managing the IO than executing user code: doing a corporate payroll takes remarkably little processing in comparison to the huge amount of IO involved, and IO is what mainframes are all about.
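The channel control program idea can be sketched as a tiny interpreter. To be clear about what's invented: the command set (CCW_SEEK, CCW_READ_BLOCK, CCW_SKIP_BLOCKS, CCW_END), struct ccw, struct sim_channel, and channel_start are all hypothetical and bear no relation to IBM's actual command encoding. The only point is the shape of the interaction: the CPU builds a list of commands in memory, starts the channel, and hears back once.

```c
#include <assert.h>
#include <stddef.h>

/* Invented command set for illustration only. */
enum ccw_op { CCW_SEEK, CCW_READ_BLOCK, CCW_SKIP_BLOCKS, CCW_END };

struct ccw {
    enum ccw_op op;
    int         arg;   /* block number or block count, depending on op */
};

struct sim_channel {
    int position;     /* current block on the device */
    int blocks_read;  /* blocks transferred so far */
    int interrupts;   /* interrupts delivered to the CPU */
};

/* The "start signal": the channel runs the entire program on its own
   and interrupts the CPU exactly once, at the end. */
static void channel_start(struct sim_channel *ch, const struct ccw *prog)
{
    assert(prog != NULL);
    for (const struct ccw *p = prog; p->op != CCW_END; p++) {
        switch (p->op) {
        case CCW_SEEK:
            ch->position = p->arg;
            break;
        case CCW_READ_BLOCK:
            ch->blocks_read++;
            ch->position++;
            break;
        case CCW_SKIP_BLOCKS:
            ch->position += p->arg;
            break;
        case CCW_END:
            break;   /* not reached; the loop stops first */
        }
    }
    ch->interrupts++;
}
```

A program of seek, read, skip two, read costs the CPU one start and one interrupt, where an interrupt-per-operation device would have bothered it four times.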
Peripheral Processing Units
The most extreme case of off-loading the IO task from the CPU could be found in CDC and Cray computers. Here, all IO is performed by a front-end computer, and none by the main CPU. In the case of the CDC 6600, there were actually 10 IO computers (CDC referred to these as Peripheral Processing Units, in contrast to the Central Processing Unit); a program would request service by placing a code word in a known location. The IO computers performed all IO, and were also capable of executing a special instruction (called an exchange jump) that would cause the CPU to save its entire state and then load up a new state - effectively, causing it to perform a context switch. CDC actually ran the operating system itself on one of the IO computers.
In the case of the Cray 1, the front-end computers were purchased from DEC or Data General.
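The ``code word in a known location'' convention is easy to sketch: the CPU never touches a device, it just writes a request where the IO computer will look. The names here (struct mailbox, the REQ_ codes, cpu_request, ppu_scan) are made up, the reply value is a placeholder for actually doing the IO, and the sketch makes no attempt to model the real CDC hardware or its exchange jump.

```c
#include <assert.h>

/* Hypothetical request codes; REQ_NONE means the slot is empty. */
enum { REQ_NONE = 0, REQ_READ = 1, REQ_WRITE = 2 };

/* The "known location" shared by the CPU and the IO computer. */
struct mailbox {
    int request;  /* written by the CPU, cleared by the IO computer */
    int reply;    /* written by the IO computer */
};

/* CPU side: requesting service is a single store. */
static void cpu_request(struct mailbox *mb, int code)
{
    assert(code != REQ_NONE);
    mb->request = code;
}

/* IO computer side: one pass of its scan loop. Returns 1 if a request
   was found and serviced, 0 if the mailbox was empty. */
static int ppu_scan(struct mailbox *mb)
{
    if (mb->request == REQ_NONE)
        return 0;
    mb->reply = mb->request * 100;  /* placeholder for performing the IO */
    mb->request = REQ_NONE;         /* mark the slot free again */
    return 1;
}
```

Notice that the roles are reversed from the polling case earlier: here it's the IO computer that polls, and the CPU that gets left alone.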
One last thing: in the current environment, attempting to classify devices according to this crisp scheme is frequently very difficult. Probably the best example is current disk drives, which appear to the CPU as simple DMA-driven devices: The CPU tells the drive which logical block to write to or read from, the drive does it, and the drive requests an interrupt. But (1) decoding the logical block address into an actual location on the disk is quite complex, and (2) the disk actually caches reads and writes so that a read occurring shortly after a write doesn't actually require a disk access, and the disk drive does its own scheduling of the reads and writes. So it can almost be classified as an IO processor.
Likewise, modern graphics cards do far more of the rendering, hidden surface calculation, and other graphics operations than the CPU does (in fact, last I heard typical GPUs had more transistors than typical CPUs!). The CPU hands the list of polygons and information about their characteristics to the graphics card, and just lets 'er rip.