Concurrency and Its Management
In a modern Linux system, there are numerous sources of concurrency and, therefore, possible race conditions. Multiple user-space processes are running, and they can access your code in surprising combinations of ways. SMP systems can be executing your code simultaneously on different processors. Kernel code is preemptible; your driver's code can lose the processor at any time, and the process that replaces it could also be running in your driver. Device interrupts are asynchronous events that can cause concurrent execution of your code. The kernel also provides various mechanisms for delayed code execution, such as workqueues, tasklets, and timers, which can cause your code to run at any time in ways unrelated to what the current process is doing. In the modern, hot-pluggable world, your device could simply disappear while you are in the middle of working with it.
Avoidance of race conditions can be an intimidating task. In a world where anything can happen at any time, how does a driver programmer avoid the creation of absolute chaos? As it turns out, most race conditions can be avoided through some thought, the kernel's concurrency control primitives, and the application of a few basic principles. We'll start with the principles first, then get into the specifics of how to apply them.
Race conditions come about as a result of shared access to resources. When two threads of execution[1] have a reason to work with the same data structures (or hardware resources), the potential for mixups always exists. So the first rule of thumb to keep in mind as you design your driver is to avoid shared resources whenever possible. If there is no concurrent access, there can be no race conditions. So carefully written kernel code should have a minimum of sharing. The most obvious application of this idea is to avoid the use of global variables. If you put a resource in a place where more than one thread of execution can find it, there should be a strong reason for doing so.
The fact of the matter is, however, that such sharing is often required. Hardware resources are, by their nature, shared, and software resources also must often be available to more than one thread. Bear in mind as well that global variables are far from the only way to share data; any time your code passes a pointer to some other part of the kernel, it is potentially creating a new sharing situation. Sharing is a fact of life.
Here is the hard rule of resource sharing: any time that a hardware or software resource is shared beyond a single thread of execution, and the possibility exists that one thread could encounter an inconsistent view of that resource, you must explicitly manage access to that resource. In the scull example above, process B's view of the situation is inconsistent; unaware that process A has already allocated memory for the (shared) device, it performs its own allocation and overwrites A's work. In this case, we must control access to the scull data structure. We need to arrange things so that the code either sees memory that has been allocated or knows that no memory has been or will be allocated by anybody else. The usual technique for access management is called locking or mutual exclusion—making sure that only one thread of execution can manipulate a shared resource at any time. Much of the rest of this chapter will be devoted to locking.
First, however, we must briefly consider one other important rule. When kernel code creates an object that will be shared with any other part of the kernel, that object must continue to exist (and function properly) until it is known that no outside references to it exist. The instant that scull makes its devices available, it must be prepared to handle requests on those devices. And scull must continue to be able to handle requests on its devices until it knows that no reference (such as open user-space files) to those devices exists. Two requirements come out of this rule: no object can be made available to the kernel until it is in a state where it can function properly, and references to such objects must be tracked. In most cases, you'll find that the kernel handles reference counting for you, but there are always exceptions.
Following the above rules requires planning and careful attention to detail. It is easy to be surprised by concurrent access to resources you hadn't realized were shared. With some effort, however, most race conditions can be headed off before they bite you—or your users.