Chapter 5. Concurrency and Race Conditions
Thus far, we have paid little attention to the problem of concurrency, that is, what happens when the system tries to do more than one thing at once. The management of concurrency is, however, one of the core problems in operating systems programming. Concurrency-related bugs are some of the easiest to create and some of the hardest to find. Even expert Linux kernel programmers end up creating concurrency-related bugs on occasion.
In early Linux kernels, there were relatively few sources of concurrency. Symmetric multiprocessing (SMP) systems were not supported by the kernel, and the only cause of concurrent execution was the servicing of hardware interrupts. That approach offers simplicity, but it no longer works in a world that prizes performance on systems with more and more processors, and that insists that the system respond to events quickly. In response to the demands of modern hardware and applications, the Linux kernel has evolved to a point where many more things are going on simultaneously. This evolution has resulted in far greater performance and scalability. It has also, however, significantly complicated the task of kernel programming. Device driver programmers must now factor concurrency into their designs from the beginning, and they must have a strong understanding of the facilities provided by the kernel for concurrency management.
The purpose of this chapter is to begin building that understanding. To that end, we introduce facilities that are immediately applied to the scull driver from Chapter 3. Other facilities presented here are not put to use for some time yet. But first, we take a look at what could go wrong with our simple scull driver and how to avoid these potential problems.
Pitfalls in scull
Let us take a quick look at a fragment of the scull memory management code. Deep down inside the write logic, scull must decide whether the memory it requires has been allocated yet or not. One piece of the code that handles this task is:
    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos])
            goto out;
    }
Suppose for a moment that two processes (we'll call them "A" and "B") are independently attempting to write to the same offset within the same scull device. Each process reaches the if test in the first line of the fragment above at the same time. If the pointer in question is NULL, each process will decide to allocate memory, and each will assign the resulting pointer to dptr->data[s_pos]. Since both processes are assigning to the same location, clearly only one of the assignments will prevail.
What will happen, of course, is that the process that completes the assignment second will "win." If process A assigns first, its assignment will be overwritten by process B. At that point, scull will forget entirely about the memory that A allocated; it only has a pointer to B's memory. The memory allocated by A, thus, will be dropped and never returned to the system.
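The interleaving just described can be replayed by hand in ordinary user-space C. What follows is only a rough sketch under stated assumptions, not scull code: struct slot and replay_race() are hypothetical names invented for this illustration, malloc() stands in for kmalloc(), and the two "processes" are simulated sequentially in a single thread so that the losing order of events is forced rather than left to chance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one dptr->data[s_pos] entry. */
struct slot {
    void *data;
};

/*
 * Replay the bad interleaving deterministically: A and B both perform
 * the NULL test before either one stores its pointer. Returns 1 if an
 * allocation became unreachable from the slot, 0 otherwise.
 */
int replay_race(struct slot *s, size_t quantum)
{
    void *a_mem = NULL;
    void *b_mem = NULL;
    int a_sees_null, b_sees_null;
    int leaked = 0;

    assert(s != NULL);

    /* Step 1: both processes test the pointer; both see NULL. */
    a_sees_null = (s->data == NULL);
    b_sees_null = (s->data == NULL);

    /* Step 2: A allocates and stores its pointer. */
    if (a_sees_null) {
        a_mem = malloc(quantum);   /* malloc stands in for kmalloc */
        s->data = a_mem;
    }

    /* Step 3: B allocates and stores, overwriting A's pointer. */
    if (b_sees_null) {
        b_mem = malloc(quantum);
        s->data = b_mem;
    }

    /* A's memory is lost if the slot no longer points at it. */
    if (a_mem != NULL && s->data != a_mem) {
        leaked = 1;
        /* Only this simulation can free it: it kept a private copy
         * of the pointer. The slot itself has forgotten it. */
        free(a_mem);
    }
    return leaked;
}
```

Calling replay_race() on a slot whose data field starts out NULL reports a leak whenever A's allocation succeeds, because the slot ends up holding only B's pointer. In the real driver nobody keeps a private copy of A's pointer, so that memory cannot be freed.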
This sequence of events is a demonstration of a race condition. Race conditions are a result of uncontrolled access to shared data. When the wrong access pattern happens, something unexpected results. For the race condition discussed here, the result is a memory leak. That is bad enough, but race conditions can often lead to system crashes, corrupted data, or security problems as well. Programmers can be tempted to disregard race conditions as extremely low probability events. But, in the computing world, one-in-a-million events can happen every few seconds, and the consequences can be grave.
We will eliminate race conditions from scull shortly, but first we need to take a more general view of concurrency.