http://www.linuxforum.net/forum/gshowflat.php?Cat=&Board=linuxK&Number=428239&page=5&view=collapsed&sb=5&o=all&fpart=
Note: "barrier" here refers to wmb, rmb, and mb.
I have never been able to find good material explaining the relationship between barriers and cache coherence. In <
I am puzzled by these lock-free operations implemented with barriers. Another example is the big read lock (brlock.c, brlock.h). I cannot find any rule or condition saying when a barrier must be used. There is also void init_bh(int nr, void (*routine)(void)) in softirq.c, which uses barrier as well.
It should be like this: a barrier forces the CPU to behave as if strongly ordered, while cache coherence is about keeping the copies of a memory location in multiple CPUs' caches consistent. The question is:
does cache coherence need barriers to work?
What techniques cause a CPU to reorder memory reads and writes? The write buffer is one. But while data sits in the write buffer, are the caches still coherent? (They should be. If so, then barriers and cache coherence are two entirely different things: cache coherence is completely transparent to software, whereas barriers require the software's participation.)
I think a barrier has to be considered in the following scenario: two CPUs operate on the same datum (writes need mutual exclusion), or they read and write two data items that are related in some way, for example when the value of one datum decides what is done with the other.
This description is not very satisfying, and it is pure deduction (though it seems reasonable: per-CPU data never needs barriers, since reordering does not affect a single CPU (fix me)).
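For instance (a minimal sketch; the variable and function names here are made up): one CPU publishes a datum and then sets a flag, and the other CPU checks the flag before reading the datum.

int data;    /* the payload */
int ready;   /* flag: payload is valid */

/* CPU 0 */
void publish(int v)
{
    data = v;
    wmb();     /* order the data store before the flag store */
    ready = 1;
}

/* CPU 1 */
int consume(void)
{
    if (ready) {
        rmb(); /* order the flag load before the data load */
        return data;
    }
    return -1; /* nothing published yet */
}

Without the wmb()/rmb() pair, the reader could see ready == 1 and still read a stale data.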
Also, when it comes to using barriers, some books say: "this makes the change immediately visible to all CPUs." That wording may be inappropriate; how should we understand it?
Why is there an mb() here?
My understanding is about the same as yours:
1. Cache coherence does not need barriers; it is guaranteed entirely by the hardware coherence protocol.
2. Other techniques that cause CPU reordering include load forwarding and load scheduling.
3. Memory-ordering problems only arise when at least two memory locations and at least two memory-accessing agents (CPUs or peripherals) are involved.
4. "Makes the change immediately visible to all CPUs" is inaccurate. Rather, the order in which two memory operations become visible to other CPUs must satisfy some requirement (release, acquire, or fence).
Reordering also affects a single CPU.
An "Advanced Computer Architecture" course I took last term gave a classic example.
On PowerPC, a store instruction immediately followed by a load can execute out of order. Without a memory barrier, the following program fails:
1 while ( TDRE == 0 );
2 TDR = char1;
3 asm("eieio"); // memory barrier
4 while ( TDRE == 0 );
5 TDR = char2;
Notes on the program: TDRE is the status register of some peripheral; 0 means the device is busy, 1 means it can accept and process one character, at which point the user may write the character into the device's TDR register. After the write, TDRE drops to 0; the device needs some time to process the character, after which TDRE goes back from 0 to 1.
Suppose line 3 is removed. The program waits at line 1 until TDRE is 1, then continues. But the store (TDR = char1;) is immediately followed by a load (while (TDRE == 0);), so the load in line 4 can execute before the store in line 2. The value loaded in line 4 is then necessarily 1, line 5 executes immediately, and the device receives the second character while it is still busy. That is clearly an error.
Line 3 must therefore be present, to make sure the store executes before the load.
This reordering problem with a peripheral (not an SMP CPU) is easy to understand; what is less intuitive is the impact on programming in an SMP environment.
If our understanding is correct, the next step is to look at the init_bh and ksoftirqd cases. But they do not seem very intuitive either.
I am also eager to learn about this.
Also, I cannot make sense of the interrupt chapter of LDD:
This code, though simple, represents the typical job of an interrupt handler. It, in
turn, calls short_incr_bp, which is defined as follows:
static inline void short_incr_bp(volatile unsigned long *index, int delta)
{
    unsigned long new = *index + delta;
    barrier(); /* Don't optimize these two together */
    *index = (new >= (short_buffer + PAGE_SIZE)) ? short_buffer : new;
}

Why is barrier() used here?
This function has been carefully written to wrap a pointer into the circular buffer
without ever exposing an incorrect value. By assigning only the final value and
placing a barrier to keep the compiler from optimizing things, it is possible to
manipulate the circular buffer pointers safely without locks.
Experts, please advise.
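For reference: in the kernel headers, barrier() is a pure compiler barrier, defined for gcc as

#define barrier() __asm__ __volatile__("" : : : "memory")

It emits no machine instruction at all; it only forbids gcc from moving or merging memory accesses across it. Here it keeps the computation of new and the final store to *index from being optimized together, so that, as the quoted text says, only the final (in-range) value is ever assigned to the circular-buffer pointer.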
In Linux, all atomic operations come with an mb, as well as a compiler barrier (to keep gcc from reordering instructions). On x86, operations such as test_and_set_bit are implemented with volatile, the lock prefix, and a "memory" clobber.
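It looks roughly like this (paraphrased from the 2.4-era include/asm-i386/bitops.h; details vary across kernel versions):

static inline int test_and_set_bit(int nr, volatile void *addr)
{
    int oldbit;

    __asm__ __volatile__(
        "lock; btsl %2,%1\n\t"  /* lock prefix: atomic, and a full mb on x86 */
        "sbbl %0,%0"            /* oldbit = CF ? -1 : 0, i.e. the old bit value */
        : "=r" (oldbit), "=m" (*(volatile long *)addr)
        : "Ir" (nr)
        : "memory");            /* "memory" clobber = compiler barrier */
    return oldbit;
}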
The reason is also obvious:

spin_lock(&lock);
/* some reads/writes */
...

Without the mb, some of those reads/writes could be executed before the spin_lock, which of course cannot be allowed.
Below is an article on implementing lock-free searching; it is very helpful for understanding barriers. Studying it is worth far more than translating it:
Data Dependencies and wmb()
Version 1.0
Goal
Support lock-free algorithms without inefficient and ugly read-side code.
Obstacle
Some CPUs do not support synchronous invalidation in hardware.
Example Code
Insertion into an unordered lock-free circular singly linked list,
while allowing concurrent searches.
Data Structures
The data structures used in all these examples are
a list element, a header element, and a lock.
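For instance (a sketch; struct el and list_lock are assumed names, while next, key, and data are the fields the text discusses):

struct el {
    struct el *next;
    long key;
    long data;
};
struct el head;        /* header element of the circular list */
spinlock_t list_lock;  /* serializes updates */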
Search and Insert Using Global Locking
The familiar globally locked implementations of search() and insert() are as follows:
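Something along these lines (a sketch using the structures above; the article's exact listing may differ):

struct el *search(long searchkey)
{
    struct el *p;

    spin_lock(&list_lock);
    for (p = head.next; p != &head; p = p->next) {
        if (p->key == searchkey) {
            spin_unlock(&list_lock);
            return p;
        }
    }
    spin_unlock(&list_lock);
    return NULL;  /* not found */
}

void insert(struct el *p, long key, long data)
{
    spin_lock(&list_lock);
    p->key = key;
    p->data = data;
    p->next = head.next;
    head.next = p;
    spin_unlock(&list_lock);
}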
These implementations are quite straightforward, but are subject to locking bottlenecks.
Search and Insert Using wmb() and rmb()
The existing wmb() and rmb() primitives can be used to do lock-free insertion. The
searching task will either see the new element or not, depending on the exact timing,
just like the locked case. In any case, we are guaranteed not to see an invalid pointer,
regardless of timing, again, just like the locked case. The problem is that wmb() is
guaranteed to enforce ordering only on the writing CPU --
the reading CPU must use rmb() to keep the ordering.
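For instance (again a sketch):

void insert(struct el *p, long key, long data)
{
    spin_lock(&list_lock);   /* writers still serialize among themselves */
    p->key = key;
    p->data = data;
    p->next = head.next;
    wmb();                   /* commit the element's fields before publishing it */
    head.next = p;
    spin_unlock(&list_lock);
}

struct el *search(long searchkey)
{
    struct el *p;

    p = head.next;
    while (p != &head) {
        rmb();               /* order the pointer fetch before the field reads */
        if (p->key == searchkey)
            return p;
        p = p->next;
    }
    return NULL;
}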
The rmb()s in search() cause unnecessary performance degradation on CPUs (such as the
i386, IA64, PPC, and SPARC) where data dependencies result in an implied rmb(). In
addition, code that traverses a chain of pointers would have to be broken up in order to
insert the needed rmb()s. For example:
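A sketch, with p, q, and d as scratch variables: instead of simply writing

    d = head.next->next->data;

the traversal has to be broken apart to make room for the rmb()s:

    p = head.next;
    rmb();
    q = p->next;
    rmb();
    d = q->data;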
One could imagine much uglier code expansion where there are more dereferences in a
single expression. The inefficiencies and code bloat could be avoided if there were a
primitive like wmb() that allowed read-side data dependencies to act as implicit rmb()
invocations.
Why Do You Need the rmb()?
Many CPUs have single instructions that cause other CPUs to see preceding stores before
subsequent stores, without the reading CPUs needing an explicit rmb() if a data dependency
forces the ordering.
However, some CPUs have no such instruction, the Alpha being a case in point. On these
CPUs, a wmb() only guarantees that the invalidate requests corresponding to the writes
will be emitted in order. The wmb() does not guarantee that the reading CPU will process
these invalidates in order.
For example, consider a CPU with a partitioned cache, as shown in the following diagram:
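Schematically (a rough sketch of the layout the next paragraph describes):

                 CPU
                  |
         +--------+--------+
         |                 |
    cache bank 0      cache bank 1
   (even-numbered    (odd-numbered
     cachelines)       cachelines)
         |                 |
         +--------+--------+
                  |
              memory bus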
Here, even-numbered cachelines are maintained in cache bank 0, and odd-numbered cache
lines are maintained in cache bank 1. Suppose that head was maintained in cache bank 0,
and that a newly allocated element was maintained in cache bank 1. The insert() code's
wmb() would guarantee that the invalidates corresponding to the writes to the next, key,
and data fields would appear on the bus before the write to head->next.
But suppose that the reading CPU's cache bank 1 was extremely busy, with lots of pending
invalidates and outstanding accesses, and that the reading CPU's cache bank 0 was idle.
The invalidation corresponding to head->next could then be processed before that of the
three fields. If search() were to be executing just at that time, it would pick up the
new value of head->next, but, since the invalidates corresponding to the three fields
had not yet been processed, it could pick up the old (garbage!) value corresponding to
these fields, possibly resulting in an oops or worse.
Placing an rmb() between the access to head->next and the three fields fixes this
problem. The rmb() forces all outstanding invalidates to be processed before any
subsequent reads are allowed to proceed. Since the invalidate corresponding to the three
fields arrived before that of head->next, this will guarantee that if the new value of
head->next was read, then the new value of the three fields will also be read.
No oopses (or worse).
However, all the rmb()s add complexity, are easy to leave out, and hurt performance of
all architectures. And if you forget a needed rmb(), you end up with very intermittent
and difficult-to-diagnose memory-corruption errors. Just what we don't need in Linux!
So, there is strong motivation for a way of eliminating the need for these rmb()s.
Solutions for Lock-Free Search and Insertion
Search and Insert Using wmbdd()
It would be much nicer (and faster, on many architectures) to have a primitive similar to
wmb(), but that allowed read-side data dependencies to substitute for an explicit rmb().
It is possible to do this (see patch). With such a primitive, the code looks as follows:
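For instance (a sketch, with wmbdd() used as the text describes):

void insert(struct el *p, long key, long data)
{
    spin_lock(&list_lock);
    p->key = key;
    p->data = data;
    p->next = head.next;
    wmbdd();               /* like wmb(), but readers' data dependencies order for free */
    head.next = p;
    spin_unlock(&list_lock);
}

struct el *search(long searchkey)
{
    struct el *p;

    for (p = head.next; p != &head; p = p->next)
        if (p->key == searchkey)  /* no rmb() needed anywhere */
            return p;
    return NULL;
}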
This code is much nicer: no rmb()s are required, searches proceed
fully in parallel with no locks or writes, and no intermittent data corruption.
Search and Insert Using read_barrier_depends()
Introduce a new primitive read_barrier_depends() that is defined to be an rmb() on
Alpha, and a nop on other architectures. This removes the read-side performance
problem for non-Alpha architectures, but still leaves the read-side
read_barrier_depends(). It is almost possible for the compiler to do a good job of
generating these (assuming that a "lockfree" gcc struct-field attribute is created
and used), but, unfortunately, the compiler cannot reliably tell when the relevant lock
is held. (If the lock is held, the read_barrier_depends() calls should not be generated.)
After discussions in lkml about this, it was decided that putting an explicit
read_barrier_depends() is the appropriate thing to do in the linux kernel. Linus also
suggested that the barrier names be made more explicit. With such a primitive,
the code looks as follows:
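For instance (a sketch; insert() stays as in the wmb() version):

struct el *search(long searchkey)
{
    struct el *p;

    p = head.next;
    while (p != &head) {
        read_barrier_depends();  /* rmb() on Alpha, a no-op elsewhere */
        if (p->key == searchkey)
            return p;
        p = p->next;
    }
    return NULL;
}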
A preliminary patch for this is barriers-2.5.7-1.patch. The future releases of this
patch can be found along with the RCU package here.
Other Approaches Considered
Just make wmb() work like wmbdd(), so that data dependencies act as implied rmb()s.
Although wmbdd()'s semantics are much more intuitive, there are a number of uses of
wmb() in Linux that do not require the stronger semantics of wmbdd(), and strengthening
the semantics would incur unnecessary overhead on many CPUs--or require many changes to
the code, and thus a much larger patch.
Just drop support for Alpha. After all, Compaq seems to be phasing it out, right? There
are nonetheless a number of Alphas out there running Linux, and Compaq (or perhaps HP)
will be manufacturing new Alphas for quite a few years to come. Microsoft would likely
focus quite a bit of negative publicity on Linux's dropping support for anything (never
mind that they currently support only two CPU architectures). And the code to make Alpha
work correctly is not all that complex, and it does not impact performance of other CPUs.
Besides, we cannot be 100% certain that there won't be some other CPU lacking a
synchronous invalidation instruction...