Original article: http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast_22.html
We mention the phrase Mechanical Sympathy quite a lot, in fact it's even Martin's blog title. It's about understanding how the underlying hardware operates and programming in a way that works with that, not against it.
We get a number of comments and questions about the mysterious cache line padding in the RingBuffer, and I referred to it in the last post. Since this lends itself to pretty pictures, it's the next thing I thought I would tackle.
Comp Sci 101
One of the things I love about working at LMAX is all that stuff I learnt at university and in my A Level Computing actually means something. So often as a developer you can get away with not understanding the CPU, data structures or Big O notation - I spent 10 years of my career forgetting all that. But it turns out that if you do know about these things, and you apply that knowledge, you can come up with some very clever, very fast code.
So, a refresher for those of us who studied this at school, and an intro for those who didn't. Beware - this post contains massive over-simplifications.
The CPU is the heart of your machine and the thing that ultimately has to do all the operations, executing your program. Main memory (RAM) is where your data (including the lines of your program) lives. We're going to ignore stuff like hard drives and networks here because the Disruptor is aimed at running as much as possible in memory.
The CPU has several layers of cache between it and main memory, because even accessing main memory is too slow. If you're doing the same operation on a piece of data multiple times, it makes sense to load this into a place very close to the CPU when it's performing the operation (think a loop counter - you don't want to be going off to main memory to fetch this to increment it every time you loop around).
The closer the cache is to the CPU, the faster it is and the smaller it is. L1 cache is small and very fast, and right next to the core that uses it. L2 is bigger and slower, and still only used by a single core. L3 is more common with modern multi-core machines, and is bigger again, slower again, and shared across cores on a single socket. Finally you have main memory, which is shared across all cores and all sockets.
When the CPU is performing an operation, it's first going to look in L1 for the data it needs, then L2, then L3, and finally if it's not in any of the caches the data needs to be fetched all the way from main memory. The further it has to go, the longer the operation will take. So if you're doing something very frequently, you want to make sure that data is in L1 cache.
Martin and Mike's QCon presentation gives some indicative figures for the cost of cache misses:
| Latency from CPU to... | Approx. number of CPU cycles | Approx. time |
|---|---|---|
| Main memory | | ~60-80 ns |
| QPI transit (between sockets, not drawn) | | ~20 ns |
| L3 cache | ~40-45 cycles | ~15 ns |
| L2 cache | ~10 cycles | ~3 ns |
| L1 cache | ~3-4 cycles | ~1 ns |
| Register | 1 cycle | |
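The effect of those latencies is easy to observe for yourself. Walking a large 2D array row by row touches consecutive longs, so each 64-byte cache line serves eight reads; walking it column by column jumps a whole row per read, so almost every access misses. The sketch below (my own illustration, not from the original post; the array size and timing approach are arbitrary choices) computes the same sum both ways:

```java
import java.util.Arrays;

public class CacheWalk {
    static long sumRowMajor(long[][] grid) {
        long sum = 0;
        for (int i = 0; i < grid.length; i++)
            for (int j = 0; j < grid[i].length; j++)
                sum += grid[i][j];   // consecutive longs: one cache line serves 8 reads
        return sum;
    }

    static long sumColMajor(long[][] grid) {
        long sum = 0;
        for (int j = 0; j < grid[0].length; j++)
            for (int i = 0; i < grid.length; i++)
                sum += grid[i][j];   // strides a whole row per read: frequent cache misses
        return sum;
    }

    public static void main(String[] args) {
        int n = 2048;                       // 2048 x 2048 longs = 32 MB, larger than a typical L3
        long[][] grid = new long[n][n];
        for (long[] row : grid) Arrays.fill(row, 1L);

        long t0 = System.nanoTime();
        long rowSum = sumRowMajor(grid);
        long rowNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        long colSum = sumColMajor(grid);
        long colNs = System.nanoTime() - t0;

        // Same arithmetic, same result - only the cache behaviour differs.
        System.out.println("row-major: " + rowNs + " ns, column-major: " + colNs
                + " ns, sums equal: " + (rowSum == colSum));
    }
}
```

On most hardware the column-major walk is several times slower, even though both loops do identical arithmetic.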
Now the interesting thing to note is that data doesn't live in the cache as individual items - i.e. it's not a single variable or a single pointer. The cache is made up of cache lines, typically 64 bytes, and each one effectively references a block of addresses in main memory. A Java long is 8 bytes, so in a single cache line you could have 8 long variables.
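As a quick sanity check on that arithmetic (64 bytes is a typical line size on x86 hardware, not something the Java platform guarantees or lets you query directly):

```java
public class CacheLineMath {
    public static void main(String[] args) {
        int cacheLineBytes = 64;                        // typical on x86; an assumption, not queried from hardware
        int longsPerLine = cacheLineBytes / Long.BYTES; // Long.BYTES == 8
        System.out.println(longsPerLine + " longs fit in one "
                + cacheLineBytes + "-byte cache line");
        // → 8 longs fit in one 64-byte cache line
    }
}
```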
However, all this free loading comes with a catch. Imagine your long isn't part of an array. Imagine it's just a single variable. Let's call it head, for no real reason. Then imagine you have another variable in your class right next to it. Let's arbitrarily call it tail. Now, when you load head into your cache, you get tail for free.
Which sounds fine. Until you realise that tail is being written to by your producer, and head is being written to by your consumer. These two variables aren't actually closely associated, and in fact are going to be used by two different threads that might be running on two different cores.
Imagine your consumer updates the value of head. The cache value is updated, the value in memory is updated, and any other cache lines that contain head are invalidated, because other caches will not have the shiny new value. And remember that we deal at the level of the whole line - we can't just mark head as being invalid.
Now, if some process running on the other core just wants to read the value of tail, the whole cache line needs to be re-read from main memory. So a thread which is nothing to do with your consumer is reading a value which is nothing to do with head, and it's slowed down by a cache miss.
This is called false sharing: every time you access head you get tail too, and every time you access tail, you get head as well. All this is happening under the covers, and no compiler warning is going to tell you that you just wrote code that's going to be very inefficient for concurrent access.
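A small experiment makes the cost visible (my own illustration, not LMAX code; note that the JVM specification doesn't guarantee field layout, so the padding trick is a pragmatic bet on how HotSpot lays objects out). Two threads each hammer their own counter: first with the counters adjacent in one object, then with seven longs of padding between them.

```java
public class FalseSharingDemo {
    static final long ITERATIONS = 20_000_000L;

    // head and tail sit next to each other, so they almost certainly share a cache line.
    static class Unpadded {
        volatile long head;
        volatile long tail;
    }

    // Seven longs of padding push tail onto a different 64-byte line from head.
    static class Padded {
        volatile long head;
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long tail;
    }

    static long race(Runnable a, Runnable b) throws InterruptedException {
        Thread t1 = new Thread(a), t2 = new Thread(b);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        Unpadded u = new Unpadded();
        long sharedNs = race(
            () -> { for (long i = 0; i < ITERATIONS; i++) u.head++; },
            () -> { for (long i = 0; i < ITERATIONS; i++) u.tail++; });

        Padded p = new Padded();
        long paddedNs = race(
            () -> { for (long i = 0; i < ITERATIONS; i++) p.head++; },
            () -> { for (long i = 0; i < ITERATIONS; i++) p.tail++; });

        System.out.println("unpadded: " + sharedNs / 1_000_000 + " ms, padded: "
                + paddedNs / 1_000_000 + " ms");
    }
}
```

Each thread writes only its own variable, yet on most multi-core machines the unpadded version runs markedly slower, because every write invalidates the other core's copy of the shared cache line.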
You'll see that the Disruptor eliminates this problem, at least for architectures with a cache line size of 64 bytes or less, by adding padding to ensure the ring buffer's sequence number is never in the same cache line as anything else.
public long p1, p2, p3, p4, p5, p6, p7;  // cache line padding
private volatile long cursor = INITIAL_CURSOR_VALUE;
public long p8, p9, p10, p11, p12, p13, p14;  // cache line padding
The same applies to the Entry classes too - if you have different consumers writing to different fields, you're going to need to make sure there's no false sharing between each of the fields.
EDIT: Martin has written a technically more accurate and detailed post about false sharing, and has posted performance results too.