http://zeromq.org/blog:multithreading-magic
The multicore dilemma: there is still no good software design method or tooling for multicore programming, and that is where ØMQ comes in.
In this article Pieter Hintjens and Martin Sustrik examine the difficulties of building concurrent (multithreaded) applications and what this means for enterprise computing. The authors argue that a lack of good tools for software designers means that neither chip vendors nor large businesses will be able to fully benefit from more than 16 cores per CPU, let alone 64 or more. They then examine an ideal solution, and explain how the ØMQ framework for concurrent software design is becoming this ideal solution. Finally they explain ØMQ's origins, and the team behind it.
C and C++ offer no concurrency support at all, so programmers roll their own. Java, Python and the like do support it, but not elegantly. As for Ruby, don't even start: it has dozens of interpreters, each with its own threading story. And Erlang, that language is something else!
The most widely used languages, C and C++, do not offer any support for concurrency. Programmers roll their own by using threading APIs. Languages that do support concurrency, such as Java, Python, .NET, and Ruby, do it in a brute-force manner. Depending on the implementation - there are over a dozen Ruby interpreters, for example - they may offer "green threading" or true multithreading. Neither approach scales, due to reliance on locks. It is left to exotic languages like Erlang to do it right. We'll see what this means in more detail later.
Ha, all one can say is that Erlang is just too hard.
I could end this article by telling everyone to just switch to Erlang but that's not a realistic answer. Like many clever tools, it works only for very clever developers. But most developers - including those who more and more need to produce multithreaded applications - are just average.
A technical article from Microsoft titled "Solving 11 Likely Problems In Your Multithreaded Code" demonstrates how painful the state of the art is. The article covers .Net programming but these same problems affect all developers building multithreaded applications in conventional languages.
The article says, "correctly engineered concurrent code must live by an extra set of rules when compared to its sequential counterpart." This is putting it mildly. The developer enters a minefield of processor code reordering, data atomicity, and worse. Let's look at what those "extra rules" are. Note the rafts of new terminology the poor developer has to learn.
- "Forgotten synchronization". When multiple threads access shared data, they step on each others' work. This causes "race conditions": bizarre loops, freezes, and data corruption bugs. These effects are timing and load dependent, so non-deterministic and hard to reproduce and debug. The developer must use locks and semaphores, place code into critical sections, and so on, so that shared data is safely read or written by only one thread at a time.
- "Incorrect granularity". Having put all the dangerous code into critical sections, the developer can still get it wrong, easily. Those critical sections can be too large and they cause other threads to run too slowly. Or they can be too small, and fail to protect the shared data properly.
- "Read and write tearing". Reading and writing a 32-bit or 64-bit value may often be atomic. But not always! The developer could put locks or critical sections around everything. But then the application will be slow. So to write efficient code, he must learn the system's memory model, and how the compiler uses it.
- "Lock-free reordering". Our multithreaded developer is getting more skilled and confident and finds ways to reduce the number of locks in his code. This is called "lock-free" or "low-lock" code. But behind his back, the compiler and the CPU are free to reorder code to make things run faster. That means that instructions do not necessarily happen in a consistent order. Working code may randomly break. The solution is to add "memory barriers" and to learn a lot more about how the processor manages its memory.
- "Lock convoys". Too many threads may ask for a lock on the same data and the entire application grinds to a halt. Using locks is, we discover as we painfully place thousands of them into our code, inherently unscalable. Just when we need things to work properly, they do not. There's no real solution to this except to try to reduce lock times and re-organize the code to reduce "hot locks" (i.e. real conflicts over shared data).
- "Two-step dance". In which threads bounce between waking and waiting, not doing any work. This just happens in some cases, due to signalling models, and luckily for our developer, has no workarounds. When the application runs too slowly, he can tell his boss, "I think it's doing an N-step dance," and shrug.
- "Priority inversion". In which tweaking a thread's priority can cause a lower-priority thread to block a higher-priority thread. As the Microsoft article says, "the moral of this story is to simply avoid changing thread priorities wherever possible."
This list does not cover the hidden costs of locking: context switches and cache invalidation that ruin performance.
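To make the "forgotten synchronization" problem concrete, here is a minimal sketch in Python (the names are illustrative, not from the article): the read-modify-write of a shared counter is only correct because the lock turns it into a critical section. Remove the lock and updates can interleave and be lost, exactly the non-deterministic breakage described above.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    # The lock makes the read-modify-write of the shared counter
    # a critical section: only one thread touches it at a time.
    global counter
    for _ in range(n):
        with lock:  # remove this line and updates can silently be lost
            counter += 1

threads = [threading.Thread(target=add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 -- deterministic only because of the lock
```

Note that this is the "fixed" version; the bug it guards against is timing-dependent, which is precisely why it is so hard to reproduce and debug.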
Learning Erlang suddenly seems like a good idea.
Towards an Ideal Solution
Of all the approaches taken to multithreading, only one is known to work properly. That means, it scales to any number of cores, avoids all locks, costs little more than conventional single-threaded programming, is easy to learn, and does not crash in strange ways. At least no more strangely than a normal single-threaded program.
Ulf Wiger summarizes the key to Erlang's concurrency thus:
- "Fast process creation/destruction
- Ability to support ≫ 10,000 concurrent processes with largely unchanged characteristics.
- Fast asynchronous message passing.
- Copying message-passing semantics (share-nothing concurrency).
- Process monitoring.
- Selective message reception."
The key is to pass information as messages rather than shared state. To build an ideal solution is fairly delicate but it's more a matter of perspective than anything else. We need the ability to deliver messages to queues, each queue feeding one process. And we need to do this without locks. And we need to do this locally, between cores, or remotely, between boxes. And we need to make this work for any programming language. Erlang is great in theory but in practice, we need something that works for Java, C, C++, .Net, even Cobol. And which connects them all together.
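That model, one queue feeding one process, with no shared state, can be roughly sketched in plain Python, using `queue.Queue` as a stand-in for ØMQ's message delivery (purely for illustration): the worker thread owns all of its state, and the only things that cross thread boundaries are copies of messages.

```python
import queue
import threading

def worker(inbox, outbox):
    # All state is private to this thread; it only ever sees
    # messages taken from its own inbox queue.
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel: shut down
            break
        outbox.put(msg * 2)      # do some work, pass the result on

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

for i in range(3):
    inbox.put(i)
inbox.put(None)
t.join()

results = [outbox.get() for _ in range(3)]
print(results)  # [0, 2, 4]
```

The application code holds no locks; the only synchronization lives inside the queue implementation, hidden from the developer.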
If done right, we eliminate the problems of the traditional approach and gain some extra advantages:
- Our code is thread-naive. All data is private. Without exception, all of the "likely problems of multithreading code" disappear: no critical sections, no locks, no semaphores, no race conditions. No lock convoys, no 3am nightmares about optimal granularity, no two-step dances.
- While it takes some care to break an application into tasks that each run as one thread, it then becomes trivial to scale an application. Just create more instances of a thread. You can run any number of instances, with no synchronization (and thus no scaling) issues.
- The application never blocks. Every thread runs asynchronously and independently. A thread has no timing dependencies on other threads. When you send a message to a working thread, the thread's queue holds the message until the thread is ready for more work.
- Since the application holds no locks at all, threads run at full native speed, making full use of CPU caches, instruction reordering, compiler optimization, and so on. 100% of CPU effort goes to real work, no matter how many threads are active.
- This functionality is packaged in a reusable and portable way so that programmers in any language, on any platform, can benefit from it.
If we can do this, what do we get? The answer is nothing less than: perfect scaling to any number of cores, across any number of boxes. Further, no extra cost over normal single-threaded code. This seems too good to be true.
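The "just create more instances of a thread" scaling claim can be sketched by fanning work out over several identical share-nothing workers (the names are illustrative; in real ØMQ a push/pull socket pair would do this distribution for you):

```python
import queue
import threading

def worker(inbox, results):
    # Every worker instance is identical and shares no state with
    # the others; adding more of them is the whole scaling strategy.
    while True:
        item = inbox.get()
        if item is None:         # sentinel: shut down
            break
        results.put(item * item)

NUM_WORKERS = 4
inboxes = [queue.Queue() for _ in range(NUM_WORKERS)]
results = queue.Queue()
threads = [threading.Thread(target=worker, args=(q, results))
           for q in inboxes]
for t in threads:
    t.start()

# Round-robin distribution of 100 jobs across the worker inboxes.
for i in range(100):
    inboxes[i % NUM_WORKERS].put(i)
for q in inboxes:
    q.put(None)
for t in threads:
    t.join()

total = sum(results.get() for _ in range(100))
print(total)  # sum of squares 0..99 = 328350
```

Going from 4 workers to 40 changes one constant; none of the worker code, and no locking discipline, has to change.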
ØMQ does concurrent programming through copied messages rather than shared state, built on a socket-style API. The idea is that each single-threaded instance is bound to one core; cores share nothing and communicate only by messages, so there is essentially no context-switching overhead.
Also, ØMQ provides ready-made messaging patterns like topic pub-sub, workload distribution (push/pull), and request-response.
The ØMQ (ZeroMQ) Framework
ØMQ started life in 2007 as an iMatix project to build a low-latency version of our OpenAMQ messaging product, with Cisco and Intel as partners. From the start, ØMQ was focussed on getting the best possible performance out of hardware. It was clear from the start that doing multithreading "right" was the key to this.
We wrote then in a technical white paper that:
"Single threaded processing is dramatically faster when compared to multi-threaded processing, because it involves no context switching and synchronisation/locking. To take advantage of multi-core boxes, we should run one single-threaded instance of an AMQP implementation on each processor core. Individual instances are tightly bound to the particular core, thus running with almost no context switches."
ØMQ is special, and popular, for several reasons:
- It's fully open source and is supported by a large active community. There are over 50 named contributors to the code, the bulk of them from outside iMatix.
- It has developed an ultra-simple API based on BSD sockets. This API is familiar, easy to learn, and conceptually identical no matter what the language.
- It implements real messaging patterns like topic pub-sub, workload distribution, and request-response. This means ØMQ can solve real-life use cases for connecting applications.
- It seems to work with every conceivable programming language, operating system, and hardware. This means ØMQ connects entire applications as well as the pieces of applications.
- It provides a single consistent model for all language APIs. This means that investment in learning ØMQ is rapidly portable to other projects.
- It is licensed as LGPL code. This makes it usable, with no licensing issues, in closed-source as well as free and open source applications. And those who improve ØMQ automatically become contributors, as they publish their work.
- It is designed as a library that we link with our applications. This means there are no brokers to start and manage, and fewer moving pieces means less to break and go wrong.
- It is above all simple to learn and use. The learning curve for ØMQ is roughly one hour.
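The BSD-sockets-style API looks like this in practice. Below is a minimal request-reply sketch using the pyzmq binding (this assumes pyzmq is installed; the `inproc://demo` endpoint name is arbitrary). Two threads share nothing but the messages they exchange:

```python
import threading
import zmq  # pip install pyzmq

def responder(ctx, ready):
    sock = ctx.socket(zmq.REP)     # the reply side of request-reply
    sock.bind("inproc://demo")
    ready.set()                    # signal that the endpoint exists
    msg = sock.recv()              # wait for a request
    sock.send(b"echo:" + msg)      # send the reply
    sock.close()

ctx = zmq.Context()
ready = threading.Event()
t = threading.Thread(target=responder, args=(ctx, ready))
t.start()
ready.wait()                       # don't connect before bind completes

req = ctx.socket(zmq.REQ)          # the request side
req.connect("inproc://demo")
req.send(b"hello")
reply = req.recv()
print(reply)  # b'echo:hello'

req.close()
t.join()
ctx.term()
```

The same code pattern works unchanged if `inproc://` is swapped for `tcp://`, which is the sense in which ØMQ scales "locally, between cores, or remotely, between boxes."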
And it has odd uses thanks to its tiny CPU footprint. As Erich Heine writes, "the [ØMQ] perf tests are the only way we have found yet which reliably fills a network pipe without also making cpu usage go to 100%".
Most ØMQ users come for the messaging and stay for the easy multithreading. No matter whether their language has multithreading support or not, they get perfect scaling to any number of cores, or boxes. Even in Cobol.
One goal for ØMQ is to get these "sockets on steroids" integrated into the Linux kernel itself. This would mean that ØMQ disappears as a separate technology. The developer sets a socket option and the socket becomes a message publisher or consumer, and the code becomes multithreaded, with no additional work.