Everyone knows OpenVPN has no multiprocessing support; I realize I've been a bit repetitive, going on about it every day. So why doesn't it? Let's look at how OpenVPN's author, James Yonan (JY for short), a heavyweight who has long since transcended mere code, explains it.
It's all in the mailing list. Someone asked why OpenVPN doesn't implement multithreading, and even backed the question with actual test data. JY answered:
OpenVPN 2.0 has no multithreading support, this is the only feature present in
1.x which has been removed from 2.0.
Fine, so it's stated explicitly: the OpenVPN 2.0 era does not support multithreading. In the earlier 1.x era multithreading did exist, but it was not used for data transfer, that is, not for the data channel. Note that the discussion here is limited to the CPU cost of processing, and it matches what I had thought: in the 1.x era, since OpenVPN merely establishes an encrypted tunnel, CPU is consumed only when there is data in the tunnel, yet when data will arrive is unknowable, so leaning on the kernel's scheduling machinery is unwise (the kernel's task scheduling is built on a series of predictions). The only place where the CPU cost can be quantified is the TLS handshake on the control channel (the same holds for the non-SSL cases: pre-shared keys and username/password verification are simply weaker than SSL). So OpenVPN used the extra thread only for that negotiation phase; during data transfer, OpenVPN runs a single thread and implements its own packet scheduling mechanism internally.
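To make "one thread plus its own packet scheduling" concrete, here is a minimal sketch of the idea; it is my illustration under assumptions (Linux, a TUN device, hypothetical encrypt()/decrypt() stubs), not OpenVPN's actual code. One thread blocks in select() across the TUN fd and the UDP socket, so CPU is spent only when a packet is actually ready on one side or the other:

```c
/* Sketch of a single-threaded tunnel event loop (not OpenVPN source). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/if.h>
#include <linux/if_tun.h>

static int tun_open(const char *name)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0) { perror("open /dev/net/tun"); exit(1); }
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TUN | IFF_NO_PI;   /* raw IP packets, no extra header */
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) { perror("TUNSETIFF"); exit(1); }
    return fd;
}

int main(void)
{
    int tun  = tun_open("tun0");
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in local = {0}, peer;
    socklen_t plen = 0;                    /* 0 means "peer not learned yet" */
    local.sin_family = AF_INET;
    local.sin_port   = htons(1194);        /* OpenVPN's default port */
    if (bind(sock, (struct sockaddr *)&local, sizeof(local)) < 0) {
        perror("bind"); exit(1);
    }

    unsigned char buf[2048];
    for (;;) {
        fd_set rd;
        FD_ZERO(&rd);
        FD_SET(tun, &rd);
        FD_SET(sock, &rd);
        int maxfd = tun > sock ? tun : sock;
        /* Block until either side has data: zero CPU while the tunnel is idle. */
        if (select(maxfd + 1, &rd, NULL, NULL, NULL) < 0) break;
        if (FD_ISSET(sock, &rd)) {
            plen = sizeof(peer);
            ssize_t n = recvfrom(sock, buf, sizeof(buf), 0,
                                 (struct sockaddr *)&peer, &plen);
            if (n > 0) {
                /* decrypt(buf, n); -- hypothetical cipher stub */
                write(tun, buf, n);
            }
        }
        if (FD_ISSET(tun, &rd) && plen > 0) {
            ssize_t n = read(tun, buf, sizeof(buf));
            if (n > 0) {
                /* encrypt(buf, n); -- hypothetical cipher stub */
                sendto(sock, buf, n, 0, (struct sockaddr *)&peer, plen);
            }
        }
    }
    return 0;
}
```

While idle the thread sleeps in the kernel; when traffic bursts, the loop services whichever descriptor is ready. That is the single-threaded "packet schedule" spirit described above.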
And note: do not assume that OpenVPN's lack of multithreading makes it bad (that was my earlier misconception; as for others, they either flamed me or simply didn't care about the issue). In fact I was won over: single-threaded OpenVPN keeps its one and only thread's resource utilization remarkably high, and the key is precisely its own packet scheduling mechanism. In the OpenVPN 2.0 era, even the separate thread for the control channel negotiation was dropped. JY's reasoning went like this:
The original rationale for having the TLS thread optimization was to improve
latency during the TLS key negotiation which is very CPU intensive.  The 1.x
pthread implementation uses pthreads only for this very special case, which
does not improve overall efficiency on multiprocessor machines, but helps to
keep tunnel-forwarding latency down during the TLS negotiation.

I did some testing on 2.0 to determine the worst-case latency caused by the
TLS negotiation in single threaded mode.  On a 2GHz x86, the worst-case
latency was about 160 milliseconds for a 2048 bit key and 40 milliseconds for
a 1024 bit key.  Even with 100 users hitting a TLS renegotiate once per hour,
the probability that two or more of these 160 millisecond latency periods
would overlap to make a bigger latency is still quite small.

I think these latency numbers are too small to justify the extra level of
complexity entailed by multithreading.  Not to mention whole classes of
potential bugs which arise when you attempt to multithread code, and
incompatibilities that exist between multithread implementations on different
OSes.  Bottom line is that I don't think multithreading in OpenVPN is worth
the trouble.
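As a back-of-the-envelope check on JY's numbers (my own arithmetic, under the simplifying assumption that each client's renegotiation starts at a uniformly random moment within the hour): two 160 ms windows placed uniformly in an hour overlap with probability about 2d/T, and even summed over all C(100,2) = 4950 pairs, the expected number of overlaps per hour stays below one:

```c
#include <stdio.h>

int main(void)
{
    const double T = 3600.0;  /* window: one hour, in seconds */
    const double d = 0.160;   /* worst-case TLS negotiation latency, 2048-bit key */
    const int    n = 100;     /* clients, each renegotiating once per hour */

    /* Two intervals of length d with uniformly random starts in [0, T)
     * overlap with probability roughly 2*d/T (valid for d << T). */
    double p_pair = 2.0 * d / T;
    double pairs  = n * (n - 1) / 2.0;

    printf("pairwise overlap probability: %.5f%%\n", 100.0 * p_pair);
    printf("expected overlaps per hour (%d clients): %.2f\n", n, pairs * p_pair);
    return 0;
}
```

The pairwise probability comes out around 0.009%, which supports JY's point: pile-ups of these latency spikes are rare enough that a dedicated thread isn't warranted.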
The benefit doesn't cover the cost; it's that simple. I suspect that if OpenVPN were built only for Linux, things would look much better. Apparently good software is not about being the most efficient on one particular platform, but about being runnable on all platforms. So how did JY wrap the matter up? First, look at his reasoning:
Keep in mind that people use multithreading to:

(1) improve latency, or
(2) improve performance on multithreaded machines

OpenVPN 1.x only tried to hit (1).

With OpenVPN 2.0, my decision was basically that (1) didn't justify the
complexification that pthread support would entail and that (2) is satisfied
by different means.

So how do you improve performance on multithreaded machines, to take advantage
of all processors, i.e. if (1) is not worth the effort, then how to
accomplish (2)?
His thinking is superbly clear (or maybe I just found resonance). He simply refuses to treat the special cases, the SSL handshake that costs 100+ ms and then pays off for an hour, or the username/password verification, as the bottleneck; meanwhile, the cost of symmetric encryption on the data channel is a fixed quantity. So the whole key to improving efficiency becomes: how to improve transport performance. This view is even-handed and objective, and here is why I say so. Someone who focuses on the SSL protocol looks at SSL performance first, because that is where his optimization experience and skill lie, and that very preference may already have led him off course. Someone who focuses on networking keeps thinking about multithreading over multi-queue NICs, because that is the material he reads every day, and that preference is likewise not the right path. JY weighed both sides objectively: SSL, occupying only a small fraction of the runtime, is a special case that does not amount to a bottleneck and does not justify the added complexity of a dedicated thread; likewise, packet scheduling during transport falls outside OpenVPN's control, and multiprocessing is equally not OpenVPN's business to handle. So he gave his conclusion:
Answer:  Run multiple server mode daemons on different ports, and have the
client load balance between them by using multiple "remote" entries in the
client side config.  This is actually more efficient than multithreading
because each OpenVPN daemon gets its own private virtual memory address
space, so there is no bus contention from multiple processors over the same
address space, as would occur with a multi-threaded execution model.
Yes: let the outside handle it!
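For concreteness, here is a hypothetical sketch of what JY's Answer looks like in configuration terms (the hostname and the extra port number are made up; `remote` and `remote-random` are real OpenVPN client directives): run one independent server daemon per port, each with its own config, then list them all on the client.

```
# Server side: one daemon per port, each with its own config
# (and, in practice, its own tun device and address pool):
#   openvpn --config server-1194.conf    # contains: port 1194
#   openvpn --config server-1195.conf    # contains: port 1195

# Client side (client.conf fragment):
client
dev tun
proto udp
remote vpn.example.com 1194
remote vpn.example.com 1195
remote-random        # pick from the remote list at random
                     # instead of always trying it in order
```

Since each daemon is a separate process, every instance gets its own address space and is spread across CPUs by the ordinary OS scheduler, which is exactly the "more efficient than multithreading" argument in the quote.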
I think JY had this clarity from the very beginning, which is why he separated the data channel from the control channel. That separation makes the single-process, single-thread processing extremely compact, and it lets optimization of the special SSL path (forgive me, I am an SSL watcher too; unfortunately I watch both, not just SSL but data transport as well) proceed independently from optimization of data transport.
So stop doubting OpenVPN's efficiency. As a single-process, single-threaded program it is very compact, and within that one and only thread its packet scheduling algorithm is about as optimized as it gets. If you want to optimize it, heed JY's Answer, and keep an eye on my blog. I would not dare call JY a brother, but the facts show that our thinking runs along the same line.
My stubbornness lies in that I really did not want multiple OpenVPN instances listening on multiple ports, which is why I built a multithreaded version. But one glance at my multithreaded version shows that I changed nothing about packet forwarding; I merely shared the multi_instance list and the IP address pool across threads. All my work was peripheral. I did not modify OpenVPN's source, because I know it is already compact enough, so I only wrapped around the outside. I did tweak the protocol slightly, but that part is just a pile of garbage.
A perfect example of complexity yielding to simplicity.