Dr. Dobb’s Blogger的Walter Bright曾写了一篇博文《Overlooked Essentials For Optimizing Code》,为我们总结了两个最容易被人忽略的基本代码优化技术。酷壳个人网站版主陈皓对本文进行了翻译,现转载于此,供大家学习。全文如下:
我编写程序至今有35年了,我做了很多关于程序执行速度方面优化的工(一个示例),我也看过其它人做的优化。我发现有两个最基本的优化技术总是被人所忽略。
注意,这两个技术并不是避免时机不成熟的优化。并不是把冒泡排序变成快速排序(算法优化)。也不是语言或是编译器的优化。也不是把 i*4写成i<<2 的优化。
这两个技术是:
使用 一个profiler。
查看程序执行时的汇编码。
使用这两个技术的人将会成功地写出运行快的代码,不会使用这两个技术的人则不行。下面让我为你细细道来。
使用一个 Profiler
我们知道,程序运行时的90%的时间是用在了10%的代码上。我发现这并不准确。一次又一次地,我发现,几乎所有的程序会在1%的代码上花了99% 的运行时间。但是,是哪个1%?一个好的Profiler可以告诉你这个答案。就算我们需要使用100个小时在这1%的代码上进行优化,也比使用100个 小时在其它99%的代码上优化产生的效益要高得多得多。
问题是什么?人们不用profiler?不是。我工作过的一个地方使用了一个华丽而奢侈的Profiler,但是自从购买这个Profiler后, 它的包装3年来还是那么的暂新。为什么人们不用?我真的不知道。有一次,我和我的同事去了一个负载过大的交易所,我同事坚持说他知道哪里是瓶颈,毕竟,他是一个很有经验的专家。最终,我把我的Profiler在他的项目上运行了一下,我们发现那个瓶颈完全在一个意想不到的地方。
就像是赛车一样。团队是赢在传感器和日志上,这些东西提供了所有的一切。你可以调整一下赛车手的裤子以让其在比赛过程中更舒服,但是这不会让你赢得 比赛,也不会让你更有竞争力。如果你不知道你的速度上不去是因为引擎、排气装置、空体动力学、轮胎气压,或是赛车手,那么你将无法获胜。编程为什么会不同呢?只要没有测量,你就永远无法进步。
这个世界上有太多可以使用的Profiler了。随便找一个你就可以看到你的函数的调用层次,调用的次数,以前每条代码的时间分解表(甚至可以到汇编级)。我看过太多的程序员回避使用Profiler,而是把时间花在那些无用的,错误的方向上的“优化”,而被其竞争对手所羞辱。(译者陈皓注:使用Profiler时,重点需要关注:1)花时间多的函数以优化其算法,2)调用次数巨多的函数——如果一个函数每秒被调用300K次,你只需要优化出0.001毫秒,那也是相当大的优化。这就是作者所谓的1%的代码占用了99%的CPU时间)
查看汇编代码
几年前,我有一个同事,Mary Bailey,她在华盛顿大学教矫正代数(remedial algebra),有一次,她在黑板上写下:
x + 3 = 5
然后问他的学生“求解x”,然后学生们不知道答案。于是她写下:
__ + 3 = 5
然后,再问学生“填空”,所有的学生都可以回答了。未知数x就像是一个有魔法的字母让大家都在想“x意味着代数,而我没有学过代数,所以我就不知道这个怎么做”。
汇编程序就是编程世界的代数。如果某人问我“inline函数是否被编译器展开了?”或是问我“如果我写下i*4,编译器会把其优化为左移位操作吗?”。这个时候,我都会建议他们看看编译器的汇编码。这样的回答是不是很粗暴和无用?通常,在我这样回答了提问者后,提问都通常都会说,对不起,我不知道什么是汇编!甚至C++的专家都会这么回答。
汇编语言是最简单的编程语言了(就算是和C++相比也是这样的),如:
ADD ESI,x
就是(C风格的代码)
ESI += x;
而:
CALL foo
则是:
foo();
细节因为CPU的种类而不同,但这就是其如何工作的。有时候,我们甚至都不需要细节,只需要看看汇编码的长啥样,然后和源代码比一比,你就可以知道汇编代码很多很多了。
那么,这又如何帮助代码优化?举个例子,我几年前认识一个程序员认为他应该去发现一个新的更快的算法。他有一个benchmark来证明这个算法,并且其写了一篇非常漂亮的文章关于他的这个算法。但是,有人看了一下其原来算法以及新算法的汇编,发现了他的改进版本的算法允许其编译器把两个除法操作变 成了一个。这和算法真的没有什么关系。我们知道除法操作是一个很昂贵的操作,并且在其算法中,这俩个除法操作还在一个内嵌循环中,所以,他的改进版的算法当然要快一些。但,只需要在原来的算法上做一点点小的改动——使用一个除法操作,那么其原来的算法将会和新的一样快。而他的新发现什么也不是。
下一个例子,一个D用户张贴了一个 benchmark 来显示 dmd (Digital Mars D 编译器)在整型算法上的很糟糕,而ldc (LLVM D 编译器) 就好很多了。对于这样的结果,其相当的有意见。我迅速地看了一下汇编,发现两个编译器编译出来相当的一致,并没有什么明显的东西要对2:1这么大的不同而 负责。但是我们看到有一个对long型整数的除法,这个除法调用了运行库。而这个库成为消耗时间的杀手,其它所有的加减法都没有速度上的影响。出乎意料 地,benchmark 和算法代码生成一点关系也没有,完全就是long型整数的除法的问题。这暴露了在dmd的运行库中的long型除法的实现很差。修正后就可以提高速度。所 以,这和编译器没有什么关系,但是如果不看汇编,你将无法发现这一切。
查看汇编代码经常会给你一些意想不到的东西让你知道为什么程序的性能是那样。一些意想不到的函数调用,预料不到的自傲,以及不应该存在的东西,等等其实所有的一切。但也不需要成为一个汇编代码的黑客才能干的事。
结论
如果你觉得需要程序有更好的执行速度,那么,最基本的方法就是使用一个profiler和愿意去查看一下其汇编代码以找到程序的瓶颈。只有找到了程序的瓶颈,此时才是真正在思考如何去改进的时候,比如思考一个更好的算法,使用更快的语言优化,等等。
常规的做法是制胜法宝是挑选一个最佳的算法而不是进行微优化。虽然这种做法是无可异议的,但是有两件事情是学校没有教给你而需要你重点注意的。第一 个也是最重要的,如果你优化的算法没没有参与到你程序性能中的算法,那么你优化他只是在浪费时间和精力,并且还转移了你的注意力让你错过了应该要去优化的部分。第二点,算法的性能总和处理的数据密切相关的,就算是冒泡排序有那么多的笑柄,但是如果其处理的数据基本是排好序的,只有其中几个数据是未排序的, 那么冒泡排序也是所有排序算法里性能最好的。所以,担心没有使用好的算法而不去测量,只会浪费时间,无论是你的还是计算机的。
就好像赛车零件的订购速底是不会让你更靠进冠军(就算是你正确安装零件也不会),没有Profiler,你不会知道问题在哪里,不去看汇编,你可能知道问题所在,但你往往不知道为什么。
原文链接:Overlooked Essentials For Optimizing Code
译文链接:http://coolshell.cn/articles/2967.html
Sep 10, 2010
I've been programming for 35 years now, and I've done a lot of work optimizing programs for speed (an example), and watching others optimize. Two essential techniques are consistently ignored.
Nope, it isn't avoiding premature optimization. It isn't replacing bubble sort with quicksort (i.e. algorithmic improvements). It's not what language used, nor is it how good the compiler is. It isn't writing i<<2 instead of i*4.
It is:
The people who do those are successful in writing fast code, the ones who do not are not. Let me explain.
The old programming saw is that a program spends 90% of its time in 10% of the code. I've found that to not be true. Over and over, I've found that programs spends 99% of its time in 1% of the code. But which 1%? A profiler will tell you. Spending 100 hours of dev time on that 1% will yield real benefits, while 100 hours on the other 99% will not produce much of anything worthwhile.
What's the problem? Don't people use profilers? Nope. One place I worked at had a fancy expensive profiler that was still in its shrink wrap 3 years after purchase. Why don't people use profilers? I don't really know. I once got into a heated exchange with a colleague who insisted he knew where the bottlenecks were; after all, he was an experienced professional. I finally ran the profiler myself on his project, and of course the bottleneck was in a completely unexpected place.
Consider auto racing. The team that wins has sensors and logging on just about everything they can stick a sensor on. You can race using seat-of-the-pants tuning and have a jolly good time on the track, but you won't win and you won't even be competitive. You won't know if your poor speeds are caused by the engine, the exhaust, the aerodynamics, the tire pressure, or the driver. Why should programming be any different? You can't improve what you can't measure.
There are lots of profilers available. You can get ones that look at the hierarchy of function calls, function times, times broken down for each statement, and even at the instruction level. I've seen too many programmers eschew profilers, preferring instead to whittle away their time with useless and misdirected "optimizations" and getting trounced by their competitors.
Years ago, I had a colleague, Mary Bailey, who taught remedial algebra at the University of Washington. She told me once that when she wrote on the board:
x + 3 = 5
and asked her students to "solve for x", they couldn't answer. But, if she wrote:
__ + 3 = 5
and asked the students to "fill in the blank" all of them could do it. It seems that the magic word "x" seemed to cause them to reflexively think "x means algebra, I don't understand algebra, I can't do this."
Assembler is the algebra of the programming world. If someone asks me "was my function inlined by the compiler" or "if I write i*4, will the compiler optimize it to a left shift" I'll suggest they look at the asm output of the compiler. The reaction is how rude and unhelpful could I be? The person will follow up by saying he doesn't know assembler. Even C++ experts will say this.
Assembler is the simplest language (especially compared with C++!). For example,
ADD ESI,x
is (expressed in C style):
ESI += x;
and:
CALL foo
is:
foo();
Details vary among CPUs, but that's how it works. It's not even really necessary to know that. Just looking at the assembler output and comparing it to the source code will tell a LOT.
How does this help optimization? For example, I knew a programmer years ago who thought he'd discovered a new, faster algorithm to do X. I'm being deliberately vague to protect him. He had the benchmarks to prove it, and wrote a nice article about it. But then someone looked at the assembler output of the regular way, and his new fast way. It turns out that the way he'd written his improved version had allowed the compiler to replace two DIV instructions with one. This had really nothing to do with his algorithm. But DIV is an expensive instruction, and this was in the inner loop, and so his algorithm appeared to be faster. The regular implementation could also be recoded slightly to use only one DIV, too, and it would perform just as fast as the new algorithm. He had discovered nothing.
For my next example, a D user posted a benchmark showing that dmd (Digital Mars D compiler) was lousy at integer arithmetic, while ldc (LLVM D compiler) was much better. Being very concerned about such a result, I promptly looked at the assembler output. It was pretty much equivalent, nothing stood out as being accountable for a 2:1 difference. But there was a long divide in there, done with a call to a runtime library function. That function call completely dominated the timing results, all the adds and subtracts in the benchmark had no significant impact on the speed. Unexpectedly, the benchmark wasn't about arithmetic code generation at all, it was about long division only. It turns out that dmd's runtime library function had a crummy implementation of long division in it. Fixing that brought the speed up to par. It wasn't the code generation at fault at all, but this was not discoverable without looking at the assembler.
Looking at the assembler often gives unexpected insight into why a program performs as it does. Unexpected function calls, unanticipated bloat, things that shouldn't be there, etc., all are exposed when looking at it. It isn't necessary to be an assembler crackerjack to be able to pick that up.
If you feel the need for speed, the way to get it is to use a profiler and be willing to examine the assembler for the bottlenecks. Only then is it time to think about better algorithms, faster languages, etc.
Conventional wisdom has it that choosing the best algorithm trumps any micro-optimizations. Though that is undeniably true, there are two caveats that don't get taught in schools. First and most importantly, choosing the best algorithm for a part of the program that has no participation to the performance profile has a negative effect on optimization because it wastes your time that could be better invested in making actual progress, and diverts attention from the parts that matter. Second, algorithms' performance always varies with the statistics of the data they operate on. Even bubble sort, the butt of all jokes, is still the best on almost-sorted data that has only a few unordered items. So worrying about using good algorithms without measuring where they matter is a waste of time - your's and computer's.
Just like ordering speed parts from an auto racing catalog isn't going to put you anywhere near the winner's circle (even if you get them installed right), without profiling, you won't know where the problems are without a profiler. Without looking at the assembler, you may know where the problem is, but often won't know why.
Thanks to Bartosz Milewski, David Held, and Andrei Alexandrescu for their helpful comments on a draft of this.