Study Note: RoofLine Model

Some background knowledge: 


Here is some connection between latency, throughput and concurrency [1]:


Study Note: RoofLine Model_第1张图片


Here is the influence factor of runtime and performance: latency and throughput.


Also, you can reduce running time by increase the throughput. 


An example of higher performance != shorter runtime: Imagine a sparse matrix multiplication will take shorter runtime than normal matrix_mul. However, the throughput will be larger than the former one. 


A useful hit of FLOP: must notice that floating point operation means operation of add and mul. If you code only use one operation, there is no way you can get the peak performance of FLOP. You can only achieve half of it. 


AI (Arithmetic Intensity)


Arithmetic Intensity is the ratio of total floating-point operations to total data movement (bytes). A higher AI means that more likely to achieve the peak performance. Must notice that: 


Arithmetic intensity is ultimately limited by compulsory traffic (this cannot be optimised). However, Arithmetic intensity is diminished by conflict or capacity misses. (These can be optimised). 


RoofLine Model


What is the axises of RoofLine Model?


Answer: Here is the y axis


Study Note: RoofLine Model_第2张图片


It means that if the optimisation did not increase the stream bandwidth, AI* Bandwidth will still smaller than the peak performance and it will be the attainable performance.


and the x axis is 


Actual FLOP:Byte ratio. (AI)


We would like to use this as an example to construct RoofLine Model[1]: 


Study Note: RoofLine Model_第3张图片


The initial model of RoofLine is like this: 



Study Note: RoofLine Model_第4张图片


We can tell once it can reach 4FLOP per 1byte or higher, the 2356 will reach its peak performance. But if the bandwidth is small and bandwidth/total data needed movement(byte) is <1, then actual performance(attainable performance) is smaller than peak performance. (since bandwidth/total data needed in movement is less than 1).


There are others constraint of performance like computation ceils, communication ceils and locality walls in RoofLine Model. Remember that we called RoofLine model is synthesising communication,computation,and locality into a single visually-intuitive performance figure using bound and bottleneck analysis.


Now, we have talked a little about the communication. Let us talk it deeper: 


The stream line limitation just assumes that you can use the all bandwidth to transfer data. However, the problem is whether you can. 


1) Explicit software prefetch instructions are required to achieve peak bandwidth. If you have't used the prefetch instructions, you will get a ceiling like this[1]:


Study Note: RoofLine Model_第5张图片


2) Also, 2356 is NUMANon Uniform Memory Access Architecture. As such memory traffic must be correctly balanced among the two sockets to achieve good Stream bandwidth. If you haven't we will get a ceiling like this: 


Study Note: RoofLine Model_第6张图片


After talking about the communication, let us talk about computation. 


1) If the code is dominated by adds, then attainable performance is half of peak.


Study Note: RoofLine Model_第7张图片


2) Opterons have 128-bit data paths(two ways double precision). If instructions aren’t SIMDized, attainable performance will be halved. 


Study Note: RoofLine Model_第8张图片


3) On Opterons, floating-point instructions have a 4 cycle latency. If we don’t express 4-way ILP(Instruction level parallelism), performance will drop by as much as 4x. 


Study Note: RoofLine Model_第9张图片


What is instruction level parallelism?


Actually, instruction level parallelism has been taken care by compiler automatically. Instruction level parallelism means to find out instructions in a sequence that can be computed at the same.


Even though our compiler has done a good job on taking care of this. But, Compilers may produce valid instruction sequences that have less ILP when executed in order. Like it will produce instructions like: 


Study Note: RoofLine Model_第10张图片 

Obviously, more clever choice will be: 


Study Note: RoofLine Model_第11张图片


If we use a CPU of traditional in order pipeline, it will run the instruction sequence as the compiler tells with two different means: 


Study Note: RoofLine Model_第12张图片


However, even with the in-order super scalar pipeline, it will still reduce the instruction level parallelism. 


However, if we use a out of order pipeline CPU, it allows instruction reordering. 


It can automatically assemble the instructions. 


Study Note: RoofLine Model_第13张图片


Disadvantage: As you can see, much of the portion of this cpu is focusing on direct the instructions and control the boundary issues. It is a significantly more complex processor architecture than simple in order pipeline. And it also consume more energy. 


Now, let us talk about locality issues: 


1) Remember, memory traffic includes more than just compulsory misses. Actual arithmetic intensity may be substantially lower. The original boundary is just for only includes compulsory misses. 


Study Note: RoofLine Model_第14张图片  Study Note: RoofLine Model_第15张图片


The actual roofline model will be: 


Study Note: RoofLine Model_第16张图片


While the AI will be: 


Study Note: RoofLine Model_第17张图片



你可能感兴趣的:(性能,并发,roofline)