Improving a Multi-Core Processor Architecture

Design Space Exploration: Multi-Core

Introduction

In this assignment, I map an application onto a multi-core x86 platform simulated by Snipersim. The goal is to optimize the Energy-Delay-Area Product (EDAP). The application is the same EEG application used in the single-core assignment, but this time I look at running it on a multi-core processor, again using Snipersim to simulate the processor.


Step 1: Find the parallel region. 

The first step is to find the parallel regions. At first I used the #pragma omp parallel for syntax, but the for loop appeared to run multiple times, perhaps because some of the calculations are recursive or a calculation in one thread needs operands from other threads. I then tried the sections syntax, splitting the loop into several independent sections, and that worked.

I first looked for parallel regions in main.c. But the structure of main.c is very simple, and I could not find any region whose parallelization would improve the EDAP sharply. I did find one region that can run in parallel, but it only decreases the delay slightly. I gave this region two cores, because with more cores the overhead would exceed the improvement gained from the parallelism.

Then I noticed that main.c includes Analysis.h and graphics.h, so I figured I had a better chance of finding parallel regions in Analysis.c and graphics.c. Analysis.c contains a large for loop, and parallelizing that region promised a big performance gain. After some small changes to the code it worked and reduced the delay sharply, though it initially gave worse EDAP; I expected this could be fixed in the following steps. Even better, the instructions are evenly distributed across the cores: the master core runs only slightly more instructions than the others.

I then worked on graphics.c, fft.c and the other files (some of these attempts can be seen in the program comments), but I could not find any other parallel regions.

    

Step 2: Decide the number of cores

Next I tried to decide how many cores to use, testing 1, 4, 8, and 16 cores. As Figure 1 shows, the EDAP increases almost linearly with the core count. It seems that more cores means higher EDAP, but that may be because the architecture has not yet been optimized for multiple cores. We cannot draw any conclusion from this figure alone, so I decided to do some optimization first, hoping for more meaningful results.


 

Figure 1: EDAP vs core number

 

I decided to do a coarse optimization on the 16-core architecture and then apply the resulting parameters to the other configurations; from there I could pick the best option and fine-tune it. For the coarse tuning I followed the rule: simple and small first-level cache, complex and large third-level cache. This dramatically reduced the EDAP: a small, simple L1 cache gives a shorter response time, while a large L3 cache covers the misses of the first and second levels and thus reduces the miss penalty. If we could reduce the L1 hit time we could also raise the CPU frequency, but this simulator does not model that improvement. I had already tested this rule in the single-core project, and to my surprise it is still this effective in a multi-core system. After several experiments, I decided to use the following configuration:

L1 data cache: 16 KB, 1-way (direct-mapped), per core;

L1 instruction cache: 8 KB, 1-way, per core;

L2 unified cache: 128 KB, 4-way, per core;

L3 unified cache: 1024 KB, 8-way, shared by all cores.
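In Sniper these parameters are normally set in a config file. A hypothetical fragment is sketched below; the section and key names follow the layout of the shipped gainestown.cfg as I recall it (sizes in KB), so verify against the files in config/ before using it:

```ini
# Hypothetical Sniper config fragment -- check config/gainestown.cfg
# for the authoritative section and key names.
[perf_model/l1_dcache]
cache_size = 16        ; KB, per core
associativity = 1      ; direct-mapped

[perf_model/l1_icache]
cache_size = 8
associativity = 1

[perf_model/l2_cache]
cache_size = 128
associativity = 4
shared_cores = 1       ; private per core

[perf_model/l3_cache]
cache_size = 1024
associativity = 8
shared_cores = 16      ; one L3 shared by all 16 cores
```

Such a file is typically passed to run-sniper with -c, and individual keys can be overridden on the command line with -g.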

I then applied this configuration to the different core counts; the results are shown in Figure 2 and Table 1. From Figure 2, the minimum EDAP occurs at 4 cores. Table 1 shows that more cores reduce not only the cycle count but also the energy consumption: although a more complex structure draws higher power, the shorter run time reduces the total energy. The 16-core structure has the highest energy consumption, perhaps because I did not take full advantage of its compute capability; if I could find more parallel regions, the 8-core or even the 16-core structure might give the best EDAP. The next step therefore focuses on the 4-core architecture.

 


Figure 2: EDAP vs. core number for the coarsely tuned structure

 

 

 

          cycles       energy (J)   area (mm^2)   EDAP
1 core    220889172    1.18         31.32         8163533663
2 cores   120179448    1.00         52.10         6261349241
4 cores   66737510     0.92         93.62         5748128431
8 cores   40261080     1.02         187.58        7703216854
16 cores  26667673     1.21         376.18        1.2139E+10

Table 1: performance for different core counts

 

Step 3: Fine optimization of the 4-core structure

I found that the L3 cache miss rate is quite low, so its size can be reduced to save area and power. From Table 2, a 256 KB L3 cache is big enough and gives the best EDAP. I did the same for the L2 cache and found that 32 KB is enough. The miss rate of the L1 data cache is 15%, which makes it a key point for reducing the delay. I set its size to 16, 32, and 64 KB, only to find that larger sizes cost more energy and area while the miss rate barely changes, because of the data's poor spatial locality. So I kept the original data cache size unchanged.

L3 size (KB)   cycles      energy (J)   area (mm^2)   EDAP
1024           66737510    0.92         93.62         5748128431
512            66963905    0.92         90.32         5564325508
256            67873658    0.92         88.65         5535639799
128            71469975    0.96         87.44         5999361229

Table 2: performance of the 4-core structure with different L3 cache sizes

 

Since the parallel region I chose is a for loop, and most of the program's instructions are inside that loop, all four cores run almost the same instructions. So why not share a single L1 instruction cache? It reduces cache misses, because the cache is warmed up once instead of four times, and it also makes the chip slightly smaller. In my test the delay did indeed drop a little. However, sharing the first-level cache would not be a good idea if other programs also ran on this architecture.

To reduce the L1 data cache hit time, I set write-through = 1, but it increased the EDAP: because the data may be reused several times, write-through increases the miss rate.

Then I had a more aggressive idea: since the L2 miss rate was already very small, why not throw the L3 cache away entirely? I first simply removed the L3 cache and found the result was not bad, so I decided to optimize the structure without an L3. I varied the L2 size over 64, 128, and 256 KB; as Table 3 shows, 64 KB gives the best result. Then I made the L2 cache shared by the 4 cores without changing its total size; unsurprisingly, the result was poor, since each core is effectively left with only a quarter of the cache.

Then I increased the shared cache size; the results are in Table 4. A shared 256 KB cache performs better in every respect than four per-core 64 KB caches. That makes sense: different cores may share some data and instructions, and one big cache reduces replacement misses. The 512 KB shared cache gives the best performance, so that is the final configuration.

L2 size (KB)   cycles      energy (J)   area (mm^2)   EDAP
32             69728242    0.92         80.52         5165356602
64             68137021    0.91         81.48         5052142069
128            67265511    0.91         83.35         5101988111
256            66876497    0.92         89.31         5494920751

Table 3: performance with different per-core L2 cache sizes, no L3 cache

 

L2 size (KB)   cycles      energy (J)   area (mm^2)   EDAP
128            70283576    0.92         74.18         4796544814
256            68235340    0.89         75.67         4595397678
512            66695195    0.88         77.79         4565632913

Table 4: performance with different shared L2 cache sizes, no L3 cache

 

Conclusion 

The structure with the configuration:

L1 data cache: 16 KB, 1-way, shared by all cores;

L1 instruction cache: 8 KB, 1-way, per core;

L2 unified cache: 512 KB, 8-way, shared by all cores;

no L3 cache at all

gives us the best EDAP: 66695195 cycles of delay, 0.88 J of energy consumption, and 77.79 mm^2 of area. Its EDAP is about 52% of the default structure's; the new structure is also 3.4 times as fast as the default one and consumes only 72% of the energy. As packaging technology develops, chips will keep getting smaller and multi-core EDAP will improve further. Note that this optimization targets only the EEG application: I cut a lot of cache to save energy and area. For a machine that runs only the EEG application this is a good choice; for a machine that must do anything else, it would be a disaster.

I also learned that optimization rules for single-core architectures can carry over to multi-core architectures.
