@(Tools/Plugins)
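I recently came across a tool that can estimate DRAM access power. It is useful for designs that need to analyze the power and latency of off-chip memory (DRAM) accesses, such as deep learning accelerators.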
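CACTI is an analytical tool that takes a set of cache/memory parameters as input and calculates the access time, power, cycle time, and area. The current version is 7.0, and it provides power, delay, area, and cycle-time models for the following kinds of memory: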
Direct-mapped caches
Set-associative caches
Fully associative caches
Embedded DRAM memories
Commodity DRAM memories
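In addition, it offers the following features:
Support for modeling multi-ported uniform cache access (UCA) and multi-banked, multi-ported non-uniform cache access (NUCA).
Leakage power calculation that also takes the operating temperature into account.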
Router power model.
Interconnect model with different delay, power, and area properties including low-swing wire model.
An interface to perform trade-off analysis involving power, delay, area, and bandwidth.
All process-specific values used by the tool are obtained from ITRS, and the tool supports the 90nm, 65nm, 45nm, and 32nm technology nodes. (The sample run below also reports a 22 nm technology node.)
Chip IO model to calculate latency and energy for DDR bus. Users can model different loads (fan-outs) and evaluate the impact on frequency and energy. This model can be used to study LR-DIMMs, R-DIMMs, etc.
Source code: https://github.com/HewlettPackard/cacti
Technical report: http://www.hpl.hp.com/techreports/2013/HPL-2013-79.pdf
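I couldn't get it running on Windows (the C++ libraries there lack pthread, and I didn't find a simple workaround), so I tested it on CentOS instead. Basic usage:
Run make to build the cacti executable, then run ./cacti -infile ***.cfg, for example:
./cacti -infile sample_config_files/ddr3_cache.cfg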
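To call CACTI from a script (for example, to sweep several config files), a minimal Python sketch follows. It assumes the binary built by make accepts `-infile <cfg>` exactly as in the commands above; `run_cacti` is a hypothetical helper of my own, not part of CACTI. Because the run pasted below ends with a segmentation fault after printing its report, the sketch does not treat a non-zero exit status as fatal.

```python
import subprocess


def run_cacti(cfg_path: str, cacti_bin: str = "./cacti") -> str:
    """Run `<cacti_bin> -infile <cfg_path>` and return its standard output.

    Hypothetical helper: assumes the binary built by `make` takes the
    -infile option exactly as in the commands above.
    """
    result = subprocess.run(
        [cacti_bin, "-infile", cfg_path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=False,          # do not raise: a crash at exit must not discard the report
    )
    if result.returncode != 0:
        # A crash such as a segmentation fault shows up as a negative
        # return code on POSIX (the signal number, negated).
        print(f"warning: {cacti_bin} exited with status {result.returncode}")
    return result.stdout
```

From the CACTI source directory, `report = run_cacti("sample_config_files/ddr3_cache.cfg")` returns the same text the shell command prints. When output is piped rather than sent to a terminal, anything still buffered at the moment of a crash could be lost, so it is worth comparing once against a direct shell run.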
When it finishes, it prints a detailed report, pasted below:
Cache size : 8388608
Block size : 64
Associativity : 8
Read only ports : 0
Write only ports : 0
Read write ports : 1
Single ended read ports : 0
Cache banks (UCA) : 1
Technology : 0.022
Temperature : 360
Tag size : 42
array type : Cache
Model as memory : 0
Model as 3D memory : 0
Access mode : 0
Data array cell type : 0
Data array peripheral type : 0
Tag array cell type : 0
Tag array peripheral type : 0
Optimization target : 2
Design objective (UCA wt) : 0 0 0 100 0
Design objective (UCA dev) : 20 100000 100000 100000 100000
Cache model : 0
Nuca bank : 0
Wire inside mat : 1
Wire outside mat : 1
Interconnect projection : 1
Wire signaling : 1
Print level : 1
ECC overhead : 1
Page size : 8192
Burst length : 8
Internal prefetch width : 8
Force cache config : 0
Subarray Driver direction : 1
iostate : READ
dram_ecc : NO_ECC
io_type : DDR3
dram_dimm : UDIMM
IO Area (sq.mm) = inf
IO Timing Margin (ps) = 35.8333
IO Votlage Margin (V) = 0.155
IO Dynamic Power (mW) = 1282.42 PHY Power (mW) = 232.752 PHY Wakeup Time (us) = 27.503
IO Termination and Bias Power (mW) = 3136.7
---------- CACTI (version 7.0.3DD Prerelease of Aug, 2012), Uniform Cache Access SRAM Model ----------
Cache Parameters:
Total cache size (bytes): 8388608
Number of banks: 1
Associativity: 8
Block size (bytes): 64
Read/write Ports: 1
Read ports: 0
Write ports: 0
Technology size (nm): 22
Access time (ns): 3.03414
Cycle time (ns): 1.84197
Total dynamic read energy per access (nJ): 0.381869
Total dynamic write energy per access (nJ): 0.446873
Total leakage power of a bank (mW): 2520.29
Total gate leakage power of a bank (mW): 4.71441
Cache height x width (mm): 3.07383 x 2.89775
Best Ndwl : 8
Best Ndbl : 8
Best Nspd : 2
Best Ndcm : 1
Best Ndsam L1 : 8
Best Ndsam L2 : 1
Best Ntwl : 16
Best Ntbl : 8
Best Ntspd : 8
Best Ntcm : 1
Best Ntsam L1 : 8
Best Ntsam L2 : 2
Data array, H-tree wire type: Global wires with 30% delay penalty
Tag array, H-tree wire type: Global wires with 30% delay penalty
Time Components:
Data side (with Output driver) (ns): 3.03414
H-tree input delay (ns): 0.860695
Decoder + wordline delay (ns): 0.607741
Bitline delay (ns): 0.473783
Sense Amplifier delay (ns): 0.00189739
H-tree output delay (ns): 1.09002
Tag side (with Output driver) (ns): 0.866708
H-tree input delay (ns): 0.250295
Decoder + wordline delay (ns): 0.0962495
Bitline delay (ns): 0.078
Sense Amplifier delay (ns): 0.00189739
Comparator delay (ns): 0.0162774
H-tree output delay (ns): 0.440265
Power Components:
Data array: Total dynamic read energy/access (nJ): 0.360657
Total energy in H-tree (that includes both address and data transfer) (nJ): 0.270396
Output Htree inside bank Energy (nJ): 0.263979
Decoder (nJ): 0.000237668
Wordline (nJ): 0.000275334
Bitline mux & associated drivers (nJ): 0
Sense amp mux & associated drivers (nJ): 0
Bitlines precharge and equalization circuit (nJ): 0.00163006
Bitlines (nJ): 0.0612354
Sense amplifier energy (nJ): 0.0018371
Sub-array output driver (nJ): 0.0249178
Total leakage power of a bank (mW): 2357.99
Total leakage power in H-tree (that includes both address and data network) ((mW)): 18.9776
Total leakage power in cells (mW): 0
Total leakage power in row logic(mW): 0
Total leakage power in column logic(mW): 0
Total gate leakage power in H-tree (that includes both address and data network) ((mW)): 0.0916133
Tag array: Total dynamic read energy/access (nJ): 0.0212128
Total leakage read/write power of a bank (mW): 162.298
Total energy in H-tree (that includes both address and data transfer) (nJ): 0.00268136
Output Htree inside a bank Energy (nJ): 0.00104879
Decoder (nJ): 0.000585105
Wordline (nJ): 0.000356972
Bitline mux & associated drivers (nJ): 0
Sense amp mux & associated drivers (nJ): 0.000288214
Bitlines precharge and equalization circuit (nJ): 0.00153419
Bitlines (nJ): 0.0132631
Sense amplifier energy (nJ): 0.00155643
Sub-array output driver (nJ): 8.13397e-05
Total leakage power of a bank (mW): 162.298
Total leakage power in H-tree (that includes both address and data network) ((mW)): 0.23223
Total leakage power in cells (mW): 0
Total leakage power in row logic(mW): 0
Total leakage power in column logic(mW): 0
Total gate leakage power in H-tree (that includes both address and data network) ((mW)): 0.00146699
Area Components:
Data array: Area (mm2): 7.28836
Height (mm): 3.07383
Width (mm): 2.3711
Area efficiency (Memory cell area/Total area) - 73.1983 %
MAT Height (mm): 0.716448
MAT Length (mm): 0.540768
Subarray Height (mm): 0.328909
Subarray Length (mm): 0.26532
Tag array: Area (mm2): 0.377107
Height (mm): 0.716051
Width (mm): 0.526648
Area efficiency (Memory cell area/Total area) - 74.9106 %
MAT Height (mm): 0.173381
MAT Length (mm): 0.063873
Subarray Height (mm): 0.0822272
Subarray Length (mm): 0.027995
Wire Properties:
Delay Optimal
Repeater size - 42.0297
Repeater spacing - 0.0329013 (mm)
Delay - 0.216837 (ns/mm)
PowerD - 0.000279845 (nJ/mm)
PowerL - 0.0215298 (mW/mm)
PowerLgate - 9.15623e-05 (mW/mm)
Wire width - 0.022 microns
Wire spacing - 0.022 microns
5% Overhead
Repeater size - 17.0297
Repeater spacing - 0.0329013 (mm)
Delay - 0.226875 (ns/mm)
PowerD - 0.0001818 (nJ/mm)
PowerL - 0.00872349 (mW/mm)
PowerLgate - 3.70994e-05 (mW/mm)
Wire width - 0.022 microns
Wire spacing - 0.022 microns
10% Overhead
Repeater size - 15.0297
Repeater spacing - 0.0329013 (mm)
Delay - 0.235988 (ns/mm)
PowerD - 0.000174237 (nJ/mm)
PowerL - 0.00769899 (mW/mm)
PowerLgate - 3.27424e-05 (mW/mm)
Wire width - 0.022 microns
Wire spacing - 0.022 microns
20% Overhead
Repeater size - 12.0297
Repeater spacing - 0.0329013 (mm)
Delay - 0.257722 (ns/mm)
PowerD - 0.00016297 (nJ/mm)
PowerL - 0.00616223 (mW/mm)
PowerLgate - 2.62069e-05 (mW/mm)
Wire width - 0.022 microns
Wire spacing - 0.022 microns
30% Overhead
Repeater size - 10.0297
Repeater spacing - 0.0329013 (mm)
Delay - 0.28134 (ns/mm)
PowerD - 0.000155511 (nJ/mm)
PowerL - 0.00513773 (mW/mm)
PowerLgate - 2.18498e-05 (mW/mm)
Wire width - 0.022 microns
Wire spacing - 0.022 microns
Low-swing wire (1 mm) - Note: Unlike repeated wires,
delay and power values of low-swing wires do not
have a linear relationship with length.
delay - 0.0902442 (ns)
powerD - 2.8399e-06 (nJ)
PowerL - 1.71796e-07 (mW)
PowerLgate - 1.29017e-09 (mW)
Wire width - 4.4e-08 microns
Wire spacing - 4.4e-08 microns
Segmentation fault
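Several headline numbers in this report can be cross-checked against the breakdowns printed under them. In this run, to within rounding, the five data-side delay stages add up to the access time (3.03414 ns), the data-array and tag-array read energies add up to the total read energy per access (0.381869 nJ), and the data and tag leakage powers add up to the total leakage (2520.29 mW). The tag side is not a plain sum of its own stages, so it is left out. The Python sketch below parses `label: number` lines and repeats these checks; the line format is inferred from this output, not from any documented CACTI format, and the embedded excerpt copies only a few lines of the report above.

```python
import math
import re
from typing import List, Tuple

# Lines copied from the report above, in their original order
# (leading indentation dropped). Pass the full report text in practice.
REPORT_EXCERPT = """\
Access time (ns): 3.03414
Total dynamic read energy per access (nJ): 0.381869
Total leakage power of a bank (mW): 2520.29
Data side (with Output driver) (ns): 3.03414
H-tree input delay (ns): 0.860695
Decoder + wordline delay (ns): 0.607741
Bitline delay (ns): 0.473783
Sense Amplifier delay (ns): 0.00189739
H-tree output delay (ns): 1.09002
Tag side (with Output driver) (ns): 0.866708
H-tree input delay (ns): 0.250295
Data array: Total dynamic read energy/access (nJ): 0.360657
Total leakage power of a bank (mW): 2357.99
Tag array: Total dynamic read energy/access (nJ): 0.0212128
Total leakage power of a bank (mW): 162.298
"""

# "label: number"; the greedy label splits on the LAST colon. Lines whose value
# is not one number (e.g. "Cache height x width (mm): 3.07383 x 2.89775") are skipped.
LINE_RE = re.compile(r"^(?P<key>.+):\s*(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$")

# Data-side delay stages, in the order CACTI prints them.
DATA_SIDE_STAGES = [
    "H-tree input delay (ns)",
    "Decoder + wordline delay (ns)",
    "Bitline delay (ns)",
    "Sense Amplifier delay (ns)",
    "H-tree output delay (ns)",
]


def parse_report(text: str) -> List[Tuple[str, float]]:
    """Return (label, value) pairs in report order; labels can repeat."""
    pairs = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            pairs.append((m.group("key").strip(), float(m.group("value"))))
    return pairs


def values(pairs: List[Tuple[str, float]], key: str) -> List[float]:
    """All values printed under `key`, in report order."""
    return [v for k, v in pairs if k == key]


def check_breakdown(pairs: List[Tuple[str, float]]) -> dict:
    """Pair each headline number with the sum of its printed components."""
    access = values(pairs, "Access time (ns)")[0]
    # The data side is printed before the tag side, so take the first match.
    data_delay = sum(values(pairs, stage)[0] for stage in DATA_SIDE_STAGES)

    read_total = values(pairs, "Total dynamic read energy per access (nJ)")[0]
    read_parts = (values(pairs, "Data array: Total dynamic read energy/access (nJ)")[0]
                  + values(pairs, "Tag array: Total dynamic read energy/access (nJ)")[0])

    # Printed three times: whole cache, then data array, then tag array.
    leak_total, leak_data, leak_tag = values(pairs, "Total leakage power of a bank (mW)")[:3]

    return {
        "access_time_ns": (access, data_delay),
        "read_energy_nJ": (read_total, read_parts),
        "leakage_mW": (leak_total, leak_data + leak_tag),
    }


for name, (reported, summed) in check_breakdown(parse_report(REPORT_EXCERPT)).items():
    ok = math.isclose(reported, summed, rel_tol=1e-4)
    print(f"{name}: reported={reported:g} sum_of_parts={summed:g} match={ok}")
```

The relative tolerance absorbs the rounding in CACTI's printed values; a mismatch on a different config would more likely mean the report layout differs than that CACTI is wrong.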
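Of these, the two lines under
Cache Parameters:
Total dynamic read energy per access (nJ): 0.381869
Total dynamic write energy per access (nJ): 0.446873
give the energy (in nJ) of a single read access and a single write access. According to the report header, this sample configuration models an 8 MB (8388608-byte) SRAM cache ("Uniform Cache Access SRAM Model"), so these figures describe that cache array; the IO lines near the top of the report (io_type : DDR3) describe the DDR3 interface separately.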
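In an accelerator study, per-access figures like these are typically turned into a rough workload energy: multiply the number of read and write accesses by the corresponding energies and, optionally, add leakage power times runtime. A minimal Python sketch, assuming every transfer is split into 64-byte accesses (the Block size in this config) and that the caller supplies the runtime; the function names and the example traffic volumes are hypothetical.

```python
# Figures copied from the report above (8 MB cache, 64-byte blocks, 22 nm).
READ_ENERGY_NJ = 0.381869   # Total dynamic read energy per access (nJ)
WRITE_ENERGY_NJ = 0.446873  # Total dynamic write energy per access (nJ)
LEAKAGE_MW = 2520.29        # Total leakage power of a bank (mW)
BLOCK_BYTES = 64            # Block size (bytes)


def access_count(num_bytes: int, block_bytes: int = BLOCK_BYTES) -> int:
    """Block-sized accesses needed to move num_bytes, rounded up."""
    return -(-num_bytes // block_bytes)


def workload_energy_nj(read_bytes: int, write_bytes: int, runtime_ns: float = 0.0) -> float:
    """Dynamic read/write energy plus leakage over runtime_ns, in nJ."""
    dynamic = (access_count(read_bytes) * READ_ENERGY_NJ
               + access_count(write_bytes) * WRITE_ENERGY_NJ)
    # mW * ns = pJ, so divide by 1000 to express leakage energy in nJ.
    leakage = LEAKAGE_MW * runtime_ns / 1000.0
    return dynamic + leakage


# Hypothetical layer: read 1 MiB of weights, write 512 KiB of outputs.
print(workload_energy_nj(read_bytes=1 << 20, write_bytes=1 << 19))
```

The example corresponds to 16384 reads and 8192 writes, about 9.92 µJ of dynamic energy. Leakage for the whole array is roughly 2.5 W, so even a few microseconds of runtime adds energy of the same order, which makes the runtime assumption important. For off-chip DRAM traffic, the constants would have to come from a run that models DRAM rather than this SRAM cache.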
For what each configuration entry means, see the technical report linked above; I'll look into it further when I have time.
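As one small example of how the configuration entries fit together, the three basic geometry entries in the config echo (Cache size 8388608, Block size 64, Associativity 8) fix the organization in the usual way for a set-associative cache: sets = size / (block size × associativity). A Python sketch of that arithmetic, assuming power-of-two sizes; `cache_geometry` is a hypothetical helper, not part of CACTI, and the Tag size entry (42) depends on CACTI's own address-width assumption, which is not derived here.

```python
def cache_geometry(size_bytes: int, block_bytes: int, assoc: int) -> dict:
    """Block count, set count, and address-bit split of a set-associative cache."""
    for name, v in (("size_bytes", size_bytes), ("block_bytes", block_bytes), ("assoc", assoc)):
        # v & (v - 1) is zero only for powers of two.
        if v <= 0 or v & (v - 1):
            raise ValueError(f"{name} must be a positive power of two, got {v}")
    num_blocks = size_bytes // block_bytes
    num_sets = num_blocks // assoc
    return {
        "blocks": num_blocks,
        "sets": num_sets,
        "offset_bits": block_bytes.bit_length() - 1,  # log2(block_bytes)
        "index_bits": num_sets.bit_length() - 1,      # log2(num_sets)
    }


# Values from the config echoed at the top of the report above.
print(cache_geometry(size_bytes=8388608, block_bytes=64, assoc=8))
```

For this config it gives 131072 blocks in 16384 sets, with 6 block-offset bits and 14 index bits.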