The Evolution of Nvidia GPU Memory Architecture

I went through quite a few papers plus Nvidia's whitepapers, and finally pieced it together.

From Fermi to Pascal, the cache architecture has changed as follows.

1. Fermi

  • The L1 data cache shares a configurable 64 KB with shared memory, typically split 16/48 or 48/16 KB (see the configuration sketch after this list); the hardware supports both reads and writes;
  • There are also a dedicated texture cache for graphics rendering and a constant cache for constant data, both read-only;
  • These L1-level caches are private to each SM; to avoid cache-coherence problems, write requests are not cached in the L1 data cache either;
  • In effect, the L1 level is always read-only;

  • L2 is shared by all SMs and is both readable and writable;

  • When data in L2 is written and a copy of it still exists in some L1, that L1 copy is invalidated, which preserves cache coherency.
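
To make the configurable L1/shared split concrete, here is a minimal sketch using the CUDA runtime API; the kernel name scale and the problem size are illustrative, and error checking is omitted. cudaFuncCachePreferL1 requests the 48 KB L1 / 16 KB shared configuration, while cudaFuncCachePreferShared requests the opposite split.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: it streams global memory and uses no shared memory,
// so the larger 48 KB L1 configuration is the natural choice for it.
__global__ void scale(float *out, const float *in, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

int main() {
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Request the 48 KB L1 / 16 KB shared split for this kernel;
    // cudaFuncCachePreferShared would request 16 KB L1 / 48 KB shared instead.
    cudaFuncSetCacheConfig(scale, cudaFuncCachePreferL1);

    scale<<<(n + 255) / 256, 256>>>(out, in, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```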

2. Kepler

The cache hierarchy is largely inherited from the Fermi architecture; the new feature relative to Fermi is a 48 KB read-only data cache, dedicated to caching read-only data.

Everything else is the same as in Fermi.
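
A minimal sketch of how a kernel can route its loads through that 48 KB read-only data cache on Kepler-class parts (compute capability 3.5 and later); the kernel name saxpy_ro is illustrative. Either marking a pointer const ... __restrict__ or loading through the __ldg() intrinsic tells the compiler the data is read-only for the lifetime of the kernel.

```cuda
#include <cuda_runtime.h>

// saxpy-style kernel: x is read-only, so its loads can go through the
// 48 KB read-only data cache on Kepler (compile with, e.g., -arch=sm_35).
__global__ void saxpy_ro(float *y, const float * __restrict__ x,
                         float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __ldg() is the explicit form; the const __restrict__ qualifier
        // alone is usually enough for the compiler to use the same path.
        y[i] += a * __ldg(&x[i]);
    }
}
```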

3. Maxwell

This time Nvidia made a bigger change: caching writes at the L1 level was dropped entirely, and the L1 data cache was unified with the dedicated texture cache into a single unit; whether global loads use this unified cache can be chosen per workload (see the sketch after the quoted passage below).

  • global loads are cached in L2 only

  • local loads are cached in L2 only

  • Quoted from Nvidia's manual:

    Maxwell combines the functionality of the L1 and texture caches into a single unit.

    As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG
    read-only data cache mechanism introduced in Kepler.

    In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also
    allows applications to opt-in to caching of global loads in its unified L1/Texture cache.

    The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to
    nvcc at compile time.

    Local loads also are cached in L2 only, which could increase the cost of register spilling
    if L1 local load hit rates were high with Kepler.

    The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance.
    Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from
    somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.

    The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering
    up the data requested by the threads of a warp prior to delivery of that data to the warp.
    This function previously was served by the separate L1 cache in Fermi and Kepler.
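
A minimal sketch of the opt-in described in the quoted passage; the kernel and file names are illustrative. By default the global loads below hit L2 only; adding -Xptxas -dlcm=ca at compile time opts in to also caching them in the unified L1/Texture cache on GM204-class GPUs.

```cuda
// vec_add.cu -- illustrative file name.
// Default build:        nvcc -arch=sm_52 vec_add.cu
//                       (global loads cached in L2 only)
// Opt-in build (GM204): nvcc -arch=sm_52 -Xptxas -dlcm=ca vec_add.cu
//                       (global loads also cached in the unified L1/Texture cache)
#include <cuda_runtime.h>

__global__ void vec_add(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Ordinary global loads; their caching policy is set by the -dlcm flag above.
    if (i < n) c[i] = a[i] + b[i];
}
```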

4. Pascal

Apart from a larger L2 cache, the cache architecture is inherited from the previous generation, Maxwell.
