GPU Architecture Basics: L1 Data Cache & Unified L2 Cache in the Fermi Architecture

NVIDIA Parallel DataCache™ with Configurable L1 and Unified L2 Cache
Working with hundreds of GPU computing
applications from various industries, we learned
that while Shared memory benefits many
problems, it is not appropriate for all problems.
Some algorithms map naturally to Shared
memory, others require a cache, while others
require a combination of both. The optimal
memory hierarchy should offer the benefits of
both Shared memory and cache, and allow the
programmer a choice over its partitioning. The
Fermi memory hierarchy adapts to both types of
program behavior.

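1. To serve programs that are shared-memory friendly, programs that are cache friendly, and programs that need both, NVIDIA's Fermi architecture introduces a configurable split between L1 cache and shared memory.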

Adding a true cache hierarchy for load / store
operations presented significant
challenges. Traditional GPU architectures
support a read-only “load” path for texture
operations and a write-only “export” path for
pixel data output. However, this approach is
poorly suited to executing general purpose C or
C++ thread programs that expect reads and
writes to be ordered. As one example: spilling a
register operand to memory and then reading it
back creates a read after write hazard; if the
read and write paths are separate, it may be necessary to explicitly flush the entire write /
“export” path before it is safe to issue the read, and any caches on the read path would not be
coherent with respect to the write data.

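2. Traditional GPU architectures provide a read-only load path for texture operations and a write-only “export” path for pixel data output. This does not suit general-purpose GPU programs written in C/C++, which expect reads and writes to be ordered.
For example, when a register is spilled to memory and then read back, the read follows a write to the same address. If the two paths are separate, the read may not observe the written data, which is a coherence problem.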

The Fermi architecture addresses this challenge by implementing a single unified memory
request path for loads and stores, with an L1 cache per SM multiprocessor and unified L2
cache that services all operations (load, store and texture). The per-SM L1 cache is
configurable to support both shared memory and caching of local and global memory
operations. The 64 KB memory can be configured as either 48 KB of Shared memory with 16
KB of L1 cache, or 16 KB of Shared memory with 48 KB of L1 cache. When configured with
48 KB of shared memory, programs that make extensive use of shared memory (such as
electrodynamic simulations) can perform up to three times faster. For programs whose memory
accesses are not known beforehand, the 48 KB L1 cache configuration offers greatly improved
performance over direct access to DRAM.
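
The two configurations can be summarized in a short C++ sketch. The enum `Split` and the helper `pick_split` are hypothetical names introduced only for illustration, and the selection rule is a simplifying assumption (prefer the larger L1 unless one thread block needs more than 16 KB of shared memory), not guidance from the whitepaper. In real CUDA code the preference is normally passed to the runtime through `cudaFuncSetCacheConfig` with `cudaFuncCachePreferShared` or `cudaFuncCachePreferL1`; this sketch does not call the runtime.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names; the sizes are the ones quoted in the text above.
enum class Split { Shared48_L1_16, Shared16_L1_48 };

constexpr std::size_t kKB = 1024;
constexpr std::size_t kTotalBytes = 64 * kKB;  // per-SM configurable memory

// Shared memory capacity under a given split.
constexpr std::size_t shared_bytes(Split s) {
    return s == Split::Shared48_L1_16 ? 48 * kKB : 16 * kKB;
}

// L1 cache capacity is whatever remains of the 64 KB.
constexpr std::size_t l1_bytes(Split s) {
    return kTotalBytes - shared_bytes(s);
}

// Assumed heuristic: keep the 48 KB L1 unless a single thread block
// needs more shared memory than the 16 KB configuration provides.
inline Split pick_split(std::size_t shared_bytes_per_block) {
    assert(shared_bytes_per_block <= 48 * kKB);  // beyond either config
    return shared_bytes_per_block > 16 * kKB ? Split::Shared48_L1_16
                                             : Split::Shared16_L1_48;
}
```

For instance, `pick_split(20 * 1024)` selects the 48 KB shared memory configuration, while `pick_split(4 * 1024)` keeps the 48 KB L1 cache. A real tuning decision would also weigh occupancy and measured cache hit rates.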
