Memory Hierarchy

  • Reference: Computer Architecture (6th Edition)

Contents

  • The Principle of Locality
  • Memory Hierarchy
    • Terminology
    • Performance Parameters of a Memory Hierarchy
    • Four Questions for Memory Hierarchy
      • Q1: Where can a block be placed in the upper level?
      • Q2: How is a block found if it is in the upper level?
      • Q3: Which block should be replaced on a miss?
      • Q4: What happens on a write?
  • Virtual Memory Address Space
    • Four Memory Hierarchy Questions Revisited
      • Q1: Where Can a Block Be Placed in Main Memory?
      • Q2: How Is a Block Found If It Is in Main Memory?
      • Q3: Which Block Should Be Replaced on a Virtual Memory Miss?
      • Q4: What Happens on a Write?
    • Caching vs. Demand Paging
  • Cache Design
    • Six Basic Cache Optimizations
    • What causes a MISS?
    • 1. Larger Block Size to Reduce Miss Rate
    • 2. Larger Caches to Reduce Miss Rate
    • 3. Higher Associativity to Reduce Miss Rate
    • 4. Multilevel Caches to Reduce Miss Penalty
    • 5. Giving Priority to Read Misses over Writes to Reduce Miss Penalty
    • 6. Avoiding Address Translation during Indexing of the Cache to Reduce Hit Time
    • 7. Victim Cache to Reduce Miss Rate
    • Summary of Basic Cache Optimization

The Principle of Locality


  • The Principle of Locality:
    • Programs access a relatively small portion of the address space at any instant of time.
    • It is a property of programs which is exploited in machine design.
  • Two Different Types of Locality:
    • Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
    • Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

Memory Hierarchy


  • Goal: Illusion of large, fast, cheap memory



Memory Hierarchy: Apple iMac G5


Terminology

  • Hit: data appears in some block in the upper level (say, block X)
    • Hit Rate: the fraction of memory accesses found in the upper level
      • Usually so high that we talk about the miss rate instead
    • Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
  • Miss: data must be retrieved from a block in the lower level (block Y)
    • Miss Rate = 1 − (Hit Rate)
      • What MIPS is to CPU performance, miss rate is to average memory access time: a convenient but incomplete measure
    • Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
  • Hit Time << Miss Penalty

Performance Parameters of a Memory Hierarchy

  • Let $S$ denote capacity, $T_A$ access time, and $C$ cost per bit.
  • Consider a two-level memory hierarchy built from $M_1$ and $M_2$:
    • Parameters of $M_1$: $S_1$, $T_{A_1}$, $C_1$
    • Parameters of $M_2$: $S_2$, $T_{A_2}$, $C_2$
  • Average cost per bit $C$ (the capacity-weighted average of the two levels):
    $C = (C_1 S_1 + C_2 S_2) / (S_1 + S_2)$
  • Hit rate $H$ and miss rate $F$:
    $H = N_1 / (N_1 + N_2)$, $F = 1 - H$
    • $N_1$: number of accesses satisfied by $M_1$; $N_2$: number of accesses that must go to $M_2$
  • Average memory access time = Hit time + Miss rate × Miss penalty (see the sketch after this list):
    $T_A = H\,T_{A_1} + (1-H)(T_{A_1} + T_M) = T_{A_1} + (1-H)\,T_M = T_{A_1} + F\,T_M$
    • Miss penalty $T_M$: time to fetch a block from the lower level, including the time to deliver it to the CPU
      • access time: time to reach the lower level = $f$(latency of the lower level)
      • transfer time: time to transfer the block = $f$(bandwidth between the upper and lower levels)
      • The time from issuing a request to $M_2$ until the whole block resides in $M_1$ is $T_M = T_{A_2} + T_B$, where $T_B$ is the time to transfer one block (it varies with the amount of data and the bus width)
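A minimal sketch of these formulas in Python (the function names and example numbers are mine, not from the text):

```python
def avg_cost_per_bit(c1, s1, c2, s2):
    """Average cost per bit: C = (C1*S1 + C2*S2) / (S1 + S2)."""
    return (c1 * s1 + c2 * s2) / (s1 + s2)

def avg_access_time(t_a1, t_m, miss_rate):
    """T_A = T_A1 + F * T_M: hit time plus miss rate times miss penalty."""
    return t_a1 + miss_rate * t_m

# A 1-cycle upper level backed by a 50-cycle miss penalty, at a 2% miss rate:
print(avg_access_time(t_a1=1, t_m=50, miss_rate=0.02))  # 2.0 cycles
```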

Program Execution Time

  • CPU time = (CPU execution cycles + memory stall cycles) × clock cycle time
    • Memory stall cycles = memory accesses × miss rate × miss penalty

  • Example: suppose the cache miss penalty is 50 clock cycles; ignoring memory stalls, every instruction takes 2.0 clock cycles to execute; the cache miss rate is 2%; and each instruction makes 1.33 memory accesses on average (every instruction must access the instruction cache, but not necessarily the data cache). Analyze the impact of the cache on performance.

  • CPU time = IC × (2.0 + 1.33 × 2% × 50) × clock cycle time = IC × 3.33 × clock cycle time
  • Effective CPI: 3.33
  • Without a cache, however, every access pays the full penalty: CPI = 2.0 + 1.33 × 50 = 68.5 (both cases are scripted below)
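The same arithmetic as a quick check, using the numbers from the example:

```python
miss_penalty = 50              # clock cycles
base_cpi = 2.0                 # CPI with no memory stalls
miss_rate = 0.02
accesses_per_instruction = 1.33

cpi_with_cache = base_cpi + accesses_per_instruction * miss_rate * miss_penalty
cpi_without_cache = base_cpi + accesses_per_instruction * miss_penalty

print(cpi_with_cache)          # 3.33 (up to float rounding)
print(cpi_without_cache)       # 68.5
```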

Four Questions for Memory Hierarchy

Q1: Where can a block be placed in the upper level?

  • Block placement: the mapping rule

  • Fully associative: any block of main memory may be placed in any position in the cache.
    • Highest space utilization, lowest conflict probability, most complex implementation.
  • Direct mapped: each block of main memory can be placed in exactly one position in the cache. Block $i$ of main memory maps to cache block $j = i \bmod M$ (where $M$ is the number of cache blocks). If $M = 2^m$, then in binary $j$ is simply the low $m$ bits of $i$.
    • Lowest space utilization, highest conflict probability, simplest implementation.
  • Set associative: each block of main memory maps to exactly one set in the cache, but may be placed in any position within that set. If block $i$ maps to set $k$, then $k = i \bmod G$ (where $G$ is the number of sets). If $G = 2^g$, then in binary $k$ is the low $g$ bits of $i$.
    • A compromise between direct mapping and full associativity: the higher the associativity, the higher the space utilization, the lower the conflict probability, and the lower the miss rate (see the sketch below).
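The two index computations as code (a toy illustration; the names are mine):

```python
def direct_mapped_block(i, num_blocks):
    """Direct mapped: block i may live only in cache block i mod M."""
    return i % num_blocks

def set_associative_set(i, num_sets):
    """Set associative: block i maps to set i mod G, any way within the set.
    Fully associative is the num_sets == 1 special case: any block, anywhere."""
    return i % num_sets

print(direct_mapped_block(77, 64))  # 13
print(set_associative_set(77, 16))  # 13
```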

Q2: How is a block found if it is in the upper level?

  • Block identification: the lookup algorithm


  • Tag is the block identifier; Index is the set address (the address split is sketched below)
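A sketch of how an address might be split into tag / index / block-offset fields (the field widths below are hypothetical, chosen only for illustration):

```python
def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# e.g. 32-byte blocks (5 offset bits) and 128 sets (7 index bits)
print(split_address(0x12A74, offset_bits=5, index_bits=7))  # (18, 83, 20)
```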

Fully Associative Cache


2-Way Set-Associative Cache


Direct-Mapped Cache


Q3: Which block should be replaced on a miss?

  • Block replacement: the replacement policy

  • Easy for Direct Mapped, no choice
  • Set Associative or Fully Associative:
    • Random
    • First in, first out (FIFO)
    • LRU (Least Recently Used); a toy victim picker is sketched below
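A minimal sketch of LRU victim selection for one set, assuming a per-way timestamp of the most recent access (an assumption for illustration; real hardware uses cheaper approximations):

```python
def lru_victim(last_used):
    """Pick the way whose most recent access is oldest.
    last_used[w] = timestamp of the latest access to way w."""
    return min(range(len(last_used)), key=lambda w: last_used[w])

print(lru_victim([105, 42, 99, 87]))  # way 1 was touched longest ago
```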

Q4: What happens on a write?

  • Write strategy

  • Write through: the information is written to both the block in the cache and the block in the lower-level memory.
  • Write back: the information is written only to the block in the cache; the modified block is written to the lower level only when it is replaced.


  • Additional option – let writes to an un-cached address allocate a new cache line (“write-allocate”), i.e., a write to data that is not currently in the cache.

Write allocate and No-write allocate

  • Write allocate (fetch on write)
    • The block is allocated on a write miss: the block containing the written word is first fetched into the cache, and then the write is performed.
    • Write-back caches generally use this, so the data being written is always in the cache.
  • No-write allocate (write around)
    • In this apparently unusual alternative, write misses do not affect the cache. Instead, the block is modified only in the lower-level memory.
    • Write-through caches often use this.

Write Policy Choices

  • Cache hit: write through / write back
  • Cache miss: no write allocate / write allocate
  • Common combinations (a toy sketch of both follows this list):
    • write through & no write allocate
    • write back & write allocate
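A minimal model of the two combinations, assuming dict-backed cache and memory (an illustration only, not a real simulator):

```python
def write_through_no_allocate(cache, memory, addr, value):
    """Write through + no-write allocate: memory is always updated;
    the cached copy is updated only if the block is already present."""
    memory[addr] = value
    if addr in cache:
        cache[addr] = value

def write_back_allocate(cache, dirty, memory, addr, value):
    """Write back + write allocate: a write miss first fetches the block,
    then the write dirties only the cached copy."""
    if addr not in cache:
        cache[addr] = memory.get(addr, 0)  # allocate: fetch on write
    cache[addr] = value
    dirty.add(addr)                        # written back to memory on eviction
```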

Write Buffers for Write-Through Caches


  • Q. Why a write buffer?
    • So CPU doesn’t stall
  • Q. Why a buffer, why not just one register?
    • Bursts of writes are common.
  • Q. Are Read After Write (RAW) hazards an issue for write buffer?
    • Yes! Either drain the buffer before the next read, or check the write buffer and send the read first when there is no conflict.

Virtual Memory Address Space


  • User programs run in a standardized virtual address space (each process has its own virtual address space)
  • Address-translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
  • The hardware supports “modern” OS features: protection, translation, sharing

Use virtual addresses for cache?

  • A. The synonym problem: if two address spaces share a physical frame, data may be in the cache twice. Maintaining consistency is a nightmare.

Four Memory Hierarchy Questions Revisited

Q1: Where Can a Block Be Placed in Main Memory?

  • Operating systems allow blocks to be placed anywhere in main memory. The strategy is fully associative.

Q2: How Is a Block Found If It Is in Main Memory?

  • Both paging and segmentation rely on a data structure that is indexed by the page or segment number.
    • For paging, this data structure is a page table, which contains the physical page address. It is indexed by the virtual page number, and its size is the number of pages in the virtual address space.
      • Given a 32-bit virtual address, 4 KB pages, and 4 bytes per Page Table Entry (PTE), the size of the page table would be $(2^{32}/2^{12}) \times 2^2 = 2^{22}$ bytes, or 4 MB (checked in the snippet below).
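The same size computation, spelled out:

```python
virtual_address_bits = 32
page_size = 4 * 1024      # 4 KB pages -> 12 offset bits
pte_bytes = 4             # bytes per page table entry

entries = 2**virtual_address_bits // page_size   # 2^20 pages
table_bytes = entries * pte_bytes                # 2^22 bytes
print(entries, table_bytes, table_bytes // 2**20, "MB")  # 1048576 4194304 4 MB
```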

Q3: Which Block Should Be Replaced on a Virtual Memory Miss?

  • Replace the least-recently used (LRU) page.
    • A use bit or reference bit is provided. The operating system periodically clears the bits and later records them so it can determine which pages were touched during a particular time period.

Q4: What Happens on a Write?

  • Because of the great discrepancy in access time, the write strategy is always write back.
    • Virtual memory systems usually include a dirty bit. It allows blocks to be written to disk only if they have been altered since being read from the disk.

Page replacement policy


Caching vs. Demand Paging


Cache Design

Six Basic Cache Optimizations

  • Average memory access time = Hit time + Miss rate x Miss penalty

Reducing Miss Rate

  • (1) Larger Block size (compulsory misses)
  • (2) Larger Cache size (capacity misses)
  • (3) Higher Associativity (conflict misses)

Reducing Miss Penalty

  • (4) Multilevel Caches
  • (5) Giving Reads Priority over Writes
    • E.g., let reads complete before earlier writes in the write buffer

Reducing hit time

  • (6) Avoiding address translation when indexing the cache

What causes a MISS?

Three Major Categories of Cache Misses:

  • Compulsory — the very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.
  • Capacity — if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks being discarded and later retrieved.
  • Conflict — if the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses.

1. Larger Block Size to Reduce Miss Rate

  • Larger block sizes will reduce compulsory misses, because larger blocks take advantage of spatial locality.

Larger Blocks may Increase Conflict Misses

  • Since they reduce the number of blocks in the cache, larger blocks may increase conflict misses and even capacity misses if the cache is small.

Note that for the 16 KB and 4 KB caches, an overly large block size actually drives the miss rate back up: with limited capacity, very large blocks mean the cache can hold only a few of them, which may induce the other kinds of misses.


Larger Blocks may Increase the Miss Penalty

  • Assume the memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Thus, it can supply 16 bytes in 82 clock cycles, 32 bytes in 84 clock cycles, and so on.
    • 16-byte block → miss penalty: 82; 32-byte block → miss penalty: 84 (the arithmetic is scripted below)
    • The selection of block size depends on both the latency and bandwidth of the lower-level memory.
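The miss-penalty arithmetic from this example as a function:

```python
def miss_penalty(block_bytes, overhead=80, bytes_per_beat=16, cycles_per_beat=2):
    """Cycles to fetch one block: fixed overhead plus one beat per 16 bytes."""
    return overhead + (block_bytes // bytes_per_beat) * cycles_per_beat

for size in (16, 32, 64, 128):
    print(size, miss_penalty(size))  # 82, 84, 88, 96
```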

2. Larger Caches to Reduce Miss Rate

  • The obvious way to reduce capacity misses is to increase the capacity of the cache.
    • The obvious drawback is potentially longer hit time and higher cost and power.
    • This technique has been especially popular in off-chip caches.

3. Higher Associativity to Reduce Miss Rate

  • Figure shows how miss rates improve with higher associativity

Higher Associativity Increase the Clock Cycle Time

  • Greater associativity can come at the cost of increased hit time (the trade-off is sketched after this list):
    • Clock cycle time 2-way = 1.36 × Clock cycle time 1-way
    • Clock cycle time 4-way = 1.44 × Clock cycle time 1-way
    • Clock cycle time 8-way = 1.52 × Clock cycle time 1-way
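A back-of-the-envelope AMAT comparison of that trade-off. The clock-cycle factors are from the list above; the miss rates and the 25-cycle miss penalty are hypothetical, made up purely for illustration:

```python
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

cycle_factor = {1: 1.00, 2: 1.36, 4: 1.44, 8: 1.52}   # from the list above
miss_rate = {1: 0.050, 2: 0.041, 4: 0.038, 8: 0.037}  # hypothetical values

for ways in (1, 2, 4, 8):
    # hit time scales with the clock cycle; miss penalty fixed at 25 cycles
    print(ways, round(amat(cycle_factor[ways], miss_rate[ways], 25), 3))
```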

  • The hardware complexity is too high; this is also why machines today have at most about three levels of cache

4. Multilevel Caches to Reduce Miss Penalty

  • The performance gap between processors and memory leads the architect to this question: Should I make the cache faster to keep pace with the speed of processors, or make the cache larger to overcome the widening gap between the processor and main memory?
    • One answer is, do both. Adding another level of cache between the original cache and memory simplifies the decision.
    • The first-level cache can be small enough to match the clock cycle time of the fast processor. Yet the second-level cache can be large enough to capture many accesses that would go to main memory, thereby lessening the effective miss penalty.

A Typical Memory Hierarchy


Multilevel Caches in Multicore Architectures


This gives rise to the cache-coherence problem, which is covered later.


It Complicates Performance Analysis

  • Average memory access time = $\textrm{Hit time}_{L1} + \textrm{Miss rate}_{L1} \times (\textrm{Hit time}_{L2} + \textrm{Miss rate}_{L2} \times \textrm{Miss penalty}_{L2})$ (see the snippet below)
    • Local miss rate: the miss rate measured relative to the accesses that reach a given cache, e.g., $\textrm{Miss rate}_{L1}$ and $\textrm{Miss rate}_{L2}$.
    • Global miss rate: for the first-level cache it is still just $\textrm{Miss rate}_{L1}$, but for the second-level cache it is $\textrm{Miss rate}_{L1} \times \textrm{Miss rate}_{L2}$.
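The two-level formula in code, with hypothetical numbers (the 1/10/200-cycle latencies and the miss rates below are illustrative, not from the text):

```python
def two_level_amat(hit_l1, local_miss_l1, hit_l2, local_miss_l2, penalty_l2):
    """AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * MissPenalty_L2)."""
    return hit_l1 + local_miss_l1 * (hit_l2 + local_miss_l2 * penalty_l2)

# 1-cycle L1 with 4% local misses; 10-cycle L2 with 50% local misses; 200-cycle memory.
# Global L2 miss rate = 0.04 * 0.5 = 2% of all accesses.
print(two_level_amat(1, 0.04, 10, 0.5, 200))  # 1 + 0.04 * (10 + 100) = 5.4
```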

5. Giving Priority to Read Misses over Writes to Reduce Miss Penalty

e.g., let reads complete before earlier writes in the write buffer

write-through cache

  • With a write-through cache the most important improvement is a write buffer of the proper size. Write buffers do complicate memory accesses because they might hold the updated value of a location needed on a read miss.
  • Assume a direct-mapped, write-through cache that maps 512 and 1024 to the same block, and a four-word write buffer that is not checked on a read miss. Will the value in R2 always be equal to the value in R3?
    • Not necessarily; it depends on when the pending data in the write buffer is written back to memory:
SW R3, 512(R0)		; M[512] ← R3 (cache index 0) (the write goes into the write buffer)
LW R1, 1024(R0)		; R1 ← M[1024] (cache index 0) (block 1024 replaces block 512 in the cache)
LW R2, 512(R0)		; R2 ← M[512] (cache index 0) (may see stale memory if the buffer has not drained)

Solve the Problem

  • The simplest way is for the read miss to wait until the write buffer is empty.
  • The alternative is to check the contents of the write buffer on a read miss, and if there are no conflicts and the memory system is available, let the read miss continue. (If there is a conflict, a more aggressive option is to forward the data directly from the write buffer; see the sketch below.)
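A minimal sketch of the checking approach (dict-based write buffer; the names are mine):

```python
def read_on_miss(addr, write_buffer, read_memory):
    """On a read miss, check the write buffer first: if the address has a
    pending write, forward that value; otherwise read from memory."""
    if addr in write_buffer:
        return write_buffer[addr]   # aggressive option: forward the pending write
    return read_memory(addr)

buf = {512: 0xAB}                   # SW R3, 512(R0) is still buffered
print(read_on_miss(512, buf, lambda a: 0))    # 171 (0xAB), from the buffer
print(read_on_miss(1024, buf, lambda a: 0))   # 0, from memory
```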

write-back cache

  • The cost of writes by the processor in a write-back cache can also be reduced. Suppose a read miss will replace a dirty memory block. Instead of writing the dirty block to memory, and then reading memory, we could copy the dirty block to a buffer, then read memory, and then write memory. This way the processor read, for which the processor is probably waiting, will finish sooner.

6. Avoiding Address Translation during Indexing of the Cache to Reduce Hit Time

  • Cache must cope with the translation of a virtual address from the processor to a physical address to access memory.

virtual caches vs. physical caches

  • Using virtual addresses for the cache eliminates address translation time from a cache hit. However, each process has its own virtual address space, so different virtual addresses may refer to the same physical address. There are two solutions:
    • (1) One solution: when a process is switched, the virtual addresses now refer to different physical addresses, requiring the cache to be flushed.
    • (2) The alternative is to increase the width of the cache address tag with a process-identifier tag (PID).

In the end, virtual caches never became popular, because of the cache-consistency problems they create.

7. Victim Cache to Reduce Miss Rate

  • 基本思想:在 Cache 和它从下一级存储器调数据的通路之间设置一个全相联的小 Cache,用于存放被替换出去的块(称为 Victim),以备重用
    • 对于减小冲突失效很有效,特别是对于小容量的直接映象数据 Cache,作用尤其明显
    • 例如,项数为 4 的 Victim Cache: 使 4KB Cache 的冲突失效减少 20%~90%
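A toy victim cache to make the flow concrete (FIFO replacement inside the buffer is my simplification; the class and method names are mine):

```python
from collections import deque

class VictimCache:
    """Small fully associative buffer holding recently evicted blocks."""
    def __init__(self, entries=4):
        self.entries = entries
        self.blocks = deque()           # (tag, data) pairs

    def insert(self, tag, data):
        """Called when the main cache evicts a block."""
        if len(self.blocks) == self.entries:
            self.blocks.popleft()       # oldest victim falls out
        self.blocks.append((tag, data))

    def lookup(self, tag):
        """Checked on a main-cache miss; a hit lets the block be swapped back."""
        for t, data in self.blocks:
            if t == tag:
                return data
        return None
```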

Summary of Basic Cache Optimization

  • No optimization in this figure helps more than one category.

+ meaning that the technique improves the factor, – meaning that it hurts that factor

