Simultaneous Multithreading (SMT)

  • Reference: Computer Architecture (6th Edition)

Contents

  • Limits to ILP and Thread Level Parallelism
    • Limits to ILP
    • Performance beyond single thread ILP
  • Simultaneous Multithreading
    • Multithreading
    • Do both ILP and TLP
    • Simultaneous Multithreading (SMT)
    • Hyper-Threading Technology

Limits to ILP and Thread Level Parallelism

ILP: Instruction level parallelism

Advances in exploiting ILP appear to be coming to an end

  • Exploiting ILP was the primary focus of processor designs for about 20 years starting in the mid-1980s. For the first 15 years, we saw a progression of successively more sophisticated schemes for pipelining, multiple issue, dynamic scheduling and speculation.
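  • Since 2000, designers have focused primarily on optimizing designs or trying to achieve higher clock rates without increasing issue rates (and on larger caches to improve performance). This era of advances in exploiting ILP appears to be coming to an end.
    • Because of the data and control dependences in program code, the instruction-level parallelism that can be extracted from a single thread is limited. Pushing out-of-order execution and branch prediction ever harder to extract that limited ILP makes processor complexity and power consumption rise sharply, and the gain is sometimes not worth the cost
    • ILP by itself also no longer matches the way application workloads are changing
      [Figure 1]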

Limits to ILP

  • How much ILP is available using existing mechanisms with increasing HW budgets?
    • Theoretically: Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
    • However, it is unlikely that such advances, when coupled with realistic hardware, will overcome these limits in the near future

Assumptions for ideal/perfect machine to start:

We first assume a perfect machine model, then study the limits of ILP by observing how much parallelism can be exploited under those assumptions.

  1. Register renaming – infinite virtual registers ⇒ all register WAW & WAR hazards are avoided
  2. Branch prediction, jump prediction (returns, case statements) – perfect; no mispredictions ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
  3. Memory-address alias analysis – perfect (no memory hazards): all addresses are known, and a load can be moved before a store provided the addresses are not equal ⇒ assumptions 1 & 3 eliminate all but RAW dependences
  4. Also: perfect caches; 1-cycle latency for all instructions (including FP *, /); unlimited instructions issued per clock cycle
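
Under these assumptions, the available ILP of a trace is simply the instruction count divided by the length of its longest RAW dependence chain. The following is a minimal sketch of that calculation, not the methodology of the study itself; the function name `ideal_ilp` and the `(dest, sources)` trace encoding are assumptions made for illustration.

```python
def ideal_ilp(instructions):
    """Return (ilp, critical_path_cycles) for a trace on the idealized machine.

    Each instruction is a (dest, sources) pair of register names.
    Only true (RAW) dependences constrain scheduling: every instruction
    takes one cycle and any number of instructions may issue per cycle.
    WAW and WAR hazards are ignored, as if renaming were unlimited.
    """
    if not instructions:
        return 0.0, 0
    ready_cycle = {}  # register -> cycle after which its newest value exists
    depth = 0
    for dest, sources in instructions:
        # An instruction can start once all of its source values exist.
        start = max((ready_cycle.get(s, 0) for s in sources), default=0)
        finish = start + 1
        ready_cycle[dest] = finish  # later readers see this newest definition
        depth = max(depth, finish)
    return len(instructions) / depth, depth


# Hypothetical six-instruction trace.
trace = [
    ("r1", ["r2", "r3"]),
    ("r4", ["r1"]),        # RAW on r1
    ("r5", ["r6"]),
    ("r1", ["r7"]),        # WAW on r1: removed by renaming
    ("r8", ["r1", "r5"]),  # RAW on the second definition of r1
    ("r9", ["r4", "r8"]),
]
ilp, depth = ideal_ilp(trace)
print(f"critical path = {depth} cycles, ILP = {ilp:.1f}")
```

Each restriction introduced later (finite window, real branch prediction, finite renaming registers, imperfect alias analysis) can only lengthen the schedule relative to this bound.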

Upper Limit to ILP: Ideal Machine

[Figure 2]

POWER5: the real machine used for comparison

  • The POWER5 is a dual-core microprocessor developed and fabricated by IBM that was first presented in 2003. Its principal improvements are support for simultaneous multithreading (SMT) and an on-die memory controller.
    [Figure 3]

I: instruction cache; D: data cache


Window Size

  • The set of instructions that is examined for simultaneous execution is called the window (the processor searches the window for instructions that can issue together)
  • The window size is determined by the cost of determining whether n issuing register-register instructions have any register dependences among them. In theory, this takes about O(n² − n) comparisons
    • Window size is limited by the required storage, the number of comparisons (O(n² − n)), and a limited issue rate
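
To see where the n² − n figure comes from, the sketch below counts the comparisons directly. It assumes each register-register instruction has two source registers and that every source must be compared against the destination of each earlier instruction in the group; the function name `dependence_comparisons` is hypothetical.

```python
def dependence_comparisons(n, sources_per_instruction=2):
    """Count register comparisons needed to check n issuing instructions.

    Instruction i (0-based) compares each of its sources against the
    destinations of the i instructions ahead of it in the group.
    """
    total = 0
    for i in range(n):
        total += sources_per_instruction * i
    return total


for n in (4, 50, 2048):
    count = dependence_comparisons(n)
    # With two sources per instruction the sum equals n^2 - n.
    print(n, count, count == n * n - n)
```

The quadratic growth is why a window of a few thousand instructions is out of reach for real hardware: at n = 2048 the count is already over four million comparisons.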

The following sections bring the perfect model back toward reality, one step at a time.

More Realistic HW: Window Impact

[Figure 4] [Figure 5]

We therefore choose a relatively large window size of 2048 and set the number of instructions issued per clock to 64 (as the figures above show, with a window size of 2048 no program issues more than 64 instructions per clock cycle).


More Realistic HW: Branch Impact

[Figure 6] [Figure 7] [Figure 8]

We therefore choose the tournament predictor, which performs best, as the branch prediction scheme.


More Realistic HW: Renaming Register Impact (N int + N fp)

[Figure 9] [Figure 10]

We therefore choose 256, the best-performing option, as the number of renaming registers.


More Realistic HW: Memory Address Alias Impact

[Figure 11] [Figure 12]


Realistic HW: Window Impact

[Figure 13]

Even 256 renaming registers are not very realistic today, so we restrict the number further to 64. A window size of 2048 is equally unrealistic, so the next step compares the impact of the window size.

[Figure 14]

  • As the figure shows, window size has the largest effect on issue rate, but it is constrained by several factors:
    • The practical implications of very wide issue widths on clock rate, logic complexity, and power may be the most important limitation on exploiting ILP. These are not laws of physics, just practical limits for today that research may yet overcome

Performance beyond single thread ILP

Thread Level Parallelism (TLP)

  • Some applications have much higher natural parallelism ⇒ explicit Thread Level Parallelism or Data Level Parallelism
    • Thread Level Parallelism (MIMD)
      • ILP exploits implicit parallel operations within a loop or straight-line code segment.
      • TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
    • Data Level Parallelism (SIMD): Perform identical operations on lots of data

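Multithreading techniques

  • Modern microprocessors mostly use hardware multithreading to exploit the thread-level parallelism between threads
  • Systems with multithreading capability include:
    • Symmetric multiprocessors (SMP): multiple CPUs on one motherboard
    • Chip multiprocessors (CMP): multiple cores in one CPU
    • Chip-level multithreading (CMT = CMP + MT): multicore plus multithreading
    • Simultaneous multithreading (SMT): multithreading within a single existing processor core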

Simultaneous Multithreading

Multithreading

  • Multithreading: multiple threads share the functional units of one processor by overlapping their execution
    • The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
    • Memory is shared through the virtual memory mechanisms, which already support multiple processes
    • HW support for fast thread switching (0.1 to 10s of clocks), much faster than a full process switch (≈ 100s to 1000s of clocks)

When switch?

  • fine grain: Switches between threads on each instruction, causing the execution of multiple threads to be interleaved
  • coarse grain: Switches threads only on costly stalls, such as L2 cache misses. When a thread is stalled, another thread can be executed

Fine-Grained Multithreading

  • CPU must be able to switch threads every clock. Usually done in a round-robin fashion (time slices rotate among the threads), skipping any stalled threads. (In practice there may not be that many threads available to schedule.)
    • Advantage is it can hide both short and long stalls, since instructions from other threads executed when one thread stalls
    • Disadvantage is it slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads
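
The round-robin policy with stall skipping can be sketched in a few lines. This is a toy model under stated assumptions: one instruction issues per cycle, stalls are given up front as sets of cycle numbers, and the name `fine_grained_schedule` is hypothetical.

```python
def fine_grained_schedule(stalled, cycles):
    """Pick one thread per cycle, round-robin, skipping stalled threads.

    stalled[t] is the set of cycles in which thread t cannot issue.
    Returns the thread id chosen in each cycle, or None if all are stalled.
    """
    num_threads = len(stalled)
    schedule = []
    next_thread = 0
    for cycle in range(cycles):
        chosen = None
        for offset in range(num_threads):
            candidate = (next_thread + offset) % num_threads
            if cycle not in stalled[candidate]:
                chosen = candidate
                break
        schedule.append(chosen)
        if chosen is not None:
            # Resume the rotation after the thread that just issued.
            next_thread = (chosen + 1) % num_threads
    return schedule


# Thread 0 stalls in cycles 2-3, thread 2 stalls in cycle 0.
print(fine_grained_schedule([{2, 3}, set(), {0}], cycles=6))
```

In this run thread 0's two-cycle stall costs no issue cycles because the other threads fill in, which is the advantage described above. Thread 0 itself still issues in only two of the six cycles, which is the disadvantage.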

Coarse-Grained Multithreading

  • Advantages:
    • Relieves need to have very fast thread-switching
    • Doesn’t slow down thread, since instructions from other threads issued only when the thread encounters a costly stall
  • Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
    • Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen. The new thread must fill the pipeline before its instructions can complete.
    • It is better suited to reducing the penalty of high-cost stalls, where the pipeline refill time << stall time
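
A back-of-the-envelope model makes the trade-off concrete. The sketch below assumes that another thread is always ready, that a switch hides the whole stall at the price of one pipeline refill, and that the hardware switches only when the stall reaches a threshold; the function name and all cycle counts are hypothetical.

```python
def cycles_lost(stall_cycles, refill_cycles, switch_threshold):
    """Cycles without useful work for one stall under coarse-grained MT.

    Long stalls trigger a thread switch, so only the refill is lost.
    Short stalls do not trigger a switch, so the full stall is lost.
    """
    if stall_cycles >= switch_threshold:
        return refill_cycles
    return stall_cycles


REFILL = 10      # assumed pipeline start-up cost, in cycles
THRESHOLD = 20   # assumed minimum stall length that triggers a switch

for stall in (5, 15, 200):
    print(stall, cycles_lost(stall, REFILL, THRESHOLD))
```

With these numbers a 200-cycle miss costs only 10 cycles, while the 5- and 15-cycle stalls are paid in full, matching the point that coarse-grained switching pays off only when the refill time is much smaller than the stall.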

Do both ILP and TLP

  • TLP and ILP exploit two different kinds of parallel structure in a program. Could a processor oriented toward ILP also exploit TLP?
    • Functional units are often idle in data path designed for ILP because of either stalls or dependences in the code.
    • Could the TLP be used as a source of independent instructions that might keep the processor busy during stalls? Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?

Do both ILP and TLP …

[Figure 15]

8-way superscalar (8-issue)


Commentary

  • Right balance of ILP and TLP is unclear today
    • Perhaps right choice for server market, which can exploit more TLP, may differ from desktop, where single-thread performance may continue to be a primary requirement

Simultaneous Multithreading (SMT)

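  • Simultaneous multithreading is a refinement of multithreading that exploits TLP and ILP at the same time on a multiple-issue, dynamically scheduled processor
    • The key insight is that a dynamically scheduled processor already has many of the HW mechanisms needed to support multithreading (multiple issue, dynamic scheduling, etc.).
    • It only requires adding a per-thread renaming table and keeping separate PCs. Independent commitment can be supported by logically keeping a separate reorder buffer for each thread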

HW mechanisms supporting multithreading

  • Large set of virtual registers that can be used to hold the register sets of independent threads
  • Register renaming provides unique register identifiers (each physical register belongs to exactly one thread), so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  • Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW
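
The idea of per-thread renaming tables over a shared physical register pool can be sketched as follows. This is an illustrative model only: the class `SharedRenamer` and its methods are hypothetical, and freeing physical registers at commit is omitted.

```python
from collections import deque


class SharedRenamer:
    """Per-thread rename tables drawing from one shared physical register pool."""

    def __init__(self, num_physical, num_threads):
        self.free = deque(range(num_physical))
        self.tables = [{} for _ in range(num_threads)]

    def _allocate(self):
        if not self.free:
            raise RuntimeError("out of physical registers")
        return self.free.popleft()

    def _lookup(self, thread, reg):
        # First use of an architectural register maps it to a fresh
        # physical register holding that thread's initial value.
        table = self.tables[thread]
        if reg not in table:
            table[reg] = self._allocate()
        return table[reg]

    def rename(self, thread, dest, sources):
        """Return (physical_dest, physical_sources) for one instruction."""
        phys_sources = [self._lookup(thread, s) for s in sources]
        phys_dest = self._allocate()
        self.tables[thread][dest] = phys_dest
        return phys_dest, phys_sources


renamer = SharedRenamer(num_physical=8, num_threads=2)
a = renamer.rename(0, "r1", ["r2"])  # thread 0: r1 = f(r2)
b = renamer.rename(1, "r1", ["r2"])  # thread 1 uses the same names
c = renamer.rename(0, "r3", ["r1"])  # thread 0 reads its own r1
print(a, b, c)
```

Both threads write architectural register `r1`, yet they receive different physical registers, so their instructions can share the datapath without interfering. The shared free list also shows why the physical register file becomes a contended resource as threads are added.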

Multithreaded Categories

[Figure 16]

Issue slots: i.e., the different functional units
Multiprocessing: dual-core; each core runs one thread
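
The difference in issue-slot utilization between a single-threaded superscalar and SMT can be illustrated with a toy count. The model assumes each thread offers a fixed number of ready, independent instructions per cycle regardless of what issued earlier, which real hardware does not guarantee; the function names and the sample numbers are hypothetical.

```python
def slots_used_single(ready, width):
    """Issue slots filled when one thread runs alone on a width-wide core."""
    return sum(min(width, r) for r in ready)


def slots_used_smt(ready_per_thread, width):
    """Issue slots filled when all threads may issue in the same cycle."""
    return sum(min(width, sum(cycle)) for cycle in zip(*ready_per_thread))


WIDTH = 8
thread_a = [3, 0, 2, 1]  # ready instructions per cycle
thread_b = [2, 4, 0, 6]
total = WIDTH * len(thread_a)

print("A alone:", slots_used_single(thread_a, WIDTH), "of", total)
print("B alone:", slots_used_single(thread_b, WIDTH), "of", total)
print("SMT A+B:", slots_used_smt([thread_a, thread_b], WIDTH), "of", total)
```

In this example SMT fills slots that are empty both within a cycle (when a thread has too little ILP) and across whole cycles (when a thread is stalled).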


Design Challenges in SMT

  • SMT only makes sense in a fine-grained implementation, and running several threads of equal priority concurrently inevitably affects the execution time of each individual thread
    • This effect can be reduced by designating a preferred thread, which the processor runs whenever possible. Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
  • Larger register file needed to hold multiple contexts
  • Not affecting clock cycle time (per-cycle overhead must stay low), especially in:
    • Instruction issue - more candidate instructions need to be considered
    • Instruction completion - choosing which instructions to commit may be challenging
  • Ensuring that cache and TLB conflicts generated by SMT do not significantly degrade performance

Example: Changes in Power 5 to support SMT

Power5 adds SMT support on top of the Power4 design.

[Figure 17]

  • Why only 2 threads?
    • With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.
  • The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support
    • Increased associativity of the L1 instruction cache and the instruction address translation buffers. Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches (to reduce cache conflicts)
    • Increased the number of virtual registers from 152 to 240
    • Added per thread load and store queues. Increased the size of several issue queues
    • Added separate instruction prefetch and buffering per thread

Hyper-Threading Technology

  • Hyper-Threading (超线程) Technology is Intel’s proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations performed on x86 microprocessors.
    [Figure 18]

A four-core, eight-thread Intel processor

[Figure 19]

The so-called logical cores are the result of Hyper-Threading.
