This write-up is compiled from the COMP5201 lecture notes and David Patterson and John Hennessy, Computer Organization and Design (5th Edition), together with some of my own understanding, organized here for review. (Do not repost without permission.)
Design for Moore's Law: design chips for the transistor growth that Moore's Law predicts.
Use Abstraction to Simplify Design: use abstractions to characterize the design at different levels of representation, keeping the levels separate from one another.
Make the Common Case Fast
Making the common case fast will tend to enhance performance better than optimizing the rare case. Optimize the most frequent case first; the rare cases may not be optimal, but they do not dominate overall performance.
Performance via Parallelism
Improve performance through parallelism.
Performance via Pipelining
Pipelining keeps the CPU's hardware fully utilized.
Performance via Prediction
Predict the data that will be needed in the future; this exploits temporal locality and spatial locality.
Hierarchy of Memories
Multiple levels of memory: hard disk, main memory, multiple levels of cache, registers, CPU.
Dependability via Redundancy
Computers not only need to be fast; they need to be dependable. Since any physical device can fail, we make systems dependable by including redundant components that can take over when a failure occurs and to help detect failures.
A high-level language program (C/C++) is transformed into assembly language (e.g. RISC-V, MIPS) by a compiler (gcc, g++); assembly language is transformed into a binary machine-language program by an assembler.
The five classic components of a computer are input, output, memory, datapath, and control, with the last two sometimes combined and called the processor.
Amdahl's Law: consider a program with one portion that is perfectly sequential and another, perfectly parallel portion that can be made as parallel as we like. The parallel portion's execution time shrinks to its execution time divided by the number of parallel units; the sequential portion's execution time is unchanged.
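The speedup this implies can be sketched as follows (a minimal illustration; `p` is the parallel fraction and `n` the number of processors):

```python
def speedup(p, n):
    # Amdahl's Law: the sequential fraction (1 - p) keeps its full time,
    # while the parallel fraction p is divided across n processors
    return 1.0 / ((1.0 - p) + p / n)

# a 90%-parallel program on 10 processors
print(speedup(0.9, 10))  # ≈ 5.26, far below the 10x a fully parallel program would get
```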
The vocabulary of commands understood by a given architecture; different computer architectures have different instruction sets.
RISC-V, developed by UC Berkeley starting in 2010
MIPS is an elegant example of the instruction sets designed since the 1980s.
In several respects, RISC-V follows a similar design.
The Intel x86 originated in the 1970s, but still today powers both the PC and the Cloud of the post-PC era.
stored-program concept: The idea that instructions and data of many types can be stored in memory as numbers and thus be easy to change, leading to the stored-program computer
Unsigned Number: interpret the bits directly as a base-2 natural number.
Signed Number: a "two's-complement number"
Most significant bit (leftmost) is 0 --> positive number
Most significant bit (leftmost) is 1 --> negative number
Example: In a 4-bit register, using two’s complement semantics, we have the following interpretations:
0000 = 0, 0001 = 1, 0010 = 2, 0011 = 3,
0100 = 4, 0101 = 5, 0110 = 6, 0111 = 7,
1000 = -8, 1001 = -7, 1010 = -6, 1011 = -5,
1100 = -4, 1101 = -3, 1110 = -2, and 1111 = -1.
To negate a two's-complement number: 1. flip every bit; 2. add one.
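The two steps can be sketched in code (a minimal sketch at the 4-bit width of the register example above):

```python
def negate(x, width=4):
    mask = (1 << width) - 1
    # step 1: flip every bit; step 2: add one; the mask keeps the result in 'width' bits
    return ((x ^ mask) + 1) & mask

print(format(negate(0b0011), '04b'))  # 1101, i.e. -3 in 4-bit two's complement
```

Note the one asymmetric case: negating -8 in 4 bits overflows back to -8, since +8 is not representable.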
Binary expansion: expand a fractional number in powers of two, using both positive and negative exponents, e.g. 0.75 = 2^{-1} + 2^{-2}.
An (m+n)-digit radix-r fixed-point number has m whole digits (before the radix point) and n fractional digits (after it), and represents numbers from 0 to r^m - r^-n in increments of r^-n. E.g., a (2+3)-bit binary fixed-point number:
2.375 = (1 * 2^1) + (0 * 2^0) + (0 * 2^-1) + (1 * 2^-2) + (1 * 2^-3) = (10.011).
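The positional expansion can be checked mechanically (a small sketch; the digit string and the (m, n) split are as defined above):

```python
def fixed_point_value(digits, m, n, r=2):
    # digits: m whole digits followed by n fractional digits, radix r;
    # digit i carries weight r^(m - 1 - i), down to r^-n for the last digit
    assert len(digits) == m + n
    return sum(int(d, r) * r ** (m - 1 - i) for i, d in enumerate(digits))

print(fixed_point_value("10011", 2, 3))  # 2.375, matching (10.011) above
```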
Blackboard notation: similar to scientific notation
In standard-computer bit patterns, we drop the leading "1." of the normalized significand.
three parts when representing a floating point number: sign (one bit), exponent(two’s complement, e.g. 4 bits), fractional part (the rest of bits)
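This class format (not IEEE 754) can be sketched as follows; the 4-bit exponent and 4-bit fraction widths are illustrative assumptions, and zero is not handled:

```python
def encode(value, exp_bits=4, frac_bits=4):
    # sign | two's-complement exponent | fraction with the leading "1." dropped
    # assumes value != 0 and that the exponent fits in exp_bits
    sign = 0 if value > 0 else 1
    mag = abs(value)
    exp = 0
    while mag >= 2:          # normalize down to 1.xxx
        mag /= 2
        exp += 1
    while mag < 1:           # normalize up to 1.xxx
        mag *= 2
        exp -= 1
    exp_field = exp & ((1 << exp_bits) - 1)           # two's-complement bit pattern
    frac_field = round((mag - 1) * (1 << frac_bits))  # drop the hidden leading 1
    return f"{sign}|{exp_field:0{exp_bits}b}|{frac_field:0{frac_bits}b}"

print(encode(0.75))   # 0|1111|1000   since 0.75 = 1.1 * 2^-1
print(encode(2.375))  # 0|0001|0011   since 2.375 = 1.0011 * 2^1
```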
A machine instruction for an arithmetic/logic operation specifies an opcode, one or more source operands, and, usually, one destination register.
There are three instruction formats in MIPS. 1. Register or R-type instructions operate on two source registers rs, rt and store the result in a destination register rd; 32 bits in total. Note: rs, rt, rd each occupy only 5 bits in the instruction, holding the corresponding register's number; for an addition, the actual operand values are stored in the corresponding registers.
R | opcode | rs | rt | rd | (shamt + funct)
---|---|---|---|---|---
width | 6 bits | 5 bits | 5 bits | 5 bits | 11 bits
mul.d f4, f2, f6
The contents of f2 and f6 are read, the result is placed into f4.
2. Immediate or I-type instructions. Note: the operation is performed on rs and the immediate; the result is stored in rt.
I | opcode | rs | rt | immediate
---|---|---|---|---
width | 6 bits | 5 bits | 5 bits | 16 bits
l.d f6, -24(r2)
Add the immediate byte-offset -24 to register r2 to determine a memory address; then load the double-precision floating-point number (64 bits) from that memory location into floating-point register f6.
s.d f6, -24(r2)
Add the immediate byte-offset -24 to register r2 to determine a memory address; then store the double-precision floating-point number (64 bits) in floating-point register f6 into that memory location.
bne r1, r2, loop
Compare register r1 and register r2. If they are not equal, add the word-offset derived from the immediate 'loop' to the current value of the PC to form the new PC.
3. Jump or J-type instructions. Note: J-type instructions cause an unconditional transfer of control to the instruction at the specified address. The target is a word address (as opposed to a byte address), so two zeros are appended on the right; since it is a word address, the jump is counted in units of 1 word = 4 bytes = 32 bits.
J | opcode | partial jump-target address
---|---|---
width | 6 bits | 26 bits
j done
Addressing mode is the method by which the location of an operand is specified within an instruction.
1. Immediate addressing: the operand is given in the instruction itself. daddui r1, r1, #-8
2. Register addressing: the operand is taken from, or the result placed into, a specified register. mul.d f4, f2, f6
3. Base addressing: the operand is in memory and its location is computed by adding a byte-address offset (16-bit signed integer) to the contents of a specified base register. l.d f6, -24(r2); s.d f6, 24(r2)
4. PC-relative addressing: the same as base addressing, except that the "base" register is always the PC, and a hardware trick extends the signed-integer offset to 18 bits: the word offset is multiplied by 4 to obtain a byte-address offset, then sign-extended to 32 bits. beq r1, r2, found; bne r1, r2, loop
5. Absolute addressing: the addressing mode for unconditional branches is different because we don't really have a "base" register. j done The 26-bit natural number is multiplied by 4 to give a 28-bit natural number, and the front of 'done' is padded with the four leading bits of the PC, giving a 32-bit (word) address.
Notations for class COMP5201
How C++ and Java programs work: compiling C and interpreting Java. To be continued.
Addition and subtraction work like decimal digit-by-digit addition and subtraction. To subtract, add the two's complement of the subtrahend; the carry out of the most significant bit is simply discarded, and the remaining bits give the correct answer. Example (6-bit): 7 - 6 = 000111 + 111010 = 1|000001; discarding the carry leaves 000001 = 1.
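The worked example can be reproduced in code (a sketch at 6-bit width):

```python
WIDTH = 6
MASK = (1 << WIDTH) - 1  # 111111

def sub(a, b):
    neg_b = ((b ^ MASK) + 1) & MASK  # two's complement of b
    # the carry out of the top bit lands outside WIDTH bits; the mask discards it
    return (a + neg_b) & MASK

print(format(sub(0b000111, 0b000110), '06b'))  # 000001, i.e. 7 - 6 = 1
```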
Multiplication is a bit trickier. There isn’t one way to do it. The simplest to explain corresponds to what we learned in lower school (assume positive numbers):
put multiplier in 32-bit register
put multiplicand in 64-bit register
initialize 64-bit product to zero
loop: test lsb of multiplier
if 1, add multiplicand to product
shift multiplicand register 1-bit left
shift multiplier register 1-bit right
if not done, goto loop
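The loop above translates almost line for line into code (a sketch for non-negative integers; Python integers stand in for the 32- and 64-bit registers):

```python
def multiply(multiplicand, multiplier):
    product = 0                      # 64-bit product register, initialized to zero
    while multiplier != 0:
        if multiplier & 1:           # test lsb of multiplier
            product += multiplicand  # if 1, add multiplicand to product
        multiplicand <<= 1           # shift multiplicand register 1 bit left
        multiplier >>= 1             # shift multiplier register 1 bit right
    return product

print(multiply(6, 7))  # 42
```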
Summary: multiplication hardware simply shifts and adds, as derived from the paper-and-pencil method learned in grammar school. Compilers even use shift instructions for multiplications by powers of 2.
To be continued… (textbook)
In other words: 1. All operations on data apply to data in registers. 2. The only operators that affect memory are loads (which move data from memory to a register) and stores (which move data from a register to memory). 3. The instruction formats are few in number, with all instructions typically being one size.
Pipeline Consider a computer system that takes in operations on the left, computes them, and then pushes out results on the right. In a pipeline, we may push in new operations on the left long before getting the results of previous operations pushed out on the right.
Three parameters:
1. Peak input bandwidth
2. Operation latency
3. Pipeline occupancy (concurrency)
When the pipeline reaches its equilibrium (steady) state, concurrency = bandwidth * latency.
The f d x m w instruction-execution pipeline is a special case: at equilibrium, input bandwidth = 1 instruction per cycle, output bandwidth = 1 instruction per cycle, latency = 5 cycles, concurrency = 5 instructions in flight.
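The equilibrium relation is just Little's law; a trivial check for the five-stage case:

```python
def concurrency(bandwidth, latency):
    # operations in flight = input rate * time each operation spends in the pipe
    return bandwidth * latency

print(concurrency(1, 5))  # 5 instructions in flight at equilibrium
```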
Boxes and Latches
five boxes: f d x m w
four latches, one between each pair of adjacent boxes: the f/d, d/x, x/m, and m/w latches.
Instruction from I-cache. Data from D-cache. Content of register from register file.
Process of execution 1. f-box:
Memory Wall Memory speed does not improve fast enough to keep pace with processor improvements.
Power Wall An increase in performance causes an increase in power density, resulting in high chip temperature. High temperature slows the chip down and can even melt it.
To make pipeline work, each box is followed immediately on its right by a set of (nonISA) pipeline registers, which is called a “pipeline latch”. The basic requirement is this: Prior to the end of a clock cycle, all the results from a given stage must be stored in the pipeline latch to its right, in order that these values can be preserved across clock cycles, and used as inputs to the next stage at the start of the next clock cycle.
Memory hierarchy
A structure that uses multiple levels of memories; as the distance from the processor increases, the size of the memories and the access time both increase.
temporal locality
is present in a program when the code and data used in the recent past is highly likely to be reused in the near future
spatial locality
is present in a program when the code and data currently in use is highly likely to be followed by the use, in the near future, of code and data at nearby memory locations
block or line
The minimum unit of information that can be either present or not present in a cache. Eg. 16 bytes as a memory block, or called line.
The amount of data copied from memory to the cache at one time is called a cache line.
cache frame
A cache frame contains a cache line (the content of a memory block), a tag field, and a valid bit.
hit rate / miss rate
The fraction of memory accesses found (hit) or not found (miss) in a level of the memory hierarchy.
hit time
The time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.
miss penalty
The time to bring a block from a lower-level cache (or memory) into an upper-level cache.
In other words, the time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to the other, insert it in the level that experienced the miss, and then pass the block to the requestor.
set
In an m-way set-associative cache, a set contains m cache frames.
SRAM Technology
SRAM is short for Static Random Access Memory. The levels closer to the CPU (the caches) use SRAM. SRAM does not need to be refreshed, so its access time is very close to its cycle time. It uses six to eight transistors per bit to prevent the information from being disturbed when read. It is much more expensive than DRAM.
DRAM Technology
Dynamic RAM: the value is kept as charge on a capacitor, so it cannot be kept indefinitely and must be refreshed periodically.
The fastest version is called Double Data Rate (DDR) SDRAM. A DDR4-3200 DRAM can do 3200 million transfers per second, which means it has a 1600-MHz clock.
Flash Memory
Flash memory is a type of electrically erasable programmable read-only memory (EEPROM).
Unlike disks and DRAM, but like other EEPROM technologies, writes can wear out flash memory bits.
Disk Memory
Cylinder, track, sector, seek time, rotational latency…
Write-through
A scheme in which writes always update both the cache and the next lower level of the memory hierarchy, ensuring that data are always consistent between the two.
Write-buffer
A queue that holds data while the data are waiting to be written to memory.
Write-back
A scheme that handles writes by updating values only to the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced.
Average memory access time
Avg_time = hit_time + miss_rate * miss_penalty
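The formula can be applied directly (the cycle counts below are hypothetical, for illustration only):

```python
def amat(hit_time, miss_rate, miss_penalty):
    # every access pays hit_time; a miss_rate fraction additionally pays miss_penalty
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit, 5% miss rate, 100-cycle miss penalty
print(amat(1, 0.05, 100))  # 6.0 cycles
```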
Reducing cache misses with a set-associative cache
In a set-associative cache, a set contains multiple frames. When the processor requests a line, it compares the request against every frame in the set to find whether the data is there. This means more parallelism, and the added associativity reduces cache misses.
Reducing miss penalty using multi-level caches
When a miss happens in the primary cache, the second-level cache is searched for the data. The second-level cache is usually larger and holds more data, so the miss penalty from the primary cache to the second-level cache is much smaller than from the primary cache to main memory.
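With two levels, the L1 miss penalty is itself the average access time of the L2 level (hypothetical cycle counts, for illustration):

```python
def amat_two_level(hit_l1, miss_l1, hit_l2, miss_l2, mem_penalty):
    # an L1 miss pays the L2 hit time, and an L2 miss additionally pays memory
    return hit_l1 + miss_l1 * (hit_l2 + miss_l2 * mem_penalty)

# L1: 1-cycle hit, 5% miss; L2: 10-cycle hit, 20% miss; memory: 100 cycles
print(amat_two_level(1, 0.05, 10, 0.20, 100))  # 2.5 cycles
```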
Software optimization via Blocking
When delivering information, we need to ensure the reliability of the data transfer, so we first need to define failures.