

 Why Registers Are Fast and RAM Is Slow

by Mike Ash

In the previous article on ARM64, I mentioned that one advantage of the new architecture is that it has twice as many registers, allowing code to load data less often from RAM, which is much slower. Reader Daniel Hooper asks the natural question: just why is RAM so much slower than registers?


Let's start with distance. It's not necessarily a big factor, but it's the most fun to analyze. RAM is farther away from the CPU than registers are, which can make it take longer to fetch data from it.


Take a 3GHz processor as an extreme example. The speed of light is roughly one foot per nanosecond, or about 30cm per nanosecond for you metric folk. Light can only travel about four inches in the time of a single clock cycle of this processor. That means a roundtrip signal can only get to a component that's two inches away or less, and that assumes that the hardware is perfect and able to transmit information at the speed of light in vacuum. For a desktop PC, that's pretty significant. However, it's much less important for an iPhone, where the clock speed is much lower (the 5S runs at 1.3GHz) and the RAM is right next to the CPU.
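For anyone who wants to check the arithmetic, here's a quick sketch in Python. The function names are mine, invented for illustration, and the calculation assumes signals move at the speed of light in vacuum, which real hardware can't manage.

```python
# Back-of-the-envelope check of the distance figures above.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # in vacuum
CM_PER_INCH = 2.54


def light_distance_per_cycle_cm(clock_hz):
    """Distance light travels in vacuum during one clock cycle, in cm."""
    cycle_seconds = 1.0 / clock_hz
    return SPEED_OF_LIGHT_M_PER_S * cycle_seconds * 100


def max_round_trip_reach_cm(clock_hz):
    """Farthest away a component can be if a signal has to go out and
    come back within a single cycle."""
    return light_distance_per_cycle_cm(clock_hz) / 2


for name, hz in [("3GHz desktop", 3.0e9), ("1.3GHz iPhone 5S", 1.3e9)]:
    one_way = light_distance_per_cycle_cm(hz)
    reach = max_round_trip_reach_cm(hz)
    print(f"{name}: {one_way:.1f}cm ({one_way / CM_PER_INCH:.1f}in) per cycle, "
          f"round-trip reach {reach:.1f}cm ({reach / CM_PER_INCH:.1f}in)")
```

At 3GHz this comes out to just under 10cm (about four inches) per cycle, and at 1.3GHz the budget is a bit more than twice that.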


Much as we might wish it wasn't, cost is always a factor. In software, when trying to make a program run fast, we don't go through the entire program and give it equal attention. Instead, we identify the hotspots that are most critical to performance, and give them the most attention. This makes the best use of our limited resources. Hardware is similar. Faster hardware is more expensive, and that expense is best spent where it'll make the most difference.


Registers get used extremely frequently, and there aren't a lot of them. There are only about 6,000 bits of register data in an A7 (32 64-bit general-purpose registers plus 32 128-bit floating-point registers, and some miscellaneous ones). There are about 8 billion bits (1GB) of RAM in an iPhone 5S. It's worthwhile to spend a bunch of money making each register bit faster. There are literally a million times more RAM bits, and those eight billion bits pretty much have to be as cheap as possible if you want a $650 phone instead of a $6,500 phone.
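The numbers above are easy to verify. This little Python sketch counts only the two big register banks mentioned and ignores the miscellaneous registers, and it assumes 1GB means 1024^3 bytes.

```python
# Register bits in the two main register banks described above.
GENERAL_PURPOSE_BITS = 32 * 64    # 32 registers, 64 bits each
FLOATING_POINT_BITS = 32 * 128    # 32 registers, 128 bits each
register_bits = GENERAL_PURPOSE_BITS + FLOATING_POINT_BITS  # 6144

# RAM bits, assuming 1GB = 1024**3 bytes.
ram_bits = 1024**3 * 8  # 8589934592

ratio = ram_bits / register_bits
print(f"register bits: {register_bits}")
print(f"RAM bits:      {ram_bits}")
print(f"RAM has about {ratio / 1e6:.1f} million times more bits")
```

The ratio works out to roughly 1.4 million, so "a million times more" is, if anything, an understatement.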


Registers use an expensive design that can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the register hardware to push the read line to the appropriate state.


Reading a RAM bit, on the other hand, is more involved. A bit in the DRAM found in any smartphone or PC consists of a single capacitor and a single transistor. The capacitors are extremely small, as you'd expect given that you can fit eight billion of them in your pocket. This means they carry a very small amount of charge, which makes it hard to measure. We like to think of digital circuits as dealing in ones and zeroes, but the analog world comes into play here. The read line is pre-charged to a level that's halfway between a one and a zero. Then the capacitor is connected to it, which either adds or drains a tiny amount of charge. An amplifier is used to push the charge towards zero or one. Once the charge in the line is sufficiently amplified, the result can be returned.
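To get a feel for how small that nudge is, here's a rough Python model of the read. It treats the cell and the read line as two ideal capacitors sharing charge. The capacitances and the 1.2V supply are made-up, illustrative values, not figures for any real DRAM part, and the function names are hypothetical.

```python
# Illustrative values only. Real parts differ.
SUPPLY_VOLTS = 1.2
PRECHARGE_VOLTS = SUPPLY_VOLTS / 2   # halfway between a one and a zero
CELL_FARADS = 25e-15                 # assumed tiny cell capacitor
LINE_FARADS = 250e-15                # assumed much larger read line


def read_line_voltage(cell_volts):
    """Voltage on the read line after the cell capacitor is connected
    and the two share their charge."""
    total_charge = CELL_FARADS * cell_volts + LINE_FARADS * PRECHARGE_VOLTS
    return total_charge / (CELL_FARADS + LINE_FARADS)


def sense(line_volts):
    """Idealized amplifier: push the line all the way to one or zero
    depending on which way it moved from the pre-charge level."""
    return 1 if line_volts > PRECHARGE_VOLTS else 0


for stored_bit in (0, 1):
    cell_volts = SUPPLY_VOLTS if stored_bit else 0.0
    line_volts = read_line_voltage(cell_volts)
    swing_mv = (line_volts - PRECHARGE_VOLTS) * 1000
    print(f"stored {stored_bit}: line moves {swing_mv:+.1f}mV, "
          f"sensed as {sense(line_volts)}")
```

With these assumed values the line only moves by a few tens of millivolts in either direction, which is why the amplifier step is needed before anything digital can happen.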


The fact that a RAM bit is only one transistor and one tiny capacitor makes it extremely cheap to manufacture. Register bits contain more parts and thereby cost much more.


There's also a lot more complexity involved just in figuring out what hardware to talk to with RAM because there's so much more of it. Reading from a register looks like:


  1. Extract the relevant bits from the instruction.
  2. Put those bits onto the register file's read lines.
  3. Read the result.
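Those three steps can be sketched in a few lines of Python. The example decodes the register fields of an ARM64 `add x1, x1, x2` instruction (the shifted-register form of ADD, encoded as `0x8B020021`). The register file is modeled as a plain list, and the starting values are arbitrary. This is a toy: it ignores the special meaning of register 31 and everything else a real decoder does.

```python
def decode_add_registers(instruction):
    """Step 1: extract the register numbers from a 32-bit ARM64
    ADD (shifted register) instruction."""
    rd = instruction & 0x1F           # bits 0-4: destination register
    rn = (instruction >> 5) & 0x1F    # bits 5-9: first source register
    rm = (instruction >> 16) & 0x1F   # bits 16-20: second source register
    return rd, rn, rm


ADD_X1_X1_X2 = 0x8B020021  # add x1, x1, x2

# Toy register file: 32 slots, with arbitrary starting values.
register_file = [0] * 32
register_file[1] = 10
register_file[2] = 32

rd, rn, rm = decode_add_registers(ADD_X1_X1_X2)

# Steps 2 and 3: the register numbers select which slots to read.
first_operand = register_file[rn]
second_operand = register_file[rm]
register_file[rd] = first_operand + second_operand

print(f"rd=x{rd}, rn=x{rn}, rm=x{rm}, x{rd} is now {register_file[rd]}")
```

The point is that the register number comes straight out of the instruction bits and is used directly as an index. There's no translation and no lookup.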

Reading from RAM looks like:


  1. Get the pointer to the data being loaded. (Said pointer is probably in a register. This already encompasses all of the work done above!)
  2. Send that pointer off to the MMU.
  3. The MMU translates the virtual address in the pointer to a physical address.
  4. Send the physical address to the memory controller.
  5. Memory controller figures out what bank of RAM the data is in and asks the RAM.
  6. The RAM figures out which particular chunk the data is in, and asks that chunk.
  7. Step 6 may repeat a couple more times before narrowing it down to a single array of cells.
  8. Load the data from the array.
  9. Send it back to the memory controller.
  10. Send it back to the CPU.
  11. Use it!
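Here's a toy Python version of steps 2 through 7, just to show how many lookups and bit-slicing operations pile up. Everything specific in it is hypothetical: the 4KB page size, the contents of the page table, and the row/bank/column layout of the physical address are all invented for illustration. Real MMUs and memory controllers are far more involved.

```python
PAGE_SHIFT = 12                  # assumed 4KB pages
PAGE_SIZE = 1 << PAGE_SHIFT

# Hypothetical page table: virtual page number -> physical page number.
PAGE_TABLE = {0x10: 0x2A, 0x11: 0x07}

# Hypothetical physical address layout: | row | bank (3 bits) | column (10 bits) |
COLUMN_BITS = 10
BANK_BITS = 3


def mmu_translate(virtual_address):
    """Steps 2-3: translate a virtual address to a physical address."""
    virtual_page = virtual_address >> PAGE_SHIFT
    offset = virtual_address & (PAGE_SIZE - 1)
    physical_page = PAGE_TABLE[virtual_page]  # a KeyError stands in for a page fault
    return (physical_page << PAGE_SHIFT) | offset


def locate_in_ram(physical_address):
    """Steps 4-7: narrow the physical address down to a bank, row, and column."""
    column = physical_address & ((1 << COLUMN_BITS) - 1)
    bank = (physical_address >> COLUMN_BITS) & ((1 << BANK_BITS) - 1)
    row = physical_address >> (COLUMN_BITS + BANK_BITS)
    return bank, row, column


pointer = 0x10ABC
physical = mmu_translate(pointer)
bank, row, column = locate_in_ram(physical)
print(f"virtual {pointer:#x} -> physical {physical:#x} -> "
      f"bank {bank}, row {row}, column {column}")
```

Compare that with the register case, where the bits from the instruction are used directly to pick a register.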


Dealing With Slow RAM
That sums up why RAM is so much slower. But how does the CPU deal with such slowness? A RAM load is a single CPU instruction, but it can potentially take hundreds of CPU cycles to complete.


First, just how long does a CPU take to execute a single instruction? It can be tempting to just assume that a single instruction executes in a single cycle, but reality is, of course, much more complicated.


Back in the good old days, when men wore their sheep proudly and the nation was undefeated in war, this was not a difficult question to answer. It wasn't one-instruction-one-cycle, but there was at least some clear correspondence. The Intel 4004, for example, took either 8 or 16 clock cycles to execute one instruction, depending on what that instruction was. Nice and understandable. Things gradually got more complex, with a wide variety of timings for different instructions. Older CPU manuals will give a list of how long each instruction takes to execute.


Now? Not so simple.


Along with increasing clock rates, there's also been a long drive to increase the number of instructions that can be executed per clock cycle. Back in the day, that number was something like 0.1 of an instruction per clock cycle. These days, it's up around 3-4 on a good day. How does it perform this wizardry? When you have a billion or more transistors per chip, you can add in a lot of smarts. Although the CPU might be executing 3-4 instructions per clock cycle, that doesn't mean each instruction takes 1/4th of a clock cycle to execute. They still take at least one cycle, often more. What happens is that the CPU is able to maintain multiple instructions in flight at any given time. Each instruction can be broken up into pieces: load the instruction, decode it to see what it means, gather the input data, perform the computation, store the output data. Those can all happen on separate cycles.


On any given CPU cycle, the CPU is doing a bunch of stuff simultaneously:


  1. Fetching potentially several instructions at once.
  2. Decoding potentially a completely different set of instructions.
  3. Fetching the data for potentially yet another different set of instructions.
  4. Performing computations for yet more instructions.
  5. Storing data for yet more instructions.
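The difference between how long one instruction takes and how many finish per cycle is easy to see with a little arithmetic. This Python sketch models an idealized pipeline with the five pieces listed earlier, one cycle per piece, and no dependencies or stalls at all. The `width` parameter (how many instructions enter the pipeline each cycle) and the function name are my own illustrative choices.

```python
STAGES = ["fetch", "decode", "gather inputs", "compute", "store"]


def pipeline_cycles(instruction_count, width=1, stages=len(STAGES)):
    """Total cycles for an ideal pipeline with no stalls. Up to `width`
    instructions enter per cycle, and each spends one cycle per stage."""
    if instruction_count == 0:
        return 0
    groups = -(-instruction_count // width)  # ceiling division
    # The first group needs `stages` cycles to get all the way through.
    # After that, one more group finishes every cycle.
    return stages + groups - 1


for width in (1, 4):
    cycles = pipeline_cycles(1000, width=width)
    print(f"width {width}: 1000 instructions in {cycles} cycles, "
          f"{1000 / cycles:.2f} instructions per cycle, "
          f"each instruction still takes {len(STAGES)} cycles")
```

Even in this best case, every individual instruction spends five cycles in flight. The throughput comes entirely from overlapping them.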

But, you say, how could this possibly work? For example:


    add x1, x1, x2
    add x1, x1, x3

These can't possibly execute in parallel like that! You need to be finished with the first instruction before you start the second!


It's true, that can't possibly work. That's where the smarts come in. The CPU is able to analyze the instruction stream and figure out which instructions depend on other instructions and shuffle things around. For example, if an instruction after those two adds doesn't depend on them, the CPU could end up executing that instruction before the second add, even though it comes later in the instruction stream. The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions.
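Here's a toy Python scheduler that shows the idea. It takes the two adds from above plus a hypothetical third instruction, `add x4, x5, x6`, that doesn't depend on either. It only tracks "this instruction reads a register that an earlier one writes," assumes every instruction takes one cycle, and ignores every other hazard a real CPU has to handle. It's a sketch of the concept, not of how any actual CPU is built.

```python
# Each instruction: (text, destination register, source registers).
PROGRAM = [
    ("add x1, x1, x2", "x1", ["x1", "x2"]),
    ("add x1, x1, x3", "x1", ["x1", "x3"]),
    ("add x4, x5, x6", "x4", ["x5", "x6"]),  # hypothetical independent instruction
]


def find_dependencies(program):
    """For each instruction, the indices of the earlier instructions
    whose results it reads."""
    last_writer = {}
    dependencies = []
    for index, (_, dest, sources) in enumerate(program):
        dependencies.append({last_writer[r] for r in sources if r in last_writer})
        last_writer[dest] = index
    return dependencies


def schedule(program, width=4):
    """Return one list of instruction indices per cycle. An instruction
    can run once everything it depends on finished in an earlier cycle."""
    dependencies = find_dependencies(program)
    done = set()
    remaining = list(range(len(program)))
    cycles = []
    while remaining:
        ready = [i for i in remaining if dependencies[i] <= done][:width]
        cycles.append(ready)
        done.update(ready)
        remaining = [i for i in remaining if i not in ready]
    return cycles


for cycle, indices in enumerate(schedule(PROGRAM)):
    print(f"cycle {cycle}: " + "; ".join(PROGRAM[i][0] for i in indices))
```

The third instruction runs alongside the first add, ahead of the second add that precedes it in program order.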


What happens when you hit a memory load instruction? First of all, it is definitely going to take forever, relatively speaking. If you're really lucky and the value is in L1 cache, it'll only take a few cycles. If you're unlucky and it has to go all the way out to main RAM to find the data, it could take literally hundreds of cycles. There may be a lot of thumb-twiddling to be done.


The CPU will try not to twiddle its thumbs, because that's inefficient. First, it will try to anticipate. It may be able to spot that load instruction in advance, figure out what it's going to load, and initiate the load before it really starts executing the instruction. Second, it will keep executing other instructions while it waits, as long as it can. If there are instructions after the load instruction that don't depend on the data being loaded, they can still be executed. Finally, once it's executed everything it can and it absolutely cannot proceed any further without that data it's waiting on, it has little choice but to stall and wait for the data to come back from RAM.
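A crude way to picture the tradeoff is this Python sketch. It assumes the CPU can find some number of independent instructions to run while the load is outstanding, retires up to `width` of them per cycle, and stalls for whatever latency is left over. The latencies (4 cycles for an L1 hit, 200 for main RAM) are round numbers picked to match "a few cycles" and "hundreds of cycles," not measurements of any particular chip.

```python
def stall_cycles(load_latency, independent_instructions, width=4):
    """Cycles spent doing nothing while waiting for a load, in a very
    simplified model where independent work overlaps with the load."""
    busy_cycles = -(-independent_instructions // width)  # ceiling division
    return max(0, load_latency - busy_cycles)


# Assumed, illustrative latencies in cycles.
L1_HIT_LATENCY = 4
MAIN_RAM_LATENCY = 200

for name, latency in [("L1 hit", L1_HIT_LATENCY), ("main RAM", MAIN_RAM_LATENCY)]:
    stalled = stall_cycles(latency, independent_instructions=16)
    print(f"{name}: {latency} cycle load, 16 independent instructions, "
          f"{stalled} cycles stalled")
```

Sixteen independent instructions are enough to completely hide the assumed L1 hit, but they barely make a dent in a trip to main RAM.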



  1. RAM is slow because there's a ton of it.
  2. That means you have to use designs that are cheaper, and cheaper means slower.
  3. Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code.
  4. That means that the first thing a CPU does while waiting for a RAM load is run other code.
  5. If all else fails, it'll just stop and wait, and wait, and wait, and wait.

