http://www.xasun.com/article/2a/1991$3.html
1. With the Xeon 5500 series processors, Intel has moved from its traditional Symmetric
Multiprocessing (SMP) architecture to a Non-Uniform Memory Access (NUMA) architecture. In a
two-processor scenario, the Xeon
5500 series processors are connected through a serial coherency link called QuickPath
Interconnect (QPI). The QPI is capable of 6.4, 5.6 or 4.8 GT/s (gigatransfers per second),
depending on the processor model. The Xeon 5500 series integrates the memory controller within
the processor, resulting in two memory controllers in a two-socket system. Each memory
controller has three memory channels and supports DDR-3 memory. Depending on processor
model, the type of memory used, and the population of memory, memory may be clocked at
1333MHz, 1066MHz or 800MHz. Each memory channel supports up to 3 DIMMs per channel
(DPC), for a theoretical maximum of 9 DIMMs per processor or 18 per 2-socket server. (See
Figure 1 for illustration.) However, the actual maximum number of DIMMs per system is
dependent upon the system design.
The new Xeon 5500 series CPUs have moved from the former SMP architecture to a NUMA
architecture: the two CPUs no longer manage a shared pool of memory. Instead, the memory
controller is integrated into each CPU, and each CPU manages 3 channels with up to 9 DIMMs in
total; the CPUs are interconnected via QPI (which can be thought of as an internal bus). The
maximum frequency the memory can run at depends on both the CPU itself and the DIMMs.
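As a rough illustration of what the integrated, three-channel memory controller means for bandwidth, the sketch below computes the theoretical peak per socket (3 channels x 8 bytes per transfer x transfer rate). These are back-of-the-envelope ceilings, not measured numbers; sustained throughput is discussed in section 2.1.4.

#include <stdio.h>

/* Rough peak-bandwidth arithmetic for one Xeon 5500 socket:
 * 3 DDR3 channels x 8 bytes per transfer x transfer rate (MT/s).
 * Figures are theoretical maxima, not measured throughput. */
int main(void)
{
    const double channels = 3.0;
    const double bytes_per_transfer = 8.0;      /* 64-bit data path per channel */
    const double rates_mts[] = { 800.0, 1066.0, 1333.0 };

    for (int i = 0; i < 3; i++) {
        double gbs = channels * bytes_per_transfer * rates_mts[i] * 1e6 / 1e9;
        printf("%4.0f MT/s -> ~%.1f GB/s per socket (theoretical)\n",
               rates_mts[i], gbs);
    }
    return 0;
}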
2. Memory Performance
With the varied number of configurations possible in the Xeon 5500 series processor-based
systems, a number of variables emerge that influence processor/memory performance. The main
variables are memory speed, memory interleaving, memory ranks and memory population across
various memory channels and processors. Depending on the processor model and number of
DIMMs, the performance of the Xeon 5500 platform will see large memory performance
variances. We will look at each of these factors more closely in the next sections.
The factors most relevant to memory performance are the processor model, the number of DIMMs
installed per channel, the speed of the DIMMs themselves, the way memory is interleaved, and the
number of memory ranks.
2.1 Memory Speed
As mentioned earlier, the memory speed is determined by the combination of the processor
model, DIMM speed, and DIMMs per channel.
2.1.1 Processor model
The initial Xeon 5500 series processor-based offerings will be categorized into 3 bins called
Performance, Volume and Value. The 3 bins have the ability to clock memory at different
maximum speeds:
• 1333MHz (X55xx processor models)
• 1066MHz (E552x or L552x and up)
• 800MHz (E550x)
So, the processor model will limit the maximum frequency of the memory. Note: Because of the
integrated memory controllers the former front-side bus (FSB) no longer exists.
With the memory controller integrated into the CPU, the FSB no longer exists (there is no longer a
front-side bus concept, the same as in AMD processors).
2.1.2 DDR3 DIMM Speed
DDR-3 memory will be available in various sizes at speeds of 1333MHz and 1066MHz. 1333MHz
represents the maximum capability at which memory can be clocked. However, the memory will
not be clocked faster than the capability of the processor model and will be clocked appropriately
by the BIOS.
2.1.3 DIMMs per Channel (DPC)
The number and type of DIMMs and the channels in which they reside will also determine the
speed at which memory will be clocked. Table 1 describes the behavior of the platform. The table
below assumes a 1333MHz-capable processor model (X55xx). If a slower processor model is
used, then the memory speed will be the lower of the speed shown in the table and the processor
model's memory speed capability. If the DPC is not uniform across all the channels, then the system will
clock to the frequency of the slowest channel.
When the channels are populated with different numbers of DIMMs, the memory runs at different
frequencies; see the table below for details.
Table 1
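The clock-selection rule of sections 2.1.1 through 2.1.3 can be summarized as "the slowest of the three limits wins." The sketch below encodes that rule; the DPC limits used (1 DPC at 1333MHz, 2 DPC at 1066MHz, 3 DPC at 800MHz) are an assumption that matches the 6/12/18-DIMM configurations cited later, so Table 1 remains the authoritative reference for a given platform.

#include <stdio.h>

/* The effective memory speed is the minimum of the processor capability,
 * the DIMM speed, and the speed allowed by DIMMs-per-channel. The DPC
 * limits below (1->1333, 2->1066, 3->800) are an assumption for
 * illustration; consult Table 1 for the exact platform behavior. */
static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

static int effective_memory_speed(int cpu_max_mhz, int dimm_mhz, int dpc)
{
    int dpc_limit = (dpc <= 1) ? 1333 : (dpc == 2) ? 1066 : 800;
    return min3(cpu_max_mhz, dimm_mhz, dpc_limit);
}

int main(void)
{
    /* X55xx (1333-capable) with 1333MHz DIMMs at 2 DPC -> 1066MHz */
    printf("%d MHz\n", effective_memory_speed(1333, 1333, 2));
    /* E550x (800-capable) with 1066MHz DIMMs at 1 DPC -> 800MHz */
    printf("%d MHz\n", effective_memory_speed(800, 1066, 1));
    return 0;
}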
2.1.4 Low-level Performance Specifics
It is important to understand the impact of memory speed on the performance of the Xeon 5500
series platform. We will use both low-level memory tools and application benchmarks to quantify
the impact of memory speed.
The parameters that determine memory performance: latency and throughput.
Two of the key low-level metrics that are used to measure memory performance are memory
latency and memory throughput. We use a base Xeon 5500 2.93GHz, 1333MHz-capable 2-
socket system for this analysis. The memory configurations for the three memory speeds in the
following benchmarks are as follows:
• 1333MHz – 6 x 4GB dual-rank 1333MHz DIMMs
• 1066MHz – 12 x 2GB dual-rank DIMMs for 1066MHz
• 800MHz – 12 x 2GB dual-rank DIMMs clocked down to 800MHz in BIOS
Note: Memory ranks are explained in detail in section 2.3.
Table 2 below shows the unloaded latency to local memory. The unloaded latency is measured at
the application level and is designed to defeat processor prefetch mechanisms. As shown in
Table 2, the difference between the fastest and slowest speeds is
about 10%. This represents the high watermark for latency-sensitive workloads. Another
important thing to note is that this is almost a 50% decrease in memory latency when compared
to the previous generation Xeon 5400 series processor on 5000P chipset platforms.
Memory latency: the difference between the fastest and slowest memory speeds is about 10%, but
the 5500 series CPUs reduce memory latency by roughly 50% compared with the previous 5400 series.
Table 2
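For readers who want to reproduce an unloaded-latency measurement of this kind, the sketch below uses the classic pointer-chasing approach: a randomly permuted chain of pointers makes every load depend on the previous one and defeats the prefetchers, so the average time per step approaches the raw memory latency. It is an illustrative stand-in, not the tool used to produce Table 2; the working-set size and step count are arbitrary.

/* Minimal pointer-chasing latency sketch (Linux/glibc, C99). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (64 * 1024 * 1024 / sizeof(void *))   /* ~64MB working set */
#define STEPS (16 * 1024 * 1024)

int main(void)
{
    void **chain = malloc(N * sizeof(void *));
    size_t *order = malloc(N * sizeof(size_t));
    if (!chain || !order) return 1;

    /* Random permutation so successive loads are not sequential. */
    for (size_t i = 0; i < N; i++) order[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        chain[order[i]] = &chain[order[(i + 1) % N]];

    /* Chase the chain; each dereference depends on the previous one. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = &chain[order[0]];
    for (long i = 0; i < STEPS; i++) p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load-to-load latency: %.1f ns (%p)\n", ns / STEPS, (void *)p);
    free(chain); free(order);
    return 0;
}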
A better indicator of application performance is memory throughput. We use the triad component
of the streams benchmark to compare the performance at different memory speeds. The memory
throughput assumes all local memory allocation and all 8 cores utilizing main memory. As shown
in Table 3, the performance gain from running memory at 1066MHz versus 800MHz is 28%, and
the performance gain from running at 1333MHz versus 1066MHz is 9%. So, the performance
penalty of clocking memory down to 800MHz is far greater than clocking it down to 1066MHz.
This new processor design comes with some trade-offs in memory capacity, performance, and
cost: For example, more lower-cost/lower-capacity DIMMs mean lower memory speed.
Alternatively, fewer higher-capacity DIMMs cost more but offer higher performance.
Note that dropping the memory frequency from 1333MHz to 1066MHz costs less performance than
dropping from 1066MHz to 800MHz.
Table 3
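The triad component referred to above is a simple kernel, a[i] = b[i] + q*c[i], streamed over arrays far larger than the caches. The sketch below is a simplified stand-in for the published STREAM benchmark (compile with -fopenmp so all cores exercise memory, as in the configuration described in this section); the array size and scaling factor are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (20 * 1000 * 1000)   /* large enough to exceed the caches */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    #pragma omp parallel for            /* one thread per core */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];       /* the triad kernel */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* triad touches 3 arrays of N doubles: 2 reads + 1 write per element */
    printf("triad: %.2f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}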
Regardless of memory speed, the Xeon 5500 platform represents a significant improvement in
memory bandwidth over the previous Xeon 5400 platform. At 1333MHz, the improvement is
almost 500% over the previous generation. This huge improvement is mainly due to dual
integrated memory controllers and faster DDR-3 1333MHz memory. This improvement translates
into improved application performance and scalability.
The Xeon 5500 series CPUs improve memory bandwidth by nearly 500% over the earlier 5400 series.
2.1.5 Application Performance
In this section, we will discuss the impact of memory speed on the performance of three
commonly used benchmarks: SPECint®2006_rate, SPECfp®2006_rate and SPECjbb®2005. In
each case, the benchmark scores are relative to the score at 800MHz as shown in Figure 8.
SPECint2006_rate is typically used as an indicator of performance for commercial applications. It
tends to be more sensitive to processor frequency and less to memory bandwidth. There are very
few components in SPECint2006_rate that are memory bandwidth intensive and so the
performance gain with memory speed improvements is the least for this workload. In fact, most of
the difference observed is due to one of the sub-benchmarks that shows a high sensitivity to
memory frequency. There is an 8% improvement going from 800MHz to 1333MHz while the
improvement in memory bandwidth is almost 40%.
SPECfp_rate is used as an indicator for HPC (high-performance computing) workloads. It tends
to be memory bandwidth intensive and should reveal significant improvements for this workload
as memory frequency increases. As expected, a number of sub-benchmarks demonstrate
improvements as high as the difference in memory bandwidth. As shown in Figure 8, there is a
13% gain going from 800MHz to 1066MHz and another 6% improvement with 1333MHz.
SPECfp_rate captures almost 50% of the memory bandwidth improvement.
SPECjbb2005 is a workload that does not stress memory but keeps the data bus moderately
utilized. This workload provides a middle ground and the performance gains reflect that trend. As
shown in Table 4, there is an 8% gain from 800MHz to 1066MHz and another 2% upside with
1333MHz.
Table 4
2.2 Memory Interleaving
Memory interleaving refers to how physical memory is interleaved across the physical DIMMs. A
balanced system provides the best interleaving. A Xeon 5500 series processor-based system is
balanced when all memory channels on a socket have the same amount of memory. The
simplest way to enforce optimal interleaving is by populating 6 identical DIMMs at 1333MHz, 12
identical DIMMs at 1066MHz and 18 identical DIMMs (where supported by platform) at 800MHz.
Any other, unbalanced population leads to lessened performance. Figure 9 shows the impact of reduced interleaving. The first
configuration is a balanced baseline configuration where the memory is down-clocked to 800MHz
in BIOS. The second configuration populates four channels with 50% more memory than two
other channels causing an unbalanced configuration. The third configuration balances the
memory on all channels by populating the channels with fewer DIMM slots with a DIMM that is
double the capacity of others. (For example, two channels with 3 x 4GB DIMMs and one channel
with 1 x 4GB and 1 x 8GB DIMMs.) This ensures that all channels have the same capacity. As
Table 6 shows, the first and third balanced configurations significantly outperform the
unbalanced configuration. Depending on the memory footprint of the application and memory
access pattern, the impact could be higher or lower than the two applications cited in the figure.
Note: the more DIMMs installed, the lower the memory operating frequency; 12 DIMMs run at
1066MHz and 18 DIMMs at 800MHz. See Table 7 for details.
Table 6, Table 7
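The balance rule of this section reduces to a simple check: every channel on a socket must carry the same total capacity, regardless of how many DIMMs provide it. The hypothetical helper below encodes that check, using the 3 x 4GB versus 4GB + 8GB example from the text.

#include <stdbool.h>
#include <stdio.h>

#define CHANNELS 3

/* A socket is balanced when all three channels carry equal capacity. */
static bool socket_is_balanced(const int channel_gb[CHANNELS])
{
    for (int ch = 1; ch < CHANNELS; ch++)
        if (channel_gb[ch] != channel_gb[0])
            return false;
    return true;
}

int main(void)
{
    int balanced[CHANNELS]   = { 12, 12, 12 };  /* 3 x 4GB per channel        */
    int unbalanced[CHANNELS] = { 12, 12, 8 };   /* one channel lighter        */
    int rebalanced[CHANNELS] = { 12, 12, 12 };  /* 4GB + 8GB in third channel */

    printf("%d %d %d\n", socket_is_balanced(balanced),
           socket_is_balanced(unbalanced), socket_is_balanced(rebalanced));
    return 0;
}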
2.3 Memory Ranks
A memory rank is simply a segment of memory on a DIMM that is selected by its own chip-select
signal. DIMMs typically have 1, 2 or 4 memory ranks, as indicated by their designation:
• A typical memory DIMM description: 2GB 4R x8 DIMM
• The 4R designator is the rank count for this particular DIMM (4 ranks)
• The x8 designator is the data width of the rank
It is important to ensure that DIMMs with the appropriate number of ranks are populated in each
channel for optimal performance. Whenever possible, it is recommended to use dual-rank DIMMs
in the system. Dual-rank DIMMs offer better interleaving and hence better performance than
single-rank DIMMs. For instance, a system populated with 6 x 2GB dual-rank DIMMs outperforms
a system populated with 6 x 2GB single-rank DIMMs by 7% for SPECjbb2005. Dual-rank DIMMs
are also better than quad-rank DIMMs because quad-rank DIMMs will cause the memory speed
to be down-clocked.
Another important guideline is to populate equivalent ranks per channel. For instance, mixing
single-rank and dual-rank DIMMs in a channel should be avoided.
A rank is an independently selectable segment of a DIMM, and each channel supports only a limited
total number of ranks; in practice, balance memory capacity against memory frequency. Dual-rank
DIMMs are generally recommended.
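The rank guidelines above can likewise be expressed as a small per-channel check: do not mix rank counts within a channel, and expect a speed penalty when quad-rank DIMMs are present. The struct and helper below are hypothetical and only illustrate the rules as stated in this section.

#include <stdbool.h>
#include <stdio.h>

/* Mirrors a designator such as "2GB 4R x8": capacity, rank count, data width. */
struct dimm { int capacity_gb; int ranks; int data_width; };

static bool channel_population_ok(const struct dimm *d, int n)
{
    for (int i = 0; i < n; i++) {
        if (d[i].ranks != d[0].ranks)   /* mixed rank counts: avoid */
            return false;
        if (d[i].ranks == 4)            /* quad-rank forces a down-clock */
            fprintf(stderr, "warning: quad-rank DIMM reduces memory speed\n");
    }
    return true;
}

int main(void)
{
    struct dimm ch[]    = { {2, 2, 8}, {2, 2, 8} };   /* 2 x 2GB 2R x8: ok   */
    struct dimm mixed[] = { {2, 1, 8}, {2, 2, 8} };   /* 1R + 2R mixed: bad  */
    printf("%d %d\n", channel_population_ok(ch, 2),
           channel_population_ok(mixed, 2));
    return 0;
}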
2.4 Memory Population across Memory Channels
It is important to ensure that all three memory channels in each processor are populated. The
relative memory bandwidth is shown in Figure 10, which illustrates the loss of memory bandwidth
as the number of channels populated decreases. This is because the bandwidth of all the
memory channels is utilized to support the capability of the processor. So, as the channels are
decreased, the burden to support the requisite bandwidth is increased on the remaining channels,
causing them to become a bottleneck.
Table 8
2.5 Memory Population Across Processor Sockets
Because the Xeon 5500 series uses NUMA architecture, it is important to ensure that both
memory controllers in the system are utilized, by providing both processors with memory. If only
one processor is installed, only the associated DIMM slots can be used. Adding a second
processor not only doubles the amount of memory available for use, but also doubles the number
of memory controllers, thus doubling the system memory bandwidth. It is also optimal to populate
memory for both processors in an identical fashion to provide a balanced system. Using Figure
11 as an example, Processor 0 has DIMMs populated but no DIMMs are populated for Processor
1. In this case, Processor 0 will have access to low latency local memory and high memory
bandwidth. However, Processor 1 has access only to remote or “far” memory. So, threads
executing on Processor 1 will have a long latency to access memory as compared to threads on
Processor 0.
This is due to the latency penalty incurred to traverse the QPI links to access the data on the
remote memory controller. The latency to access remote memory is almost 75% higher than local
memory access. The bandwidth to remote memory is also limited by the capability of the QPI
links. So, the goal should be to always populate both processors with memory.
Table 9
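On Linux, an application that wants to avoid the remote-memory penalty described above can ask for memory from its local node explicitly. The sketch below uses the libnuma API (link with -lnuma); it is a minimal illustration of local allocation, not a recommendation specific to the Xeon 5500 platform, and the buffer size is arbitrary.

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    int node = numa_node_of_cpu(0);           /* node that owns CPU 0 */
    if (node < 0) node = 0;
    size_t sz = 256 * 1024 * 1024;

    /* Allocate from the memory controller attached to this node so threads
     * running there see local, not remote, latency. */
    void *buf = numa_alloc_onnode(sz, node);
    if (!buf) return 1;
    memset(buf, 0, sz);                       /* touch pages to commit them */

    printf("allocated %zu bytes on node %d\n", sz, node);
    numa_free(buf, sz);
    return 0;
}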
3.0 Best Practices
In this section, we recap the rules to follow for optimal memory configuration on Xeon 5500
series-based platforms.
3.1 Maximum Performance
Follow these rules for peak performance (a configuration-check sketch follows the list):
• Always populate both processors with equal amounts of memory to ensure a balanced NUMA system.
• Always populate all 3 memory channels on each processor with equal memory capacity.
• Ensure an even number of ranks is populated per channel.
• Use dual-rank DIMMs whenever appropriate.
• For optimal 1333MHz performance, populate 6 dual-rank DIMMs (3 per processor).
• For optimal 1066MHz performance, populate 12 dual-rank DIMMs (6 per processor).
• For optimal 800MHz performance with high DIMM counts:
– On 12 DIMM platforms, populate 12 dual-rank or quad-rank DIMMs (6 per processor).
– On 16 DIMM platforms:
Populate 12 dual-rank or quad-rank DIMMs (6 per processor).
Populate 14 dual-rank DIMMs of one size and 2 dual-rank DIMMs of double the size
as described in the interleaving section.
• With the above rules, it is not possible to have a performance-optimized system with 4GB,
8GB, 16GB, or 128GB. With 3 memory channels and interleaving rules, customers need to
configure systems with 6GB, 12GB, 18GB, 24GB, 48GB, 72GB, 96GB, etc., for optimized
performance.
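As referenced above, the rules in this list can be strung together into a single configuration check: equal capacity on both sockets, equal capacity across the three channels of each socket, and an even rank count per channel. The sketch below is a hypothetical validator for illustration only; platform-specific limits and plugging order (section 3.2.1) are not modeled.

#include <stdbool.h>
#include <stdio.h>

#define SOCKETS  2
#define CHANNELS 3

struct channel { int capacity_gb; int ranks; };

static bool config_ok(struct channel cfg[SOCKETS][CHANNELS])
{
    for (int s = 0; s < SOCKETS; s++)
        for (int c = 0; c < CHANNELS; c++) {
            if (cfg[s][c].capacity_gb != cfg[0][0].capacity_gb)
                return false;                 /* channels/sockets unbalanced */
            if (cfg[s][c].ranks % 2 != 0)
                return false;                 /* odd rank count in a channel */
        }
    return true;
}

int main(void)
{
    /* 6 x 4GB dual-rank DIMMs, 1 DPC: the optimal 1333MHz layout in 3.1 */
    struct channel good[SOCKETS][CHANNELS] = {
        { {4, 2}, {4, 2}, {4, 2} }, { {4, 2}, {4, 2}, {4, 2} }
    };
    /* Second socket left unpopulated: violates the balance rule */
    struct channel bad[SOCKETS][CHANNELS] = {
        { {4, 2}, {4, 2}, {4, 2} }, { {0, 0}, {0, 0}, {0, 0} }
    };
    printf("good: %d, bad: %d\n", config_ok(good), config_ok(bad));
    return 0;
}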
3.2 Other Considerations
3.2.1 Plugging Order
Take care to populate empty DIMM sockets in the specific order for each platform when adding
DIMMs to Xeon 5500 series platforms. The DIMM socket farthest away from its associated
processor, per memory channel, is always plugged first. Consult the documentation for your
specific system for details.
3.2.2 Power Guidelines
This document is focused on maximum performance configuration for Xeon 5500 series
processor-based systems. Here are a few power guidelines for consideration:
• Fewer, larger DIMMs (for example, 6 x 4GB DIMMs vs. 12 x 2GB DIMMs) will generally have
lower power requirements
• x8 DIMMs (x8 data width of rank, see section 2.3) will generally draw less power than
equivalently sized x4 DIMMs
• Consider BIOS configuration settings (see section 3.2.4)
3.2.3 Reliability
Here are two reliability guidelines for consideration:
• Using fewer, larger DIMMs (for example, 6 x 4GB DIMMs vs. 12 x 2GB DIMMs) is generally
more reliable
• Xeon 5500 series memory controllers support IBM Chipkill™ memory protection technology
with x4 DIMMs (x4 data width of rank; see section 2.3), but not with x8 DIMMs
3.2.4 BIOS Configuration Settings
There are a number of BIOS configuration settings on servers using the Xeon 5500 series
processors that can also affect memory performance or benchmark results. For example, most
platforms allow the option of decreasing the memory clock speed below the supported maximum.
This may be useful for power savings but, obviously, decreases memory performance.
Meanwhile, options like Hyper-Threading Technology (also known as Simultaneous Multi-
Threading) and Turbo Boost Technology can also significantly affect benchmark results. Specific
memory configuration settings important to performance include:
Table 10
Original authors:
Ganesh Balakrishnan
IBM System x and BladeCenter Performance
Ralph M. Begun
IBM System x Development