HugePages is a feature integrated into the Linux kernel with release 2.6. It provides an alternative to the default 4K page size (16K on IA64) by offering larger pages.
Several technical terms relate to HugePages:
(1) Page Table: A page table is the data structure that an operating system's virtual memory system uses to store the mapping between virtual addresses and physical addresses. On a virtual memory system, memory is therefore accessed in two steps. The page table is consulted first, and the mapping found there then leads implicitly to the actual physical memory location.
(2) TLB: A Translation Lookaside Buffer (TLB) is a fixed-size buffer (or cache) in the CPU that holds part of the page table. It is used to make virtual address translation faster.
(3) hugetlb: This is an entry in the TLB that points to a HugePage, meaning a page larger than the regular 4K page and of a predefined size. HugePages are implemented via hugetlb entries, so we can say that a HugePage is handled by a "hugetlb page entry". The term "hugetlb" is also (and mostly) used as a synonym for HugePage (see Note 261889.1). This document uses "HugePage", but keep in mind that "hugetlb" usually refers to the same concept.
(4) hugetlbfs: This is an in-memory filesystem similar to tmpfs, introduced with the 2.6 kernel. Pages allocated on a hugetlbfs filesystem are allocated as HugePages.
WRONG: HugePages is a method to be able to use a large SGA on 32-bit VLM systems.
RIGHT: HugePages is a method to have larger pages, which is useful when working with very large memory. It is useful in both 32- and 64-bit configurations.
WRONG: HugePages cannot be used without USE_INDIRECT_DATA_BUFFERS.
RIGHT: HugePages can be used without indirect buffers. 64-bit systems do not need indirect buffers to have a large buffer cache for the RDBMS instance, and HugePages can be used there too.
WRONG: hugetlbfs means hugetlb.
RIGHT: hugetlbfs is a filesystem type **BUT** hugetlb is the underlying mechanism, and hugetlb can be used WITHOUT hugetlbfs.
WRONG: hugetlbfs means HugePages.
RIGHT: hugetlbfs is a filesystem type **BUT** HugePages is the underlying mechanism (synonymous with hugetlb), and HugePages can be used WITHOUT hugetlbfs.
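Because hugetlbfs is only a filesystem type, whether it is mounted says nothing definite about whether HugePages are in use. Shared memory segments can use HugePages with no hugetlbfs mount at all. The sketch below lists any hugetlbfs mount points from /proc/mounts-style text. The helper name `hugetlbfs_mounts` is hypothetical, and an empty result does not by itself mean HugePages are unavailable.

```bash
#!/usr/bin/env bash
# Print the mount point of every hugetlbfs filesystem found in
# /proc/mounts-style input read from stdin.
# Fields per line: device mountpoint fstype options dump pass
hugetlbfs_mounts() {
  awk '$3 == "hugetlbfs" { print $2 }'
}

# Example (read-only): hugetlbfs_mounts < /proc/mounts
```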
When a single process works with a piece of memory, the pages that the process uses are referenced in a local page table for that process. The entries in this table in turn reference the system-wide page table, which holds the references to actual physical memory addresses. So, in theory, a user-mode process (e.g. an Oracle process) follows its local page table to the system page table, and through it reaches the actual physical memory. As you can see below, it is also possible for two different O/S processes to point to the same entry in the system-wide page table. This is very common with Oracle RDBMS because of the SGA.
When HugePages are in play, the usual page tables are still used. The basic difference is that the entries in both the process page table and the system page table carry attributes about huge pages. Any page in a page table can therefore be either a huge page or a regular page. The following diagram illustrates 4096K HugePages, but it would be the same for any huge page size.
(1) HugePages can be allocated on the fly, but they must be reserved during system startup. Otherwise the allocation might fail, because by then memory is mostly already paged in 4K pages.
(2) HugePage sizes vary from 2MB to 256MB depending on kernel version and hardware architecture (see the related section below).
(3) After system startup, HugePages are not reserved or released unless a system administrator intervenes. This basically means changing the HugePages configuration, i.e. the number of pages available or the pool size.
(1) Not swappable
HugePages are not swappable, so there is no page-in/page-out mechanism overhead. HugePages are universally regarded as pinned.
(2) Relief of TLB pressure
1) HugePages use fewer pages to cover the physical address space. The size of the "bookkeeping" (the mapping from virtual to physical addresses) therefore decreases, and fewer entries are required in the TLB.
2) With HugePages, each TLB entry covers a larger part of the address space, so there are fewer TLB misses before all or most of the SGA is mapped.
3) Fewer TLB entries for the SGA also means more entries are left for other parts of the address space.
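The TLB argument can be made concrete with a small sketch. It computes how much memory a given number of TLB entries can map (the "TLB reach") and how many pages a region needs. The helpers `tlb_reach_kb` and `pages_needed` are hypothetical. The 64-entry figure used in the example is an illustrative assumption: real CPUs have several TLB levels, often with separate entry counts for 4K and 2MB pages.

```bash
#!/usr/bin/env bash
# TLB reach in KB = number of TLB entries x page size in KB.
tlb_reach_kb() {
  local entries=$1 page_kb=$2
  echo $(( entries * page_kb ))
}

# Number of pages (rounded up) needed to map a region of size_kb.
pages_needed() {
  local size_kb=$1 page_kb=$2
  echo $(( (size_kb + page_kb - 1) / page_kb ))
}
```

With a hypothetical 64 entries, 4K pages map 256 KB while 2MB pages map 128 MB (`tlb_reach_kb 64 2048` prints 131072). A 50GB SGA (52428800 KB) needs 13107200 pages of 4K but only 25600 pages of 2MB.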
(3) Decreased page table overhead
Each page table entry can be as large as 64 bytes. If we are trying to handle 50GB of RAM, the page table will be approximately 800MB in size. In practice that will not fit in the 880MB of lowmem in 2.4 kernels, once the other uses of lowmem are considered (in 2.6 kernels the page table is not necessarily in lowmem). When 95% of memory is accessed via 256MB HugePages, the whole page table shrinks to approximately 40MB.
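Dave's notes:
With regular 4K pages, the number of entries needed is (50*1024*1024/4) = 13,107,200.
Each entry is 64 bytes, so the total size is (50*1024*1024/4) * 64/1024/1024 = 800M.
Note that this is the page table of just one process. With 10 processes, the page tables alone would need 800*10, about 8G of memory, on a machine with only 50G in total. This is why HugePages become especially important with large memory.
HugePage sizes range from 2MB to 256MB. Taking 2MB:
(50*1024/2)*64/1024/1024 = 1.6M
Even 10 processes then need only about 16M.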
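The arithmetic above can be reproduced with a short sketch. `page_table_bytes` is a hypothetical helper. It follows the text's assumption of up to 64 bytes per page table entry, and the entry size is a parameter so other values can be tried.

```bash
#!/usr/bin/env bash
# Page table size in bytes = number of pages (rounded up) x bytes per entry.
# Arguments: memory size in KB, page size in KB, entry size in bytes
# (defaults to the 64-byte upper bound assumed in the text).
page_table_bytes() {
  local mem_kb=$1 page_kb=$2 entry_bytes=${3:-64}
  local pages=$(( (mem_kb + page_kb - 1) / page_kb ))
  echo $(( pages * entry_bytes ))
}
```

For 50GB (52428800 KB), `page_table_bytes 52428800 4` gives 838860800 bytes (800 MB) and `page_table_bytes 52428800 2048` gives 1638400 bytes (about 1.6 MB). `page_table_bytes 2621440 4` gives 41943040 bytes (40 MB). That covers the 5% of 50GB that stays on regular 4K pages in the "95% via 256MB HugePages" case, which is where the ~40MB figure comes from. The 256MB pages themselves add only a few KB.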
(4) Eliminated page table lookup overhead
Since the pages are not subject to replacement, page table lookups are not required.
(5) Faster overall memory performance
On virtual memory systems each memory operation is actually two abstract memory operations. Since there are fewer pages to work on, the possible bottleneck on page table access is clearly avoided.
The size of a single HugePage depends on the platform:
(1) Kernel version/Linux distribution
(2) HW Platform
The actual HugePage size can be checked with the following command:
$ grep Hugepagesize /proc/meminfo
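The same lookup can be wrapped in a small function for reuse in scripts. `hugepage_size_kb` is a hypothetical name. It prints the size in KB and fails when the Hugepagesize line is missing. That is the same condition the hugepages_settings.sh script below treats as "HugePages may not be supported".

```bash
#!/usr/bin/env bash
# Print the HugePage size in KB from /proc/meminfo-style input on stdin.
# The relevant line looks like: "Hugepagesize:       2048 kB"
hugepage_size_kb() {
  local kb
  kb=$(awk '$1 == "Hugepagesize:" { print $2 }')
  if [ -z "$kb" ]; then
    echo "Hugepagesize not found: HugePages may not be supported" >&2
    return 1
  fi
  echo "$kb"
}

# Example: hugepage_size_kb < /proc/meminfo
```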
The table below shows the HugePage sizes on different configurations. Note that these are general numbers taken from the most recent kernel versions. For a specific (e.g. more recent) kernel source tree, check the value of the HPAGE_SIZE macro, which is derived from HPAGE_SHIFT.
| HW Platform | Source Code Tree | Kernel 2.4 | Kernel 2.6 |
| --- | --- | --- | --- |
| Linux x86 (IA32) | i386 | 4 MB | 4 MB * |
| Linux x86-64 (AMD64, EM64T) | x86_64 | 2 MB | 2 MB |
| Linux Itanium (IA64) | ia64 | 256 MB | 256 MB |
| IBM Power Based Linux (PPC64) | ppc64/powerpc | N/A ** | 16 MB |
| IBM zSeries Based Linux | s390 | N/A | N/A |
| IBM S/390 Based Linux | s390 | N/A | N/A |
* Some older packaging of the 2.6.5 kernel on SLES8 (such as 2.6.5-7.97) can have a 2 MB Hugepagesize.
** Oracle RDBMS is also not certified in this configuration. See Document 341507.1.
AMM and HugePages are not compatible: AMM must be disabled on 11g in order to use HugePages. See Document 749851.1 for further information.
On Linux, if HugePages are not configured appropriately for the database processes, the following problems may occur:
(1) HugePages not used at all (HugePages_Total = HugePages_Free), wasting the memory configured for them
(2) Poor database performance
(3) System running out of memory or excessive swapping
(4) Some or all database instances cannot be started
(5) Crucial system services failing (e.g. CRS)
To avoid or help with such situations, Bug 10153816 was filed to introduce a database initialization parameter in 11.2.0.2, use_large_pages. It helps manage which SGAs use HugePages, and can give warnings or refuse to start up an instance that cannot get those pages.
HugePages are crucial for faster Oracle database performance on Linux if you have a large RAM and SGA. You will need HugePages configured if your combined database SGAs are large, say more than 8GB, and they can matter even for smaller SGAs. Note that the size of the SGA matters. The advantages of HugePages are:
(1) Larger page size and fewer pages: The default page size is 4K, whereas the HugeTLB size is 2048K (on x86-64). The system therefore handles 512 times fewer pages.
(2) No page table lookups: HugePages are not subject to replacement (unlike regular pages), so page table lookups are not required.
(3) Better overall memory performance: On virtual memory systems (any modern OS) each memory operation is actually two abstract memory operations. With HugePages there are fewer pages to work on, so the possible bottleneck on page table access is clearly avoided.
(4) No swapping: Swapping must be avoided on Linux altogether (see Document 1295478.1). HugePages are not swappable (whereas regular pages are), so there is no page replacement mechanism overhead. HugePages are universally regarded as pinned.
(5) No 'kswapd' operations: kswapd gets very busy when there is a very large area to be paged (e.g. 13 million page table entries for 50GB of memory), and it then uses an enormous amount of CPU. When HugePages are used, kswapd is not involved in managing them. See also Document 361670.1.
Add a memlock limit to /etc/security/limits.conf. The value is in KB and should be slightly smaller than the installed physical memory. For example, with 64GB of physical memory:
* soft memlock 60397977
* hard memlock 60397977
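Setting this value larger than the SGA requires has no adverse effect.
If you use the oracle-validated package on Oracle Linux, or Exadata DB compute nodes, this parameter is configured automatically.
Check the value with: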
$ ulimit -l
60397977
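The example value 60397977 works out to 90% of 64GB expressed in KB (64 × 1024 × 1024 × 0.9, truncated). A sketch under that assumption derives the value from MemTotal and prints the two limits.conf lines. The helpers `memlock_kb` and `memlock_lines` are hypothetical, and the 90% ratio is inferred from the example rather than taken from a documented rule.

```bash
#!/usr/bin/env bash
# Suggested memlock (KB) = 90% of total memory in KB.
# The 90% ratio is an assumption inferred from the 64GB example.
memlock_kb() {
  local mem_kb=$1
  echo $(( mem_kb * 9 / 10 ))
}

# Print soft and hard memlock lines for limits.conf.
# Second argument is the account (default "*" = all users).
memlock_lines() {
  local kb=$1 who="${2:-*}"
  printf '%s soft memlock %s\n%s hard memlock %s\n' "$who" "$kb" "$who" "$kb"
}

# Example (read-only), using MemTotal from /proc/meminfo:
#   memlock_lines "$(memlock_kb "$(awk '$1 == "MemTotal:" { print $2 }' /proc/meminfo)")" oracle
```

MemTotal is slightly below the installed RAM, so on a 64GB machine this yields a value a little under 60397977.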
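From Oracle 11g on, instances are created by default with Automatic Memory Management (AMM), which is not compatible with HugePages.
Disable AMM before configuring HugePages by setting the initialization parameters MEMORY_TARGET and MEMORY_MAX_TARGET to 0.
With AMM, the whole SGA is allocated under /dev/shm, so HugePages are not used when the SGA is allocated. This is why AMM and HugePages are incompatible.
Also note that ASM instances use AMM by default. ASM instances do not need a large SGA, however, so using HugePages for them brings little benefit.
To use HugePages, first make sure that MEMORY_TARGET/MEMORY_MAX_TARGET are not set.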
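Whether a database is set up for AMM can be checked from a text parameter file, for example one written with CREATE PFILE FROM SPFILE. The sketch below succeeds when MEMORY_TARGET or MEMORY_MAX_TARGET is set to a non-zero value. `amm_configured` is a hypothetical helper. It only understands simple name=value lines and assumes the pfile text arrives on stdin.

```bash
#!/usr/bin/env bash
# Exit status 0 if the pfile text on stdin sets memory_target or
# memory_max_target to a non-zero value (AMM configured), 1 otherwise.
# Handles an optional "sid." or "*." prefix, any letter case, spaces
# around "=", and comment lines. "0", "0M" or "0G" count as zero.
amm_configured() {
  awk -F'=' '
    /^[ \t]*#/ { next }                 # skip comment lines
    {
      name = tolower($1)
      sub(/^[^.]*\./, "", name)         # drop a sid. or *. prefix
      gsub(/[ \t]/, "", name)
      if (name == "memory_target" || name == "memory_max_target") {
        v = $2
        gsub(/[ \t]/, "", v)
        if (v != "" && v !~ /^0+[KkMmGg]?$/) found = 1
      }
    }
    END { exit (found ? 0 : 1) }'
}

# Example (hypothetical path):
#   amm_configured < /tmp/initORCL.ora && echo "AMM is on: disable it first"
```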
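Make sure all database instances, including ASM instances, are up and running. Then use the hugepages_settings.sh script to get the recommended value for the vm.nr_hugepages kernel parameter.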
$ ./hugepages_settings.sh
...
Recommended setting: vm.nr_hugepages = 1496
$
You can also calculate the value yourself.
The script is as follows:
```bash
#!/bin/bash
#
# hugepages_settings.sh
#
# Linux bash script to compute values for the
# recommended HugePages/HugeTLB configuration
#
# Note: This script does the calculation for all shared memory
# segments available when the script is run, whether or not
# they are Oracle RDBMS shared memory segments.
#
# This script is provided by Doc ID 401749.1 from My Oracle Support
# http://support.oracle.com

# Welcome text
echo "
This script is provided by Doc ID 401749.1 from My Oracle Support
(http://support.oracle.com) where it is intended to compute values for
the recommended HugePages/HugeTLB configuration for the current shared
memory segments. Before proceeding with the execution please note the following:
 * For ASM instances, configure ASMM instead of AMM.
 * 'pga_aggregate_target' is outside the SGA and
   you should accommodate this while calculating the SGA size.
 * If you change the DB SGA size, the new SGA will not fit in the
   previous HugePages configuration, so it is better to disable
   HugePages entirely, start the DB with the new SGA size and
   run the script again.
And make sure that:
 * Oracle Database instance(s) are up and running
 * Oracle Database 11g Automatic Memory Management (AMM) is not set up
   (See Doc ID 749851.1)
 * The shared memory segments can be listed by command:
     # ipcs -m

Press Enter to proceed..."

read

# Check for the kernel version (major.minor, e.g. 2.6)
KERN=`uname -r | awk -F. '{ printf("%d.%d\n",$1,$2); }'`

# Find out the HugePage size (in KB)
HPG_SZ=`grep Hugepagesize /proc/meminfo | awk '{print $2}'`
if [ -z "$HPG_SZ" ]; then
    echo "The hugepages may not be supported in the system where the script is being executed."
    exit 1
fi

# Initialize the counter
NUM_PG=0

# Cumulative number of pages required to handle the running shared memory segments.
# Column 5 of `ipcs -m` is the segment size in bytes; grep drops the
# non-numeric header fields.
for SEG_BYTES in `ipcs -m | awk '{print $5}' | grep "[0-9][0-9]*"`
do
    MIN_PG=`echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q`
    if [ $MIN_PG -gt 0 ]; then
        NUM_PG=`echo "$NUM_PG+$MIN_PG+1" | bc -q`
    fi
done

RES_BYTES=`echo "$NUM_PG * $HPG_SZ * 1024" | bc -q`

# An SGA less than 100MB does not make sense
# Bail out if that is the case
if [ $RES_BYTES -lt 100000000 ]; then
    echo "***********"
    echo "** ERROR **"
    echo "***********"
    echo "Sorry! There are not enough total of shared memory segments allocated for
HugePages configuration. HugePages can only be used for shared memory segments
that you can list by command:

    # ipcs -m

of a size that can match an Oracle Database SGA. Please make sure that:
 * Oracle Database instance is up and running
 * Oracle Database 11g Automatic Memory Management (AMM) is not configured"
    exit 1
fi

# Finish with results
case $KERN in
    '2.4') HUGETLB_POOL=`echo "$NUM_PG*$HPG_SZ/1024" | bc -q`;
           echo "Recommended setting: vm.hugetlb_pool = $HUGETLB_POOL" ;;
    # 2.6 and later kernels (3.x, 4.x, ...) use vm.nr_hugepages
    '2.6' | [3-9].*) echo "Recommended setting: vm.nr_hugepages = $NUM_PG" ;;
    *) echo "Unrecognized kernel version $KERN. Exiting." ;;
esac

# End
```
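The heart of the script is the loop over the `ipcs -m` segment sizes. The bash-only sketch below reproduces that arithmetic so it can be checked by hand. `nr_hugepages_for` is a hypothetical helper that takes the Hugepagesize in KB followed by segment sizes in bytes, instead of parsing ipcs output. Each segment contributes floor(size / HugePage size) + 1 pages, and segments smaller than one HugePage are skipped, as with the script's `if [ $MIN_PG -gt 0 ]` test. The + 1 reserves one extra page when a segment is an exact multiple of the HugePage size.

```bash
#!/usr/bin/env bash
# Usage: nr_hugepages_for HUGEPAGE_KB SEGMENT_BYTES...
# Mirrors the MIN_PG/NUM_PG loop of hugepages_settings.sh using bash
# integer arithmetic instead of bc.
nr_hugepages_for() {
  local hpg_kb=$1; shift
  local hpg_bytes=$(( hpg_kb * 1024 )) total=0 seg min
  for seg in "$@"; do
    min=$(( seg / hpg_bytes ))        # whole HugePages in this segment
    if [ "$min" -gt 0 ]; then
      total=$(( total + min + 1 ))    # plus one page, as the script does
    fi
  done
  echo "$total"
}
```

For example, `nr_hugepages_for 2048 2147483648` (one 2GB segment with 2MB HugePages) prints 1025.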
Add the recommended value to /etc/sysctl.conf and reboot the system:
...
vm.nr_hugepages = 1496
...
After rebooting the system, make sure all database instances are up, then check the HugePages status:
# grep HugePages /proc/meminfo
HugePages_Total: 1496
HugePages_Free: 485
HugePages_Rsvd: 446
HugePages_Surp: 0
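To confirm that the configuration is valid, check two things:
- HugePages_Free should be smaller than HugePages_Total.
- There should be some HugePages_Rsvd.

The sum of HugePages_Free and HugePages_Rsvd may be smaller than your total combined SGA, because instances allocate pages dynamically as needed.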
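The check can be scripted. The sketch below reads /proc/meminfo-style text and labels the HugePages state. `hugepages_status` is a hypothetical helper and its labels are made up for illustration:
- "ok" when some pages are in use (Free < Total) and some are reserved (Rsvd > 0).
- "unused" when HugePages_Total equals HugePages_Free and nothing is reserved.
- "none" when no HugePages are configured.
- "check" for anything else.

```bash
#!/usr/bin/env bash
# Classify HugePages usage from /proc/meminfo-style input on stdin.
hugepages_status() {
  awk '
    $1 == "HugePages_Total:" { t = $2 }
    $1 == "HugePages_Free:"  { f = $2 }
    $1 == "HugePages_Rsvd:"  { r = $2 }
    END {
      if (t == 0)                print "none"    # no HugePages configured
      else if (f < t && r > 0)   print "ok"      # some used, some reserved
      else if (t == f && r == 0) print "unused"  # nothing used or reserved
      else                       print "check"   # look closer
    }'
}

# Example: hugepages_status < /proc/meminfo
```

With the sample output above (Total 1496, Free 485, Rsvd 446) the function prints "ok".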
Some common problems:
| Symptom | Possible Cause | Troubleshooting Action |
| --- | --- | --- |
| System is running out of memory or swapping | Not enough HugePages to cover the SGA(s), so the area reserved for HugePages is wasted while SGAs are allocated through regular pages. | Review your HugePages configuration to make sure that all SGA(s) are covered. |
| Databases fail to start | memlock limits are not set properly. | Make sure the settings in limits.conf apply to the database owner account. |
| One of the databases fails to start while another is up | The SGA of that database could not find available HugePages, and the remaining RAM is not enough. | Make sure that the RAM and HugePages are enough to cover all your database SGAs. |
| Cluster Ready Services (CRS) fail to start | HugePages configured too large (maybe larger than installed RAM). | Make sure the total SGA is less than the installed RAM and recalculate HugePages. |
| HugePages_Total = HugePages_Free | HugePages are not used at all: no database instances are up, or they use AMM. | Disable AMM and make sure that the database instances are up. |
| Database started successfully but performance is slow | The SGA of that database could not find available HugePages, so it is handled by regular pages, which leads to slow performance. | Make sure there are enough HugePages to cover all your database SGAs. |
HugePages and Oracle Database 11g Automatic Memory Management (AMM) on Linux [ID 749851.1]
Hugepages Are Not Used by Database Buffer Cache [ID 829850.1]
Oracle Not Utilizing Hugepages [ID 803238.1]
/proc/meminfo Does Not Provide HugePages Information on Oracle Enterprise Linux (OEL5) [ID 860350.1]
HugePages Not Released On Oracle RDBMS Instance Shutdown with RHEL / EL 5 Update 1 (5.1) [ID 550443.1]
Shell Script to Calculate Values Recommended Linux HugePages / HugeTLB Configuration [ID 401749.1]
HugePages on Oracle Linux 64-bit [ID 361468.1]
HugePages on Linux: What It Is... and What It Is Not... [ID 361323.1]
Setup HugePages in a Guest Does Not Work with Oracle VM 2.1 or 2.1.1 [ID 728063.1]
---------------------------------------------------------------------------------------
All rights reserved. Reposting is permitted, but the source must be credited with a link; otherwise legal action will be taken!
Skype: tianlesoftware
Blog: http://blog.csdn.net/tianlesoftware
Weibo: http://weibo.com/tianlesoftware
Twitter: http://twitter.com/tianlesoftware
Facebook: http://www.facebook.com/tianlesoftware
Linkedin: http://cn.linkedin.com/in/tianlesoftware