Reposted from: http://duartes.org/gustavo/blog/post/page-cache-the-affair-between-memory-and-files
Previously we looked at how the kernel manages virtual memory for a user process, but files and I/O were left out. This post covers the important and often misunderstood relationship between files and memory and its consequences for performance.
Two serious problems must be solved by the OS when it comes to files. The first one is the mind-blowing slowness of hard drives, and disk seeks in particular, relative to memory. The second is the need to load file contents in physical memory once and share the contents among programs. If you use Process Explorer to poke at Windows processes, you'll see there are ~15MB worth of common DLLs loaded in every process. My Windows box right now is running 100 processes, so without sharing I'd be using up to ~1.5GB of physical RAM just for common DLLs. No good. Likewise, nearly all Linux programs need ld.so and libc, plus other common libraries.
Happily, both problems can be dealt with in one shot: the page cache, where the kernel stores page-sized chunks of files. To illustrate the page cache, I'll conjure a Linux program named render, which opens file scene.dat and reads it 512 bytes at a time, storing the file contents into a heap-allocated block. The first read goes like this:
After 12KB have been read, render's heap and the relevant page frames look thus:
This looks innocent enough, but there's a lot going on. First, even though this program uses regular read calls, three 4KB page frames are now in the page cache storing part of scene.dat. People are sometimes surprised by this, but all regular file I/O happens through the page cache. In x86 Linux, the kernel thinks of a file as a sequence of 4KB chunks. If you read a single byte from a file, the whole 4KB chunk containing the byte you asked for is read from disk and placed into the page cache. This makes sense because sustained disk throughput is pretty good and programs normally read more than just a few bytes from a file region. The page cache knows the position of each 4KB chunk within the file, depicted above as #0, #1, etc. Windows uses 256KB views analogous to pages in the Linux page cache.
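The chunk arithmetic can be sketched in a few lines of Python. This is only an illustration: PAGE_SIZE is assumed to be 4096 to match the text, pages_touched and make_scene_file are hypothetical helper names, and the code computes which file chunks the reads cover rather than inspecting the real page cache.

```python
import os
import tempfile

PAGE_SIZE = 4096   # assumed page size, matching the 4KB chunks described above
READ_SIZE = 512    # render reads the file 512 bytes at a time


def pages_touched(path, limit):
    """Read up to `limit` bytes in READ_SIZE pieces; return the file page indexes covered."""
    touched = []
    offset = 0
    # buffering=0 so that each read() here corresponds to one read system call
    with open(path, "rb", buffering=0) as f:
        while offset < limit:
            chunk = f.read(min(READ_SIZE, limit - offset))
            if not chunk:
                break  # end of file
            first = offset // PAGE_SIZE
            last = (offset + len(chunk) - 1) // PAGE_SIZE
            for page in range(first, last + 1):
                if page not in touched:
                    touched.append(page)
            offset += len(chunk)
    return touched


def make_scene_file(size):
    """Create a throwaway stand-in for scene.dat holding `size` random bytes."""
    fd, path = tempfile.mkstemp(suffix=".dat")
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    return path
```

Calling pages_touched(make_scene_file(64 * 1024), 12 * 1024) returns [0, 1, 2], the three chunks from the example, and reading a single byte still lands on chunk #0 as a whole.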
Sadly, in a regular file read the kernel must copy the contents of the page cache into a user buffer, which not only takes CPU time and hurts the CPU caches, but also wastes physical memory with duplicate data. As per the diagram above, the scene.dat contents are stored twice, and each instance of the program would store the contents an additional time. We've mitigated the disk latency problem but failed miserably at everything else. Memory-mapped files are the way out of this madness:
When you use file mapping, the kernel maps your program's virtual pages directly onto the page cache. This can deliver a significant performance boost: Windows System Programming reports run time improvements of 30% and up relative to regular file reads, while similar figures are reported for Linux and Solaris in Advanced Programming in the Unix Environment. You might also save large amounts of physical memory, depending on the nature of your application.
As always with performance, measurement is everything, but memory mapping earns its keep in a programmer's toolbox. The API is pretty nice too: it allows you to access a file as bytes in memory and does not require your soul and code readability in exchange for its benefits. Mind your address space and experiment with mmap in Unix-like systems, CreateFileMapping in Windows, or the many wrappers available in high level languages. When you map a file its contents are not brought into memory all at once, but rather on demand via page faults. The fault handler maps your virtual pages onto the page cache after obtaining a page frame with the needed file contents. This involves disk I/O if the contents weren't cached to begin with.
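As one example of such a wrapper, here is a minimal sketch using Python's standard mmap module. read_record and the 512-byte record size are made up for illustration. Note that slicing the mapping copies the requested bytes into a Python bytes object, so the sketch shows the "file as bytes in memory" API rather than a zero-copy pipeline.

```python
import mmap
import os
import tempfile


def read_record(path, index, record_size=512):
    """Return record number `index` from the file by slicing a read-only mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Nothing is read eagerly here; pages are
        # brought in on demand when the slice below touches them.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            start = index * record_size
            return m[start:start + record_size]
```

No seek or read call appears anywhere: the file is addressed like an array, and only the pages backing the requested record need to be resident.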
Now for a pop quiz. Imagine that the last instance of our render program exits. Would the pages storing scene.dat in the page cache be freed immediately? People often think so, but that would be a bad idea. When you think about it, it is very common for us to create a file in one program, exit, then use the file in a second program. The page cache must handle that case. When you think more about it, why should the kernel ever get rid of page cache contents? Remember that disk is 5 orders of magnitude slower than RAM, hence a page cache hit is a huge win. So long as there's enough free physical memory, the cache should be kept full. It is therefore not dependent on a particular process, but rather it's a system-wide resource. If you run render a week from now and scene.dat is still cached, bonus! This is why the kernel cache size climbs steadily until it hits a ceiling. It's not because the OS is garbage and hogs your RAM; it's actually good behavior because in a way free physical memory is a waste. Better use as much of the stuff for caching as possible.
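On Linux you can watch that cache figure yourself. The sketch below assumes the usual /proc/meminfo layout, where a line such as "Cached:  123456 kB" reports roughly the memory held by the page cache. cached_kib and current_cached_kib are hypothetical helper names, and the second one simply returns None on systems without /proc/meminfo.

```python
def cached_kib(meminfo_text):
    """Return the 'Cached' figure in KiB from /proc/meminfo-style text, or None if absent."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == "Cached":
            return int(rest.split()[0])
    return None


def current_cached_kib(path="/proc/meminfo"):
    """Read the live figure on Linux; return None if the file is not available."""
    try:
        with open(path) as f:
            return cached_kib(f.read())
    except OSError:
        return None
```

Sampling current_cached_kib() before and after reading a large file that was not yet cached should show the number going up, and it stays up after the reading program exits.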
Due to the page cache architecture, when a program calls write(), bytes are simply copied to the page cache and the page is marked dirty. Disk I/O normally does not happen immediately, thus your program doesn't block waiting for the disk. On the downside, if the computer crashes, writes that are still sitting in the cache never make it to disk, hence critical files like database transaction logs must be fsync()ed (though one must still worry about drive controller caches, oy!). Reads, on the other hand, normally block your program until the data is available. Kernels employ eager loading to mitigate this problem, an example of which is read ahead, where the kernel preloads a few pages into the page cache in anticipation of your reads. You can help the kernel tune its eager loading behavior by providing hints on whether you plan to read a file sequentially or randomly (see madvise(), readahead(), Windows cache hints). Linux does read-ahead for memory-mapped files, but I'm not sure about Windows. Finally, it's possible to bypass the page cache using O_DIRECT in Linux or NO_BUFFERING in Windows, something database software often does.
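Both halves of that paragraph can be sketched with Python's os module. append_durably and read_sequentially are hypothetical names. The hint uses posix_fadvise, which is not one of the calls listed above but serves the same purpose for regular reads; it is only available on some Unix-like systems, hence the hasattr guard.

```python
import os
import tempfile


def append_durably(path, data):
    """Append `data`, then block until the kernel reports it flushed to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # copies into the page cache, marks pages dirty
            view = view[written:]
        os.fsync(fd)                       # without this, a crash could lose the write
    finally:
        os.close(fd)


def read_sequentially(path, chunk_size=64 * 1024):
    """Read the whole file front to back after telling the kernel that is the plan."""
    total = 0
    with open(path, "rb") as f:
        if hasattr(os, "posix_fadvise"):
            # offset 0 with length 0 means "the whole file"
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

The fsync call is where the program finally waits for the disk, so a transaction log would call something like append_durably once per commit rather than once per byte.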
A file mapping may be private or shared. This refers only to updates made to the contents in memory: in a private mapping the updates are not committed to disk or made visible to other processes, whereas in a shared mapping they are. Kernels use the copy on write mechanism, enabled by page table entries, to implement private mappings. In the example below, both render and another program called render3d (am I creative or what?) have mapped scene.dat privately. Render then writes to its virtual memory area that maps the file:
The read-only page table entries shown above do not mean the mapping is read only; they're merely a kernel trick to share physical memory until the last possible moment. You can see how 'private' is a bit of a misnomer until you remember it only applies to updates. A consequence of this design is that a virtual page that maps a file privately sees changes done to the file by other programs as long as the page has only been read from. Once copy-on-write is done, changes by others are no longer seen. This behavior is not guaranteed by the kernel, but it's what you get in x86 and makes sense from an API perspective. By contrast, a shared mapping is simply mapped onto the page cache and that's it. Updates are visible to other processes and end up on disk. Finally, if the mapping above were read-only, a write to it would trigger a segmentation fault instead of copy on write.
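The difference is easy to see from a high level wrapper. In the sketch below, Python's mmap module offers ACCESS_COPY for a private, copy-on-write mapping and ACCESS_WRITE for a shared, write-through one; poke_first_byte is a hypothetical helper that writes through the mapping and then reports what the file itself contains.

```python
import mmap
import os
import tempfile


def poke_first_byte(path, access):
    """Overwrite byte 0 through a mapping, then return the file's contents as read() sees them."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=access) as m:
            m[0:1] = b"X"       # on a private mapping this triggers copy on write
            if access == mmap.ACCESS_WRITE:
                m.flush()       # ask for the dirty page to be written back
    with open(path, "rb") as f:
        return f.read()
```

For a file containing b"scene", poke_first_byte(path, mmap.ACCESS_COPY) returns b"scene" because the update lived only in the process's private copy of the page, while poke_first_byte(path, mmap.ACCESS_WRITE) returns b"Xcene".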
Dynamically loaded libraries are brought into your program's address space via file mapping. There's nothing magical about it; it's the same private file mapping available to you via regular APIs. Below is an example showing part of the address spaces from two running instances of the file-mapping render program, along with physical memory, to tie together many of the concepts we've seen.
This concludes our 3-part series on memory fundamentals. I hope the series was useful and provided you with a good mental model of these OS topics. Next week there's one more post on memory usage figures, and then it's time for a change of air. Maybe some Web 2.0 gossip or something.