Original article:
"Playing with Virtual Memory"
http://www.snailinaturtleneck.com/blog/2011/08/30/playing-with-virtual-memory/
Further reading:
"Understanding Memory"
http://www.ualberta.ca/CNS/RESEARCH/LinuxClusters/mem.html
"Understanding Virtual Memory"
http://www.redhat.com/magazine/001nov04/features/vm/
"MongoDB and Memory" (《MongoDB与内存》)
http://huoding.com/2011/08/19/107
Playing with Virtual Memory
=========================================
When you run a process, it needs some memory to store things: its heap, its stack, and any libraries it’s using. Linux provides and cleans up memory for your process like an extremely conscientious butler. You can (and generally should) just let Linux do its thing, but it’s a good idea to understand the basics of what’s going on.
One easy way (I think) to understand this stuff is to actually look at what’s going on using the pmap command. pmap shows you memory information for a given process.
For example, let’s take a really simple C program that prints its own process id (PID) and pauses:
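    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main() {
        // print our PID so we know what to pass to pmap
        printf("run `pmap %d`\n", getpid());
        // sleep until a signal arrives
        pause();
    }
Save this as mem_munch.c. Now compile and run it:
    $ gcc mem_munch.c -o mem_munch
    $ ./mem_munch
    run `pmap 25681`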
The PID you get will probably be different than mine (25681).
At this point, the program will “hang.” This is because of the pause() function, and it’s exactly what we want. Now we can look at the memory for this process at our leisure.
Open up a new shell and run pmap, replacing the PID below with the one mem_munch gave you:
    $ pmap 25681
    25681:   ./mem_munch
    0000000000400000      4K r-x--  /home/user/mem_munch
    0000000000600000      4K r----  /home/user/mem_munch
    0000000000601000      4K rw---  /home/user/mem_munch
    00007fcf5af88000   1576K r-x--  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fcf5b112000   2044K -----  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fcf5b311000     16K r----  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fcf5b315000      4K rw---  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fcf5b316000     24K rw---    [ anon ]
    00007fcf5b31c000    132K r-x--  /lib/x86_64-linux-gnu/ld-2.13.so
    00007fcf5b512000     12K rw---    [ anon ]
    00007fcf5b539000     12K rw---    [ anon ]
    00007fcf5b53c000      4K r----  /lib/x86_64-linux-gnu/ld-2.13.so
    00007fcf5b53d000      8K rw---  /lib/x86_64-linux-gnu/ld-2.13.so
    00007fff7efd8000    132K rw---    [ stack ]
    00007fff7efff000      4K r-x--    [ anon ]
    ffffffffff600000      4K r-x--    [ anon ]
     total             3984K
This output is how memory “looks” to the mem_munch process. If mem_munch asks the operating system for 00007fcf5af88000, it will get libc. If it asks for 00007fcf5b31c000, it will get the ld library.
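The address-to-mapping lookup described above can be tried from inside a program. This is a minimal sketch, assuming Linux, where /proc/self/maps lists the same regions pmap reports for the current process. The function name find_mapping and its return convention are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the mapping of the current process that contains addr by scanning
 * /proc/self/maps (Linux-specific). On a match, line holds the matching
 * entry without its trailing newline and the function returns 1. It
 * returns 0 if no mapping contains addr and -1 if the file can't be opened.
 * line_len should be generous (a few hundred bytes) so entries fit.
 */
int find_mapping(const void *addr, char *line, size_t line_len)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    uintptr_t target = (uintptr_t)addr;
    unsigned long start, end;
    int found = 0;

    if (!maps) {
        return -1;
    }
    while (fgets(line, (int)line_len, maps)) {
        /* every entry begins with "start-end" in hexadecimal */
        if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
            target >= start && target < end) {
            line[strcspn(line, "\n")] = '\0';
            found = 1;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

Passing the address of a local variable from main should turn up the entry labeled [stack], and passing a pointer to a libc function such as printf should turn up one of the libc entries.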
This output is a bit dense and abstract, so let’s look at how some more familiar memory usage shows up. Change our program to put some memory on the stack and some on the heap, then pause.
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <stdlib.h>

    int main() {
        int on_stack, *on_heap;

        // local variables are stored on the stack
        on_stack = 42;
        printf("stack address: %p\n", &on_stack);

        // malloc allocates heap memory
        on_heap = (int*)malloc(sizeof(int));
        printf("heap address: %p\n", on_heap);

        printf("run `pmap %d`\n", getpid());
        pause();
    }
    $ ./mem_munch
    stack address: 0x7fff497670bc
    heap address: 0x1b84010
    run `pmap 11972`
Again, your exact numbers will probably be different than mine.
Before you kill mem_munch, run pmap on it:
    $ pmap 11972
    11972:   ./mem_munch
    0000000000400000      4K r-x--  /home/user/mem_munch
    0000000000600000      4K r----  /home/user/mem_munch
    0000000000601000      4K rw---  /home/user/mem_munch
    0000000001b84000    132K rw---    [ anon ]
    00007f3ec4d98000   1576K r-x--  /lib/x86_64-linux-gnu/libc-2.13.so
    00007f3ec4f22000   2044K -----  /lib/x86_64-linux-gnu/libc-2.13.so
    00007f3ec5121000     16K r----  /lib/x86_64-linux-gnu/libc-2.13.so
    00007f3ec5125000      4K rw---  /lib/x86_64-linux-gnu/libc-2.13.so
    00007f3ec5126000     24K rw---    [ anon ]
    00007f3ec512c000    132K r-x--  /lib/x86_64-linux-gnu/ld-2.13.so
    00007f3ec5322000     12K rw---    [ anon ]
    00007f3ec5349000     12K rw---    [ anon ]
    00007f3ec534c000      4K r----  /lib/x86_64-linux-gnu/ld-2.13.so
    00007f3ec534d000      8K rw---  /lib/x86_64-linux-gnu/ld-2.13.so
    00007fff49747000    132K rw---    [ stack ]
    00007fff497bb000      4K r-x--    [ anon ]
    ffffffffff600000      4K r-x--    [ anon ]
     total             4116K
Note that there’s a new entry between the final mem_munch section and libc-2.13.so. What could that be?
# from pmap
0000000001b84000 132K rw--- [ anon ]
# from our program
heap address: 0x1b84010
The addresses are almost the same. That block ([ anon ]) is the heap. (pmap labels blocks of memory that aren’t backed by a file [ anon ]. We’ll get into what being “backed by a file” means in a sec.)
The second thing to notice:
# from pmap
00007fff49747000 132K rw--- [ stack ]
# from our program
stack address: 0x7fff497670bc
And there’s your stack!
One other important thing to notice: this is how memory “looks” to your program, not how memory is actually laid out on your physical hardware. Look at how much memory mem_munch has to work with. According to pmap, mem_munch can address memory between address 0x0000000000400000 and 0xffffffffff600000 (well, actually 0x00007fffffffffff; beyond that is special). For those of you playing along at home, that’s about 128 terabytes of memory (2^47 bytes). That’s a lot of memory. (If your computer has that kind of memory, please leave your address and times you won’t be at home.)
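As a sanity check on the sizes involved, here is a small sketch that turns a top address into a count of addressable tebibytes (units of 2^40 bytes). It assumes x86-64 Linux with 48-bit virtual addresses, where user space ends at 0x00007fffffffffff; other configurations differ. The names USER_SPACE_TOP and addressable_tib are invented for this example.

```c
#include <stdint.h>

/* Assumed top of user space: x86-64 Linux with 48-bit virtual addresses. */
#define USER_SPACE_TOP UINT64_C(0x00007fffffffffff)

/*
 * Number of whole tebibytes (2^40 bytes) in the address range [0, top].
 * top must be less than UINT64_MAX so that top + 1 doesn't overflow.
 */
uint64_t addressable_tib(uint64_t top)
{
    return (top + 1) >> 40;
}
```

With these assumptions, addressable_tib(USER_SPACE_TOP) is 128, while the last address pmap lists, 0xffffffffff600000, gives 16777215 (about 16.8 million). The second figure is what you get if you naively count all the way up to the special region at the top.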
So, the amount of memory the program can address is kind of ridiculous. Why does the computer do this? Well, lots of reasons, but one important one is that this means you can address more memory than you actually have on the machine and let the operating system take care of making sure the right stuff is in memory when you try to access it.
Memory mapping a file basically tells the operating system to load the file so the program can access it as an array of bytes. Then you can treat a file like an in-memory array.
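The "file as an array of bytes" idea boils down to a few system calls. This is a minimal sketch, assuming a POSIX system. The helper name map_file_readonly is invented here, and it maps whatever size fstat reports rather than a hard-coded length.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map an entire file read-only and return a pointer to its bytes, storing
 * the file size in *len. Returns NULL on failure, including for an empty
 * file (a zero-length mapping isn't allowed). The caller should call
 * munmap(ptr, *len) when finished.
 */
const unsigned char *map_file_readonly(const char *path, size_t *len)
{
    struct stat st;
    void *bytes;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return NULL;
    }
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    bytes = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (bytes == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    *len = (size_t)st.st_size;
    return bytes;
}
```

After a successful call, bytes[0] through bytes[len - 1] read straight from the file, with the kernel pulling pages in from disk the first time each one is touched.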
For example, let’s make a (pretty stupid) random number generator by creating a file full of random numbers, then mmap-ing it and reading off random numbers.
First, we’ll create a big file called random (note that this creates a 1GB file, so make sure you have the disk space and be patient, it’ll take a little while to write):
    $ dd if=/dev/urandom bs=1024 count=1000000 of=/home/user/random
    1000000+0 records in
    1000000+0 records out
    1024000000 bytes (1.0 GB) copied, 123.293 s, 8.3 MB/s
    $ ls -lh random
    -rw-r--r-- 1 user user 977M 2011-08-29 16:46 random
Now we’ll mmap random and use it to generate random numbers.
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main() {
        char *random_bytes;
        FILE *f;
        int offset = 0;

        // open "random" for reading
        f = fopen("/home/user/random", "r");
        if (!f) {
            perror("couldn't open file");
            return -1;
        }

        // we want to inspect memory before mapping the file
        printf("run `pmap %d`, then press <enter>", getpid());
        getchar();

        random_bytes = mmap(0, 1000000000, PROT_READ, MAP_SHARED, fileno(f), 0);
        if (random_bytes == MAP_FAILED) {
            perror("error mapping the file");
            return -1;
        }

        // stop before reading past the end of the 1000000000-byte mapping
        while (offset < 1000000000) {
            printf("random number: %d (press <enter> for next number)", *(int*)(random_bytes+offset));
            getchar();
            offset += 4;
        }
        return 0;
    }
    $ ./mem_munch
    run `pmap 12727`, then press <enter>
The program hasn’t done anything yet, so the output of running pmap will basically be the same as it was above (I’ll omit it for brevity). However, if we continue running mem_munch by pressing enter, our program will mmap random.
Now if we run pmap it will look something like:
    $ pmap 12727
    12727:   ./mem_munch
    0000000000400000      4K r-x--  /home/user/mem_munch
    0000000000600000      4K r----  /home/user/mem_munch
    0000000000601000      4K rw---  /home/user/mem_munch
    000000000147d000    132K rw---    [ anon ]
    00007fe261c6f000 976564K r--s-  /home/user/random
    00007fe29d61c000   1576K r-x--  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fe29d7a6000   2044K -----  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fe29d9a5000     16K r----  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fe29d9a9000      4K rw---  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fe29d9aa000     24K rw---    [ anon ]
    00007fe29d9b0000    132K r-x--  /lib/x86_64-linux-gnu/ld-2.13.so
    00007fe29dba6000     12K rw---    [ anon ]
    00007fe29dbcc000     16K rw---    [ anon ]
    00007fe29dbd0000      4K r----  /lib/x86_64-linux-gnu/ld-2.13.so
    00007fe29dbd1000      8K rw---  /lib/x86_64-linux-gnu/ld-2.13.so
    00007ffff29b2000    132K rw---    [ stack ]
    00007ffff29de000      4K r-x--    [ anon ]
    ffffffffff600000      4K r-x--    [ anon ]
     total           980684K
This is very similar to before, but with an extra line (the one for /home/user/random), which kicks up virtual memory usage a bit (from 4MB to 980MB).
However, let’s re-run pmap with the -x option. This shows the resident set size (RSS): only 4KB of random are resident. Resident memory is memory that’s actually in RAM. There’s very little of random in RAM because we’ve only accessed the very start of the file, so the OS has only pulled the first bit of the file from disk into memory.
    $ pmap -x 12727
    12727:   ./mem_munch
    Address           Kbytes     RSS   Dirty Mode   Mapping
    0000000000400000       0       4       0 r-x--  mem_munch
    0000000000600000       0       4       4 r----  mem_munch
    0000000000601000       0       4       4 rw---  mem_munch
    000000000147d000       0       4       4 rw---    [ anon ]
    00007fe261c6f000       0       4       0 r--s-  random
    00007fe29d61c000       0     288       0 r-x--  libc-2.13.so
    00007fe29d7a6000       0       0       0 -----  libc-2.13.so
    00007fe29d9a5000       0      16      16 r----  libc-2.13.so
    00007fe29d9a9000       0       4       4 rw---  libc-2.13.so
    00007fe29d9aa000       0      16      16 rw---    [ anon ]
    00007fe29d9b0000       0     108       0 r-x--  ld-2.13.so
    00007fe29dba6000       0      12      12 rw---    [ anon ]
    00007fe29dbcc000       0      16      16 rw---    [ anon ]
    00007fe29dbd0000       0       4       4 r----  ld-2.13.so
    00007fe29dbd1000       0       8       8 rw---  ld-2.13.so
    00007ffff29b2000       0      12      12 rw---    [ stack ]
    00007ffff29de000       0       4       0 r-x--    [ anon ]
    ffffffffff600000       0       0       0 r-x--    [ anon ]
    ----------------  ------  ------  ------
    total kB          980684     508     100
If the virtual memory size (the Kbytes column) is all 0s for you, don’t worry about it. That’s a bug in Debian/Ubuntu’s -x option. The total is correct, it just doesn’t display correctly in the breakdown.
You can see that the resident set size, the amount that’s actually in memory, is tiny compared to the virtual memory. Your program can access any memory within a billion bytes of 0x00007fe261c6f000, but if it accesses anything past 4KB, it’ll probably have to go to disk for it*.
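If you'd rather ask the kernel directly than read pmap output, mincore() reports which pages of a mapping are currently resident. This is a rough sketch, assuming Linux with glibc. The helper name resident_pages is invented for illustration and isn't part of the original program.

```c
#define _DEFAULT_SOURCE /* exposes mincore() with glibc */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many pages of the mapping [addr, addr + len) are resident in
 * RAM, using mincore(). addr must be page-aligned (pointers returned by
 * mmap are) and len must be greater than zero. Returns -1 on error.
 */
long resident_pages(void *addr, size_t len)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t pages = (len + page_size - 1) / page_size;
    unsigned char *vec = malloc(pages);
    long resident = 0;
    size_t i;

    if (!vec) {
        return -1;
    }
    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }
    for (i = 0; i < pages; i++) {
        resident += vec[i] & 1; /* low bit set means the page is in RAM */
    }
    free(vec);
    return resident;
}
```

Called as resident_pages(random_bytes, 1000000000) right after the mmap, this should roughly agree with the RSS column for random (in pages rather than KB), though the exact count depends on what the kernel has cached.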
What if we modify our program so it reads the whole file/array of bytes?
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main() {
        char *random_bytes;
        FILE *f;
        int offset = 0;

        // open "random" for reading
        f = fopen("/home/user/random", "r");
        if (!f) {
            perror("couldn't open file");
            return -1;
        }

        random_bytes = mmap(0, 1000000000, PROT_READ, MAP_SHARED, fileno(f), 0);
        if (random_bytes == MAP_FAILED) {
            printf("error mapping the file\n");
            return -1;
        }

        for (offset = 0; offset < 1000000000; offset += 4) {
            // volatile keeps the compiler from optimizing the read away
            volatile int i = *(int*)(random_bytes+offset);

            // to show we're making progress
            if (offset % 1000000 == 0) {
                printf(".");
            }
        }

        // at the end, wait for signal so we can check mem
        printf("\ndone, run `pmap -x %d`\n", getpid());
        pause();
    }
    $ pmap -x 5378
    5378:   ./mem_munch
    Address           Kbytes     RSS   Dirty Mode   Mapping
    0000000000400000       0       4       4 r-x--  mem_munch
    0000000000600000       0       4       4 r----  mem_munch
    0000000000601000       0       4       4 rw---  mem_munch
    0000000002271000       0       4       4 rw---    [ anon ]
    00007fc2aa333000       0  976564       0 r--s-  random
    00007fc2e5ce0000       0     292       0 r-x--  libc-2.13.so
    00007fc2e5e6a000       0       0       0 -----  libc-2.13.so
    00007fc2e6069000       0      16      16 r----  libc-2.13.so
    00007fc2e606d000       0       4       4 rw---  libc-2.13.so
    00007fc2e606e000       0      16      16 rw---    [ anon ]
    00007fc2e6074000       0     108       0 r-x--  ld-2.13.so
    00007fc2e626a000       0      12      12 rw---    [ anon ]
    00007fc2e6290000       0      16      16 rw---    [ anon ]
    00007fc2e6294000       0       4       4 r----  ld-2.13.so
    00007fc2e6295000       0       8       8 rw---  ld-2.13.so
    00007fff037e6000       0      12      12 rw---    [ stack ]
    00007fff039c9000       0       4       0 r-x--    [ anon ]
    ffffffffff600000       0       0       0 r-x--    [ anon ]
    ----------------  ------  ------  ------
    total kB          980684  977072     104
Now if we access any part of the file, it will be in RAM already. (Probably. Until something else kicks it out.) So, our program can access a gigabyte of memory, but the operating system can lazily load it into RAM as needed.
And that’s why your virtual memory is so damn high when you’re running MongoDB.
Left as an exercise to the reader: try running pmap on a mongod process before it’s done anything, once you’ve done a couple operations, and once it’s been running for a long time.
* This isn’t strictly true**. The kernel actually says, “If they want the first N bytes, they’re probably going to want some more of the file” so it’ll load, say, the first dozen KB of the file into memory but only tell the process about 4KB. When your program tries to access memory that is in RAM, but that it didn’t know was in RAM, it’s called a minor page fault (as opposed to a major page fault, when it actually has to hit disk to load new info).
** This note is also not strictly true. In fact, the whole file will probably be in memory before you map anything because you just wrote the thing with dd. So you’ll just be doing minor page faults as your program “discovers” it.
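To see the minor/major distinction from these notes for yourself, you can read the process's own fault counters. This is a small sketch, assuming a system where getrusage() fills in ru_minflt and ru_majflt (Linux does). The helper name fault_counts is made up for this example.

```c
#include <sys/resource.h>
#include <sys/time.h>

/*
 * Fetch this process's cumulative page-fault counters from getrusage().
 * ru_minflt counts minor faults (the page was already in RAM) and
 * ru_majflt counts major faults (the page had to be read from disk).
 * Returns 0 on success and -1 on failure.
 */
int fault_counts(long *minor, long *major)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
    *minor = usage.ru_minflt;
    *major = usage.ru_majflt;
    return 0;
}
```

Calling fault_counts before and after the loop that reads the whole mapping, then subtracting, should show mostly minor faults if random is still sitting in the page cache from dd, and more major faults if it has been evicted.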