Memory in Linux is organized in pages (typically 4 KB in size). Contiguous linear addresses within a page are mapped onto contiguous physical addresses in RAM; contiguous pages, however, can be located anywhere in physical RAM. Access rights and physical address mappings in the kernel are maintained at page granularity rather than for every linear address. A page refers both to the set of linear addresses that it contains and to the data contained in that group of addresses.
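You can confirm the page size on a given system with getconf; a typical x86_64 box reports 4 KB -
$ getconf PAGESIZE
4096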
The paging unit thinks of all physical RAM as partitioned into fixed-length page frames. Each page frame contains a page. A page frame is a constituent of main memory, and hence it is a storage area. It is important to distinguish a page from a page frame; the former is just a block of data, which may be stored in any page frame or on disk. The paging unit translates linear addresses into physical ones. One key task of the unit is to check the requested access type against the access rights of the linear address. If the memory access is not valid, it generates a Page Fault exception (see Chapter 4 and Chapter 8). The data structures that map linear to physical addresses are called page tables; they are stored in main memory and must be properly initialized by the kernel before enabling the paging unit.
Pages can optionally be 4 MB in size. However, this is not advisable except for applications where the expected unit of data is large.
The kernel considers the following page frames as reserved:
A page contained in a reserved page frame can never be dynamically assigned or swapped to disk. As a general rule, the Linux kernel is installed in RAM starting from physical address 0x00100000, i.e., from the second megabyte. The total number of page frames required depends on how the kernel is configured; a typical configuration yields a kernel that can be loaded in less than 3 MB of RAM.
The remaining portion of the RAM barring the reserved page frames is called dynamic memory. It is a valuable resource, needed not only by the processes but also by the kernel itself. In fact, the performance of the entire system depends on how efficiently dynamic memory is managed. Therefore, all current multitasking operating systems try to optimize the use of dynamic memory, assigning it only when it is needed and freeing it as soon as possible.
The kernel must keep track of the current status of each page frame. For instance, it must be able to distinguish the page frames that are used to contain pages belonging to processes from those that contain kernel code or kernel data structures. Similarly, it must be able to determine whether a page frame in dynamic memory is free. A page frame in dynamic memory is free if it does not contain any useful data. It is not free when the page frame contains data of a User Mode process, data of a software cache, dynamically allocated kernel data structures, buffered data of a device driver, code of a kernel module, and so on.
A kernel function gets dynamic memory in a fairly straightforward manner since the kernel trusts itself. All kernel functions are assumed to be error-free, so the kernel does not need to insert any protection against programming errors.
When allocating memory to User Mode processes, the situation is entirely different:
When a User Mode process asks for dynamic memory, it doesn't get additional page frames; instead, it gets the right to use a new range of linear addresses, which become part of its address space. This interval is called a "memory region". A memory region consists of a range of linear addresses representing one or more page frames. Each memory region therefore consists of a set of pages that have consecutive page numbers.
Following are some typical situations in which a process gets new memory regions:
The term demand paging denotes a dynamic memory allocation technique that defers page frame allocation until the last possible moment, that is, until the process attempts to address a page that is not present in RAM, thus causing a Page Fault exception.
The motivation behind demand paging is that processes do not address all the addresses included in their address space right from the start; in fact, some of these addresses may never be used by the process. Moreover, the program locality principle ensures that, at each stage of program execution, only a small subset of the process pages are really referenced, and therefore the page frames containing the temporarily useless pages can be used by other processes. Demand paging is thus preferable to global allocation (assigning all page frames to the process right from the start and leaving them in memory until program termination), because it increases the average number of free page frames in the system and therefore allows better use of the available free memory. From another viewpoint, it allows the system as a whole to get better throughput with the same amount of RAM.
The price to pay for all these good things is system overhead: each Page Fault exception induced by demand paging must be handled by the kernel, thus wasting CPU cycles. Fortunately, the locality principle ensures that once a process starts working with a group of pages, it sticks with them without addressing other pages for quite a while. Thus Page Fault exceptions may be considered rare events.
An addressed page may not be present in main memory either because the page was never accessed by the process, or because the corresponding page frame has been reclaimed by the kernel.
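One visible consequence of demand paging is the gap between a process's virtual size (VSZ) and its resident size (RSS). For example, for the postgres process examined later in this section, ps reports something like the following -
$ ps -o pid,vsz,rss,cmd -p 7278
  PID     VSZ   RSS CMD
 7278 3675096 81156 /usr/local/postgres/pgsql8.2.3/bin/postgres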
Linux allows overcommitting memory to processes. As we have seen, even though a process may malloc() 1 GB, Linux does not issue it 1 GB immediately, but only issues memory when the process actually needs it. Additionally, Linux can overcommit the memory allocation: if 5 processes each ask for 1 GB but RAM and swap together add up to only 4 GB, Linux may still grant the full 5 GB without any error. The overcommit behavior depends on the vm.overcommit_memory and vm.overcommit_ratio settings; refer to http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt for further details on these parameters. In most cases overcommitting has no negative impact on the system; it only becomes dangerous when your processes actually use all of the memory they were granted and nothing is left over. On the other hand, overcommitting offers no real advantage in server environments where capacity planning and calculations should be performed accurately.
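Both settings can be inspected and changed with sysctl; the values shown here are the defaults (heuristic overcommit) -
$ sysctl vm.overcommit_memory vm.overcommit_ratio
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
$ sudo sysctl -w vm.overcommit_memory=2   # strictly enforce the commit limit instead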
atop shows the overcommit limit and the currently committed memory, but these numbers can be a bit misleading. I explain this calculation below.
atop output:
MEM | tot 11.7G | free 75.5M | cache 3.9G | dirty 66.7M | buff 42.1M | slab 198.8M |
SWP | tot 2.0G | free 2.0G | vmcom 9.2G | vmlim 7.8G |
meminfo output:
[user@server ~]$ cat /proc/meminfo
MemTotal: 12305340 kB
MemFree: 73672 kB
Buffers: 43120 kB
Cached: 4074220 kB
SwapTotal: 2048276 kB
SwapFree: 2047668 kB
Dirty: 62236 kB
Slab: 203948 kB
CommitLimit: 8200944 kB
Committed_AS: 9630052 kB
[user@server ~]$
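Here is the calculation: vmlim in atop is CommitLimit from meminfo, which the kernel computes as SwapTotal + MemTotal * overcommit_ratio / 100 (overcommit_ratio defaults to 50), while vmcom is Committed_AS. Verifying against the numbers above -
$ awk '/^(MemTotal|SwapTotal)/ {a[$1]=$2} END {print a["SwapTotal:"] + a["MemTotal:"] * 50 / 100 " kB"}' /proc/meminfo
8200946 kB
This lands within a couple of kB of the reported CommitLimit of 8200944 kB (the kernel rounds to whole pages), i.e. the 7.8 GB shown as vmlim. Committed_AS (9630052 kB, the 9.2 GB vmcom) already exceeds the limit, which is exactly the overcommit discussed above; under vm.overcommit_memory = 2 this state would not be allowed to arise.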
Page faults and swapping are two independent processes. A page fault takes place when a process accesses an address that has been allocated to it but has not yet been backed by a page frame. Upon receiving this request, the kernel confirms that the address being requested belongs to the process and, if so, allocates a new page frame from memory and assigns it to the process.
Swapping occurs in one of two scenarios -
In general page faults are rare since they only occur when a process touches part of its address space for the first time. In fact, on a long-running server where no new processes are being forked, page faults should almost never occur.
Swapping is bad for performance and should also never occur in a well-planned deployment. Swapping almost always signifies that your server does not have adequate memory to run all its processes. In fact, during constant swapping all your memory is used up by existing processes and there is no memory available for the disk cache either. Constant swapping can bring a server to a standstill. It is important to note that lack of memory for the page cache will never cause swapping; swapping occurs only when there is no memory available for your processes.
The resident size of a process (as shown in top or ps) represents the amount of non-swapped memory the kernel has already allocated to the process. This number is inaccurate when totalled across processes (especially in a multi-process app like postgres or apache) since it includes shared memory. It also does not include the swapped-out portion of the process. VmSize is the total memory of a program including its resident size, swap size, code, data, shared libraries, etc. The SWAP column in top is calculated as VmSize - RSS, which I believe is an incorrect calculation. Let's take an example and understand these numbers better -
[user@server ~]$ cat /proc/9894/status
Name: java
State: S (sleeping)
VmPeak: 4109896 kB
VmSize: 4099492 kB
VmLck: 0 kB
VmHWM: 2855336 kB
VmRSS: 2848964 kB
VmData: 4000304 kB
VmStk: 84 kB
VmExe: 36 kB
VmLib: 65392 kB
VmPTE: 5940 kB
We can conclude from the above that the process has a virtual size (VmSize) of about 3.9 GB, of which about 2.71 GB (VmRSS) is resident in RAM; top's SWAP column would report the difference of roughly 1.2 GB as swapped, even though almost nothing has actually been swapped out.
There is another aspect to remember here. Even though the resident size of the above program is 2.71 GB, that does not mean the program is actually using 2.71 GB at this time. For instance, in the above java program, java asks the kernel for additional memory whenever it needs it, up to the limit specified for the java process. This memory is resident (unless a portion is swapped out). However, after running an intensive task, when java clears a large set of objects through a gc, this memory is not given back to the OS. The actual memory used by java at a point in time may be significantly less than RSS. This can be measured independently provided the process allows you to do so.
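For a JVM specifically, you can ask the runtime itself how much of its heap is actually live; a sketch, assuming the JDK's jstat tool is available on the server (9894 is the java process shown above) -
$ jstat -gc 9894           # heap capacities vs current utilization, per generation
$ jstat -gcutil 9894 5000  # resample the utilization percentages every 5 seconds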
Note that the VmHWM parameter is interesting in that it shows the peak amount of physical memory the process has required (its "high water mark").
Minor page fault: If the page is loaded in memory at the time the fault is generated, but is not marked in the memory management unit as being loaded in memory, then it is called a minor or soft page fault. The page fault handler in the operating system merely needs to make the entry for that page in the memory management unit point to the page in memory and indicate that the page is loaded in memory; it does not need to read the page into memory. This could happen if the memory is shared by different programs and the page is already brought into memory for other programs.
Major page fault: If the page is not loaded in memory at the time the fault is generated, then it is called a major or hard page fault. The page fault handler in the operating system needs to find a free page frame in memory, or choose a page in memory to be evicted (writing out its data if it has been modified since it was last written out, and marking it as no longer loaded), read the data for the faulting page into that page frame, and then make the entry for that page in the memory management unit point to the page frame and indicate that the page is loaded in memory. Major faults are more expensive than minor page faults and add disk latency to the interrupted program's execution. This is the mechanism an operating system uses to increase the amount of program memory available on demand: the operating system delays loading parts of the program from disk until the program attempts to use them and a page fault is generated.
Invalid page fault: If a page fault occurs for a reference to an address that's not part of the virtual address space, so that there can't be a page in memory corresponding to it, then it is called an invalid page fault. The page fault handler in the operating system then needs to terminate the code that made the reference, or deliver an indication to that code that the reference was invalid.
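Both kinds of fault are easy to observe per process: GNU time reports them for a command it runs, and ps exposes the cumulative counters of a running process. The command and the counts below are purely illustrative -
$ /usr/bin/time -v cat /var/log/messages > /dev/null
        ...
        Major (requiring I/O) page faults: 3
        Minor (reclaiming a frame) page faults: 212
        ...
$ ps -o pid,min_flt,maj_flt -p 9894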
(More details available in the disk IO section)
The page cache is the main disk cache used by the Linux kernel. In most cases, the kernel refers to the page cache when reading from or writing to disk. New pages are added to the page cache to satisfy User Mode processes' read requests. If the page is not already in the cache, a new entry is added to the cache and filled with the data read from the disk. If there is enough free memory, the page is kept in the cache for an indefinite period of time and can then be reused by other processes without accessing the disk.
Similarly, before writing a page of data to a block device, the kernel verifies whether the corresponding page is already included in the cache; if not, a new entry is added to the cache and filled with the data to be written on disk. The I/O data transfer does not start immediately: the disk update is delayed for a few seconds (unless an explicit fsync() is called), thus giving a chance to the processes to further modify the data to be written (in other words, the kernel implements deferred write operations).
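The effect of deferred writes is easy to feel with dd: the second command below forces the data to disk before dd exits and is typically far slower than the first, which merely dirties the page cache (the file name and size are arbitrary) -
$ dd if=/dev/zero of=/tmp/testfile bs=1M count=256               # returns as soon as the pages are dirtied
$ dd if=/dev/zero of=/tmp/testfile bs=1M count=256 conv=fsync    # fsync()s the file before exiting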
Typically the kernel will use as much of the available dynamic memory as possible for the page cache, reclaiming page frames from it periodically or whenever they are needed by a process or by newer pages that must be written into the page cache. When the system load is low, RAM is filled mostly by the disk caches and the few running processes can benefit from the information stored in them. However, when the system load increases, RAM is filled mostly by pages of the processes and the caches are shrunk to make room for additional processes. Page reclaiming by default uses an LRU algorithm.
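You can watch the page cache absorb file data directly; the file name and the second Cached value here are hypothetical -
$ grep ^Cached: /proc/meminfo
Cached:          3500048 kB
$ cat /data/large-file > /dev/null
$ grep ^Cached: /proc/meminfo
Cached:          4524288 kB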
Read http://www.redhat.com/magazine/001nov04/features/vm/ for details on the lifecycle of a memory page
The objective of the page frame reclaiming algorithm (PFRA) is to pick up page frames and make them free. The PFRA is invoked under different conditions and handles page frames in different ways based on their content.
The PFRA is invoked on one of the following -
The types of pages are as follows -
In the above table, a page is said to be mapped if it maps a portion of a file. For instance, all pages in the User Mode address spaces belonging to file memory mappings are mapped, as well as any other page included in the page cache. In almost all cases, mapped pages are syncable: in order to reclaim the page frame, the kernel must check whether the page is dirty and, if necessary, write the page contents in the corresponding disk file.
Conversely, a page is said to be anonymous if it belongs to an anonymous memory region of a process (for instance, all pages in the User Mode heap or stack of a process are anonymous). In order to reclaim the page frame, the kernel must save the page contents in a dedicated disk partition or disk file called a "swap area"; therefore, all anonymous pages are swappable.
When the PFRA must reclaim a page frame belonging to the User Mode address space of a process, it must take into consideration whether the page frame is shared or non-shared. A shared page frame belongs to multiple User Mode address spaces, while a non-shared page frame belongs to just one. Notice that a non-shared page frame might belong to several lightweight processes referring to the same memory descriptor. Shared page frames are typically created when a process spawns a child or when two or more processes access the same file by means of a shared memory mapping.
PFRA algorithm considerations:
The PFRA classifies memory into active and inactive. /proc/meminfo provides the current active and inactive memory. Here is an example -
[root@server]# cat /proc/meminfo
MemTotal: 132093140 kB
MemFree: 591272 kB
Buffers: 239488 kB
Cached: 125650056 kB
SwapCached: 0 kB
Active: 25157088 kB
Inactive: 103410468 kB
HighTotal: 0 kB
HighFree: 0 kB
<snip>
This shows that active memory is 25 GB while inactive memory is 103 GB. Starting from Linux kernel 2.6.xx onwards, these functions are handled by pdflush, kswapd, and the Page Frame Reclaiming Algorithm.
Linux maintains two lists in the page cache - the Active List and the Inactive List. The Page Frame Reclaiming Algorithm gathers pages that were recently accessed in the active list so that it will not scan them when looking for a page frame to reclaim. Conversely, the PFRA gathers the pages that have not been accessed for a long time in the inactive list. Of course, pages should move from the inactive list to the active list and back, according to whether they are being accessed.
Clearly, two page states ("active" and "inactive") are not sufficient to describe all possible access patterns. For instance, suppose a logger process writes some data in a page once every hour. Although the page is "inactive" for most of the time, the access makes it "active," thus denying the reclaiming of the corresponding page frame, even if it is not going to be accessed for an entire hour. Of course, there is no general solution to this problem, because the PFRA has no way to predict the behavior of User Mode processes; however, it seems reasonable that pages should not change their status on every single access.
The PG_referenced flag in the page descriptor is used to double the number of accesses required to move a page from the inactive list to the active list; it is also used to double the number of "missing accesses" required to move a page from the active list to the inactive list (see below). For instance, suppose that a page in the inactive list has the PG_referenced flag set to 0. The first page access sets the value of the flag to 1, but the page remains in the inactive list. The second page access finds the flag set and causes the page to be moved in the active list. If, however, the second access does not occur within a given time interval after the first one, the page frame reclaiming algorithm may reset the PG_referenced flag.
The active and inactive memory figures can be used to infer a number of things, as follows -
Despite the PFRA's efforts to keep a reserve of free page frames, it is possible for the pressure on the virtual memory subsystem to become so high that all available memory becomes exhausted. This situation could quickly induce a freeze of every activity in the system: the kernel keeps trying to free memory in order to satisfy some urgent request, but it does not succeed because the swap areas are full and all disk caches have already been shrunk. As a consequence, no process can proceed with its execution, and thus no process will ever free up the page frames that it owns.
To cope with this dramatic situation, the PFRA makes use of a so-called out of memory (OOM) killer, which selects a process in the system and abruptly kills it to free its page frames. The OOM killer is like a surgeon that amputates the limb of a man to save his life: losing a limb is not a nice thing, but sometimes there is nothing better to do.
The out_of_memory() function is invoked when free memory is very low and the PFRA has not succeeded in reclaiming any page frames. The function selects a victim among the existing processes, then invokes oom_kill_process() to perform the sacrifice.
Of course the process is not picked at random. The selected process should satisfy several requisites:
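On 2.6-era and later kernels you can see how likely a given process is to be chosen, and bias the choice; the exact files depend on the kernel version (oom_adj on older kernels, oom_score_adj on newer ones) -
$ cat /proc/9894/oom_score        # the current "badness" score the OOM killer would use
# echo -17 > /proc/9894/oom_adj   # -17 exempts the process on older kernels (run as root)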
One indicator of running into OOM is to look at the combination of free memory, Inactive memory and free swap in /proc/meminfo as explained below -
[user@server ~]$ cat /proc/meminfo
MemTotal: 12305340 kB
MemFree: 79968 kB
Buffers: 165376 kB
Cached: 3500048 kB
SwapCached: 0 kB
Active: 9819744 kB
Inactive: 1787500 kB
SwapTotal: 2048276 kB
SwapFree: 2047668 kB
Dirty: 80108 kB
In the above example -
If your server ever has an issue where the OOM killer was activated you have seriously neglected your memory monitoring. This condition must NEVER take place on any server.
Check http://linux-mm.org/Drop_Caches to learn how to drop the page cache in Linux. You can experiment with this command in combination with the output of meminfo (Cached, Active memory, Inactive memory) and fincore to determine how much of your data store is typically loaded into cache within how much time and what portion of it is extremely active.
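For reference, the procedure described at that link boils down to the following (run as root; drop_caches only discards clean pages, hence the sync first) -
# sync
# echo 1 > /proc/sys/vm/drop_caches   # drop the page cache
# echo 2 > /proc/sys/vm/drop_caches   # drop dentries and inodes
# echo 3 > /proc/sys/vm/drop_caches   # drop both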
MEM | tot 126.0G | free 6.4G | cache 113.2G | dirty 924.9M | buff 394.7M | slab 1.8G
SWP | tot 2.0G | free 2.0G | vmcom 10.1G | vmlim 65.0G |
atop shows the system memory as a whole broken up as -
[bhavin.t@mongo-history-1 ~]$ cat /proc/meminfo
MemTotal: 62168992 kB
MemFree: 287900 kB
Buffers: 12264 kB
Cached: 59953784 kB
SwapCached: 0 kB
Active: 29934172 kB
Inactive: 30168836 kB
Active(anon): 137004 kB
Inactive(anon): 24 kB
Active(file): 29797168 kB
Inactive(file): 30168812 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 10832 kB
Writeback: 0 kB
AnonPages: 136704 kB
Mapped: 863444 kB
Shmem: 68 kB
Slab: 1526616 kB
SReclaimable: 1498556 kB
SUnreclaim: 28060 kB
KernelStack: 1520 kB
PageTables: 110824 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 31084496 kB
Committed_AS: 393640 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 116104 kB
VmallocChunk: 34359620200 kB
DirectMap4k: 63496192 kB
DirectMap2M: 0 kB
This file shows detailed virtual memory statistics from the kernel. Most of the counters explained below are available only if the kernel was compiled with the VM_EVENT_COUNTERS config option turned on. This is because most of these counters serve no function for the kernel itself but are useful for debugging and statistics purposes.
[user@server proc]$ cat /proc/vmstat
nr_anon_pages 2014051
nr_mapped 11691
nr_file_pages 890051
nr_slab_reclaimable 128956
nr_slab_unreclaimable 9670
nr_page_table_pages 5628
nr_dirty 15158
nr_writeback 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 4737
pgpgin 2280999
pgpgout 76513335
pswpin 0
pswpout 152
pgalloc_dma 1
pgalloc_dma32 27997500
pgalloc_normal 108826482
pgfree 136842914
pgactivate 24663564
pgdeactivate 8083378
pgfault 266178186
pgmajfault 2228
pgrefill_dma 0
pgrefill_dma32 6154199
pgrefill_normal 19920764
pgsteal_dma 0
pgsteal_dma32 0
pgsteal_normal 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 3203616
pgscan_kswapd_normal 4431168
pgscan_direct_dma 0
pgscan_direct_dma32 1056
pgscan_direct_normal 2368
pginodesteal 0
slabs_scanned 391808
kswapd_steal 7598807
kswapd_inodesteal 0
pageoutrun 49495
allocstall 37
pgrotated 154
Of the above the following are important -
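A quick way to watch the important counters move is to sample /proc/vmstat over an interval; the 5-second gap here is arbitrary -
$ grep -E '^(pgfault|pgmajfault|pswpin|pswpout) ' /proc/vmstat; sleep 5; grep -E '^(pgfault|pgmajfault|pswpin|pswpout) ' /proc/vmstat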
[user@server ~]$ vmstat -a -S M 5
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 2 6593 394 115893 0 0 690 767 1 2 32 12 53 4 0
3 0 2 6585 394 115901 0 0 204 6310 6005 23103 29 15 53 2 0
2 1 2 6549 394 115912 0 0 182 4707 5102 20867 38 13 48 2 0
[user@server ~]$ vmstat -S M 5
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
r b swpd free inact active si so bi bo in cs us sy id wa st
4 0 2 6390 48082 71527 0 0 690 767 1 2 32 12 53 4 0
2 0 2 6383 48082 71534 0 0 87 4614 5859 21944 34 13 51 1 0
3 0 2 6376 48082 71543 0 0 137 5164 4925 19994 23 12 64 1 0
vmstat shows the following memory-related fields -
[user@server ~]$ cat /proc/7278/status
<snip>
FDSize: 1024
Groups: 26
VmPeak: 3675100 kB
VmSize: 3675096 kB
VmLck: 0 kB
VmHWM: 81160 kB
VmRSS: 81156 kB
VmData: 944 kB
VmStk: 84 kB
VmExe: 3072 kB
VmLib: 2044 kB
VmPTE: 244 kB
StaBrk: 0ac3c000 kB
Brk: 0ac82000 kB
StaStk: 7fff35863220 kB
Threads: 1
[user@server ~]$ cat /proc/7278/statm
918774 20289 20186 768 0 257 0
Table 1-2: Contents of the statm files (as of 2.6.8-rc3)
..............................................................................
Field Content
size total program size (pages) (same as VmSize in status)
resident size of memory portions (pages) (same as VmRSS in status)
shared number of pages that are shared (i.e. backed by a file)
trs number of pages that are 'code' (not including libs; broken,
includes data segment)
lrs number of pages of library (always 0 on 2.6)
drs number of pages of data/stack (including libs; broken,
includes library text)
dt number of dirty pages (always 0 on 2.6)
..............................................................................
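Since statm reports in pages, multiply by the page size to compare with status; for the process above, 918774 pages x 4 kB = 3675096 kB (VmSize) and 20289 pages x 4 kB = 81156 kB (VmRSS) -
$ awk -v p=$(getconf PAGESIZE) '{print $1*p/1024 " kB total, " $2*p/1024 " kB resident"}' /proc/7278/statm
3675096 kB total, 81156 kB resident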
[user@server ~]$ cat /proc/7278/stat
7278 (postgres) S 1 7257 7257 0 -1 4202496 36060376 10845160168 0 749 20435 137212 158536835 39143290 15 0 1 0 50528579 3763298304 20289 18446744073709551615 4194304 7336916 140734091375136 18446744073709551615 225773929891 0 0 19935232 84487 0 0 0 17 2 0 0 12
Table 1-3: Contents of the stat files (as of 2.6.22-rc3)
..............................................................................
Field Content
pid process id
tcomm filename of the executable
state state (R is running, S is sleeping, D is sleeping in an
uninterruptible wait, Z is zombie, T is traced or stopped)
ppid process id of the parent process
pgrp pgrp of the process
sid session id
tty_nr tty the process uses
tty_pgrp pgrp of the tty
flags task flags
min_flt number of minor faults
cmin_flt number of minor faults with child's
*maj_flt number of major faults
cmaj_flt number of major faults with child's
utime user mode jiffies
stime kernel mode jiffies
cutime user mode jiffies with child's waited for
cstime kernel mode jiffies with child's waited for
priority priority level
nice nice level
num_threads number of threads
it_real_value (obsolete, always 0)
start_time time the process started after system boot
vsize virtual memory size
rss resident set memory size
rsslim current limit in bytes on the rss
start_code address above which program text can run
end_code address below which program text can run
start_stack address of the start of the stack
esp current value of ESP
eip current value of EIP
pending bitmap of pending signals (obsolete)
blocked bitmap of blocked signals (obsolete)
sigign bitmap of ignored signals (obsolete)
sigcatch bitmap of caught signals (obsolete)
wchan address where process went to sleep
0 (place holder)
0 (place holder)
exit_signal signal to send to parent thread on exit
task_cpu which CPU the task is scheduled on
rt_priority realtime priority
policy scheduling policy (man sched_setscheduler)
blkio_ticks time spent waiting for block IO
..............................................................................
[user@server ~]$ cat /proc/7278/smaps
00400000-00700000 r-xp 00000000 08:03 6424710 /usr/local/postgres/pgsql8.2.3/bin/postgres
Size: 3072 kB
Rss: 2108 kB
Shared_Clean: 2108 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Swap: 0 kB
2b3a78a33000-2b3b5493f000 rw-s 00000000 00:09 1114115 /SYSV0052e2c1 (deleted)
Size: 3603504 kB
Rss: 2129800 kB
Shared_Clean: 54300 kB
Shared_Dirty: 2075500 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Swap: 0 kB
smaps shows, for each mapping in a process, the memory distribution across the various libraries, data segments, and program text, and what portion of it is shared. For instance, above I have snipped out two entries from postgres showing that the postgres executable is taking about 2 MB of shared memory and the postgres internal cache is taking about 2 GB of shared memory.
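To total these figures across all mappings of a process, an awk one-liner over smaps works; a rough sketch -
$ awk '/^Rss:/ {rss += $2} /^Shared_(Clean|Dirty):/ {sh += $2} /^Private_(Clean|Dirty):/ {pr += $2} END {print rss " kB rss, of which " sh " kB shared and " pr " kB private"}' /proc/7278/smaps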
[root@server]# pmap -x 30850 | less
Address Kbytes RSS Dirty Mode Mapping
0000000040000000 36 0 0 r-x-- java
0000000040108000 8 8 8 rwx-- java
0000000041373000 1469492 1469352 1469352 rwx-- [ anon ]
000000071ae00000 45120 44740 44740 rwx-- [ anon ]
000000071da10000 38848 0 0 ----- [ anon ]
0000000720000000 3670016 3670016 3670016 rwx-- [ anon ]
00007ff67286f000 12 0 0 ----- [ anon ]
00007ff672872000 1016 24 24 rwx-- [ anon ]
00007ff672970000 12 0 0 ----- [ anon ]
00007ff672973000 1016 24 24 rwx-- [ anon ]
...
Mem: 132093140k total, 128645860k used, 3447280k free, 413200k buffers
Swap: 2096472k total, 2596k used, 2093876k free, 122750144k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP CODE DATA nFLT nDRT COMMAND
21827 postgres 15 0 3626m 2.1g 2.0g S 15.5 1.6 10:20.94 1.5g 3072 32m 0 0 postgres
19638 postgres 15 0 3626m 2.1g 2.0g S 14.5 1.6 14:03.23 1.5g 3072 32m 0 0 postgres
27306 postgres 15 0 3618m 2.1g 2.0g R 11.6 1.6 9:34.90 1.5g 3072 24m 0 0 postgres
19673 postgres 15 0 3626m 2.1g 2.0g S 10.9 1.6 8:40.20 1.5g 3072 32m 0 0 postgres
22068 postgres 15 0 3626m 2.1g 2.0g S 10.2 1.6 15:20.89 1.5g 3072 32m 0 0 postgres
4339 postgres 15 0 3618m 2.1g 2.0g S 8.6 1.6 8:04.42 1.5g 3072 24m 0 0 postgres
top shows the following global memory-related fields -
vmtouch is a great tool for learning about and controlling the file system cache of unix and unix-like systems. You can use it to learn how much of a file is in memory, to evict files from memory, and so on.
Example 1
How much of the /bin/ directory is currently in cache?
$ vmtouch /bin/
Files: 92
Directories: 1
Resident Pages: 348/1307 1M/5M 26.6%
Elapsed: 0.003426 seconds
Example 2
We have 3 big datasets, a.txt, b.txt, and c.txt but only 2 of them will fit in memory at once. If we have a.txt and b.txt in memory but would now like to work with b.txt and c.txt, we could just start loading up c.txt but then our system would evict pages from both a.txt (which we want) and b.txt (which we don't want).
So let's give the system a hint and evict a.txt from memory, making room for c.txt:
$ vmtouch -ve a.txt
Evicting a.txt
Files: 1
Directories: 0
Evicted Pages: 42116 (164M)
Elapsed: 0.076824 seconds
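The inverse also works: vmtouch can pre-fault a file into the cache, and optionally lock it there -
$ vmtouch -t c.txt   # touch every page of c.txt, loading it into the cache
$ vmtouch -l c.txt   # additionally mlock() the pages; vmtouch must stay running to hold the lock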
fincore is a great tool that can be used to measure how much of a file is currently in the disk cache. This can be used to determine rough cache usage for an application.
root@xxxxxx:/var/lib/mysql/blogindex# fincore --pages=false --summarize --only-cached *
stats for CLUSTER_LOG_2010_05_21.MYI: file size=93840384 , total pages=22910 , cached pages=1 , cached size=4096, cached perc=0.004365
stats for CLUSTER_LOG_2010_05_22.MYI: file size=417792 , total pages=102 , cached pages=1 , cached size=4096, cached perc=0.980392
stats for CLUSTER_LOG_2010_05_23.MYI: file size=826368 , total pages=201 , cached pages=1 , cached size=4096, cached perc=0.497512
stats for CLUSTER_LOG_2010_05_24.MYI: file size=192512 , total pages=47 , cached pages=1 , cached size=4096, cached perc=2.127660
stats for CLUSTER_LOG_2010_06_03.MYI: file size=345088 , total pages=84 , cached pages=43 , cached size=176128, cached perc=51.190476
stats for CLUSTER_LOG_2010_06_04.MYD: file size=1478552 , total pages=360 , cached pages=97 , cached size=397312, cached perc=26.944444
stats for CLUSTER_LOG_2010_06_04.MYI: file size=205824 , total pages=50 , cached pages=29 , cached size=118784, cached perc=58.000000
Optimizing memory usage consists of the following principles -
This can be achieved as follows -
Page faults will occur when a new process is forked or when an existing process requests additional memory allocation. In a constantly running server these situations should not arise too often, and therefore you should see very little page faulting on the server, especially major faults (minor faults are fine since they require no disk access; major faults may require disk access).
Here is a small probabilistic model that shows how segregating the disk caches for different applications can help optimize memory usage -
Refer to http://www.westnet.com/~gsmith/content/linux-pdflush.htm and http://www.cyberciti.biz/faq/linux-kernel-tuning-virtual-memory-subsystem/ for tips on tuning various parameters of pdflush and kswapd which control the page reclaim logic of your disk cache
If your application does not require fsync()ing data immediately, you can gain a considerable performance boost from the write-back nature of the disk cache. Most databases, mail servers, etc. fsync() on each write since they cannot afford to lose data. However, if you have created a custom data store, you may have a model wherein you write the same data onto multiple nodes synchronously. In this case not all of the nodes need to fsync() the data, since a replica is available in case of a total node failure. In this situation the total number of writes is reduced, since many updates cancel previous writes and multiple writes can be merged together, resulting in fewer IOPS, which helps with both flash drives and SATA drives.
Refer to http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt