Recovery
Once an OOM situation occurs, what happens next? The kernel will certainly terminate one process. Why kill a process? It is the only way to stop further memory requests. The kernel cannot assume a process has some built-in mechanism to stop requesting memory on its own, so it has no choice but to kill it.
How does the kernel know exactly which process to kill? The answer lies in mm/oom_kill.c in the Linux source tree. This C code implements the kernel's so-called OOM killer. The badness() function gives a score to each existing process; the one with the highest score becomes the victim. The criteria are:
- VM size. This is not the sum of all allocated pages, but the sum of the sizes of all VMAs the process owns. The bigger the VM size, the higher the score.
- Related to #1, the VM size of the process's children is important too. The count is cumulative: if a process has one or more children, their VM sizes are added to the parent's score.
- Niced processes, meaning those with a nice value greater than zero, get more points; the assumption is that a niced process is less important.
- Superuser processes are important, by assumption; thus they have their scores reduced.
- Process runtime. The longer it runs, the lower the score.
- Processes that perform direct hardware access are less likely to be chosen.
- The swapper (pid 0) and init (pid 1) processes, as well as kernel threads, are excluded from the list of potential victims entirely.
The process with the highest score "wins" the election, and the OOM killer kills it shortly afterward.
The heuristic isn't perfect, but it works well in most situations. Criteria #1 and #2 clearly show that it is the VMA size that matters, not the number of pages the process actually has. You might think that measuring VMA size would trigger false alarms, but luckily it doesn't: the badness() call occurs inside the page allocation functions, when few free pages are left and page frame reclamation has failed, so at that point the VMA size closely matches the number of pages owned by the process.
Why not just count the actual number of pages? That would take more time and require locks, making the procedure too expensive for a fast decision. Since the OOM killer isn't perfect, you must be ready for the occasional wrong kill.
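To make the scoring more concrete, here is a heavily simplified user-space sketch of how such criteria could combine into a single score. This is not the actual kernel code; the struct and its fields are invented purely for illustration:

/* Illustrative paraphrase of a badness()-style heuristic.
 * NOT the real kernel code: badness() works on task_struct and
 * uses finer-grained math, but the spirit is similar. */
struct candidate {
    unsigned long vm_pages;       /* size of all its VMAs, in pages          */
    unsigned long children_pages; /* cumulative VMA size of its children     */
    unsigned long runtime;        /* how long the process has been running   */
    int nice;                     /* nice value; greater than 0 means niced  */
    int superuser;                /* nonzero if it runs as root              */
    int raw_io;                   /* nonzero if it does direct hardware I/O  */
};

static unsigned long score(const struct candidate *c)
{
    unsigned long points = c->vm_pages + c->children_pages;

    if (c->runtime > 1)
        points /= c->runtime;  /* crude: the longer it runs, the lower the score */
    if (c->nice > 0)
        points *= 2;           /* niced processes get more points                */
    if (c->superuser)
        points /= 4;           /* superuser processes get a discount             */
    if (c->raw_io)
        points /= 4;           /* so do processes doing raw hardware access      */

    return points;             /* the highest score marks the victim             */
}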
The kernel then signals the chosen process to stop: a process that performs direct hardware access receives SIGTERM, giving it a chance to clean up, while an ordinary process receives an uncatchable SIGKILL.
How to Reduce OOM Risk
The basic rule for avoiding OOM risk is simple: don't allocate beyond the machine's available free space. In practice, many factors come into play, so the strategy needs a few refinements.
Reduce Fragmentation by Properly Ordering Allocation
There is no need to use any sophisticated allocator. You can reduce fragmentation simply by ordering your memory allocations and deallocations properly. Because holes appear easily, employ a LIFO strategy: the last block you allocate should be the first one you free.
For example, instead of doing:
void *a;
void *b;
void *c;
/* ... */
a = malloc(1024);
b = malloc(5678);
c = malloc(4096);
/* ... */
free(b);
b = malloc(12345);
It's better to do:
a = malloc(1024);
c = malloc(4096);
b = malloc(5678);
/* ... */
free(b);
b = malloc(12345);
This way, there won't be any hole between the a and c chunks. You can also consider realloc() to resize any existing malloc()-ed blocks.
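For instance, here is a minimal sketch of growing an existing block in place with realloc() instead of allocating a fresh one (the helper name grow_buffer is made up for this example):

#include <stdlib.h>

/* Grow an existing buffer instead of creating a new allocation
 * and leaving yet another hole behind. */
char *grow_buffer(char *buf, size_t new_size)
{
    char *tmp = realloc(buf, new_size);
    if (tmp == NULL)
        return NULL;   /* realloc failed; the original block is still valid */
    return tmp;
}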
Two example programs (fragmented1.c and fragmented2.c) demonstrate the effect of allocation rearrangement. Reports at the end of both programs give the number of bytes allocated by the system (the kernel and the glibc allocator) and the number of bytes actually used. For example, on kernel 2.6.11.1 with glibc 2.3.3-27 and executed without an explicit parameter, fragmented1 wasted 319858832 bytes (about 305 MB) while fragmented2 wasted 2089200 bytes (about 2 MB). That's roughly 153 times less!
You can do further experiments by passing various numbers as the program parameter. This parameter acts as the request size of each malloc() call.
Tweak Kernel's Overcommit Behavior
You can change the behavior of the Linux kernel through the /proc filesystem, as documented in Documentation/vm/overcommit-accounting in the Linux kernel's source code. You have three choices when tuning kernel overcommit, expressed as numbers in /proc/sys/vm/overcommit_memory:
- 0 means that the kernel uses predefined heuristics when deciding whether to allow an overcommit. This is the default.
- 1 always overcommits. Perhaps you now realize the danger of this mode.
- 2 prevents overcommit from exceeding a certain watermark. The watermark is also tunable through /proc/sys/vm/overcommit_ratio. In this mode, the total commit cannot exceed the swap space size plus overcommit_ratio percent of RAM size. By default, the overcommit ratio is 50.
The default mode usually works well in most situations, but mode #2 offers better protection against overcommit. On the other hand, mode #2 requires you to predict carefully how much space all running applications need. You certainly don't want your applications to fail to get more memory chunks just because the limit is too strict. However, mode #2 is the best way to avoid having a program killed suddenly.
Suppose that you have 256MB of RAM and 256MB of swap and you want to limit overcommit at 384MB. That means 256MB of swap + 50 percent * 256MB of RAM, so put 50 in /proc/sys/vm/overcommit_ratio.
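As a quick sanity check, a tiny C program can reproduce that arithmetic; the hard-coded sizes match the example above and are otherwise arbitrary:

#include <stdio.h>

/* Mode-2 commit limit: swap size + overcommit_ratio percent of RAM. */
int main(void)
{
    unsigned long ram_mb  = 256;
    unsigned long swap_mb = 256;
    unsigned long ratio   = 50;   /* the value written to overcommit_ratio */

    printf("commit limit: %lu MB\n", swap_mb + ram_mb * ratio / 100);  /* 384 MB */
    return 0;
}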
Check for NULL Pointer after Memory Allocation and Audit for Memory Leaks
This is a simple rule, but it is sometimes omitted. By checking for NULL, at least you know that the allocator could extend the memory area, although there is no guarantee that it will actually allocate the needed pages later. Usually, you need to bail out or delay the allocation for a moment, depending on your scenario. Together with the overcommit tunables, checking for NULL gives you a decent tool to anticipate OOM, because malloc() returns NULL if it believes it cannot acquire free pages later.
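A minimal sketch of that check follows; bailing out is just one possible reaction, and the 64MB request size is arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t size = 64 * 1024 * 1024;   /* 64MB request */
    char *buf = malloc(size);

    if (buf == NULL) {                /* the allocator refused the request */
        fprintf(stderr, "allocation of %zu bytes failed, bailing out\n", size);
        return EXIT_FAILURE;          /* or retry later, or free caches first */
    }

    memset(buf, 0, size);             /* actually touch the pages */
    free(buf);
    return EXIT_SUCCESS;
}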
Memory leaks are another source of unnecessary memory consumption. A leaked block is one that the application no longer tracks, but that the kernel does not reclaim because, from the kernel's point of view, the task still has it under control. Valgrind is a nice tool for finding such occurrences in your code without the need to modify it.
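As a trivial illustration of what Valgrind catches, the snippet below leaks a single block; running it under valgrind --leak-check=full (after compiling with -g) reports the 1024 bytes as definitely lost:

#include <stdlib.h>

static void leak(void)
{
    void *p = malloc(1024);   /* allocated here ...                        */
    (void)p;                  /* ... never freed; the pointer is lost when
                                 the function returns                      */
}

int main(void)
{
    leak();
    return 0;   /* Valgrind: "1,024 bytes definitely lost in 1 blocks" */
}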
Always Consult Memory Allocation Statistics
The Linux kernel provides /proc/meminfo as a way to find complete information about memory conditions. This /proc entry is also the information source for utilities such as top, free, and vmstat.
What you need to check is the free and reclaimable memory. The word "free" needs no further explanation, but what does "reclaimable" mean? It refers to the buffers and the page cache--the disk cache. They are reclaimable because, when memory is tight, the Linux kernel can simply flush them back to the disk. These are file-backed pages. I've lightly edited this example of memory statistics:
$ cat /proc/meminfo
MemTotal: 255944 kB
MemFree: 3668 kB
Buffers: 13640 kB
Cached: 171788 kB
SwapCached: 0 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 255944 kB
LowFree: 3668 kB
SwapTotal: 909676 kB
SwapFree: 909676 kB
Based on the above output, the free virtual memory is MemFree + Buffers + Cached + SwapFree = 1098772 kB.
I failed to find any formalized C (glibc) function to report free (including reclaimable) memory space. The closest I found is get_avphys_pages() or sysconf() (with the _SC_AVPHYS_PAGES parameter). They only report the amount of free memory, not the free plus reclaimable amount.
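For reference, here is a short sketch of those two calls; get_avphys_pages() is a glibc extension declared in <sys/sysinfo.h>:

#include <stdio.h>
#include <unistd.h>
#include <sys/sysinfo.h>   /* get_avphys_pages() */

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    long avail     = sysconf(_SC_AVPHYS_PAGES);   /* free pages only */

    printf("free (without reclaimable): %ld kB\n", avail * page_size / 1024);
    printf("same figure via get_avphys_pages(): %ld pages\n", get_avphys_pages());
    return 0;
}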
That means that to get precise information, you must parse /proc/meminfo programmatically and calculate the figure yourself. If you're lazy, take the procps source package as a reference on how to do it. This package contains tools such as ps, top, and free. It is available under the GPL.
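Here is a minimal sketch of such a parser. It sums MemFree, Buffers, Cached, and SwapFree exactly as in the calculation above; a production version would of course need sturdier error handling:

#include <stdio.h>
#include <string.h>

/* Sum MemFree + Buffers + Cached + SwapFree from /proc/meminfo, in kB. */
static long free_plus_reclaimable_kb(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long total = 0;

    if (f == NULL)
        return -1;

    while (fgets(line, sizeof(line), f) != NULL) {
        char key[32];
        long value;

        if (sscanf(line, "%31[^:]: %ld", key, &value) != 2)
            continue;
        if (strcmp(key, "MemFree") == 0 || strcmp(key, "Buffers") == 0 ||
            strcmp(key, "Cached") == 0 || strcmp(key, "SwapFree") == 0)
            total += value;
    }
    fclose(f);
    return total;
}

int main(void)
{
    printf("free + reclaimable: %ld kB\n", free_plus_reclaimable_kb());
    return 0;
}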
Experiments with Alternative Memory Allocators
Different allocators offer different ways to manage memory chunks and to shrink, expand, and create virtual memory areas. Hoard is one example. Emery Berger from the University of Massachusetts wrote it as a high-performance memory allocator. Hoard seems to work best for multi-threaded applications; it introduces the concept of per-CPU heaps.
Use 64-bit Platforms
Users who need larger user address spaces can consider using 64-bit platforms. The Linux kernel no longer uses the 3:1 VM split for these machines. In other words, user space becomes quite large. It can be a good match for machines with more than 4GB of RAM.
This has no connection to extended addressing schemes, such as Intel's Physical Address Extension (PAE), which allows a 32-bit Intel processor to address up to 64GB of RAM. That addressing deals with physical addresses, while in the virtual address context user space itself is still 3GB (assuming the 3:1 VM split). The extra memory is reachable, but not all of it is mappable into the address space, and unmappable portions of RAM are unusable.
Consider Packed Types on Structures
Packed attributes can help to squeeze the size of structs, enums, and unions. This is a way to save more bytes, especially for arrays of structs. Here is a declaration example:
struct test
{
        char a;
        long b;
} __attribute__ ((packed));
The drawback is that this leaves certain fields unaligned, so accessing them costs more CPU cycles. "Aligned" here means that the variable's address is a multiple of its data type's natural size. The net result is that, depending on how frequently the data is accessed, runtime may become somewhat slower. Also take page alignment and caching behavior into account.
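A quick way to see the saving is to compare sizeof with and without the attribute. On a typical 32-bit x86 build, the plain struct below occupies 8 bytes and the packed one 5; the exact figures depend on the architecture and compiler (gcc syntax assumed):

#include <stdio.h>

struct plain  { char a; long b; };                          /* padded for alignment */
struct packed { char a; long b; } __attribute__((packed));  /* no padding           */

int main(void)
{
    printf("plain : %zu bytes\n", sizeof(struct plain));
    printf("packed: %zu bytes\n", sizeof(struct packed));
    return 0;
}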
Use ulimit() for User Processes
With ulimit -v, you can limit the address space a process can allocate with mmap(). When you reach the limit, all mmap() calls, and hence malloc() calls, will fail, and the kernel's OOM killer will never start. This is most useful in a multi-user environment where you cannot trust all of the users and want to avoid killing random processes.
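The shell-level ulimit -v corresponds to the RLIMIT_AS resource limit, so a process can also impose the cap on itself. A minimal sketch (the 512MB figure is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl = { 512UL * 1024 * 1024,     /* soft limit: 512MB of address space */
                         512UL * 1024 * 1024 };   /* hard limit                         */

    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return EXIT_FAILURE;
    }

    /* From here on, allocations that would push the address space past
     * 512MB simply fail instead of risking an OOM kill later. */
    void *p = malloc(1024UL * 1024 * 1024);       /* 1GB: expected to fail */
    printf("1GB malloc %s\n", p ? "succeeded" : "failed as expected");
    free(p);
    return EXIT_SUCCESS;
}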
Acknowledgement
The author gives credit to several people for their assistance and help: Peter Ziljtra, Wolfram Gloger, and Rene Hermant. Mr. Gloger also contributed the ulimit() technique.
References
- "Dynamic Storage Allocation: A Survey and Critical Review," by Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Proceeding 1995 International Workshop of Memory Management.
- Hoard: A Scalable Memory Allocator for Multithreaded Applications, by Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson
- "Once upon a
free()
" by Anonymous, Phrack Volume 0x0b, Issue 0x39, Phile #0x09 of 0x12.
- "Vudo: An Object Superstitiously Believed to Embody Magical Powers," by Michel "MaXX" Kaempf. Phrack Volume 0x0b, Issue 0x39, Phile #0x08 of 0x12.
- "Policy-Based Memory Allocation," by Andrei Alexandrescu and Emery Berger. C/C++ Users Journal.
- "Security of memory allocators for C and C++," by Yves Younan, Wouter Joosen, Frank Piessens, and Hans Van den Eynden. Report CW419, July 2005
- Lecture notes (CS360) about malloc(), by Jim Plank, Dept. of Computer Science, University of Tennessee.
- "Inside Memory Management: The Choices, Tradeoffs, and Implementations of Dynamic Allocation," by Jonathan Bartlett
- "The Malloc Maleficarum," by Phantasmal Phantasmagoria
- Understanding The Linux Kernel, 3rd edition, by Daniel P. Bovet and Marco Cesati. O'Reilly Media, Inc.
Mulyadi Santosa is a freelance writer who lives in Indonesia.