The Lost Art of C Structure Packing

Original: http://www.catb.org/esr/structure-packing/

Eric S. Raymond

Table of Contents

1. Who should read this
2. Why I wrote it
3. Alignment requirements
4. Padding
5. Structure alignment and padding
6. Structure reordering
7. Awkward scalar cases
8. Readability and cache locality
9. Other packing techniques
10. Overriding alignment rules
11. Tools
12. Proof and exceptional cases
13. Related Reading
14. Version history

1. Who should read this

This page is about a technique for reducing the memory footprint of C programs - manually repacking C structure declarations for reduced size. To read it, you will require basic knowledge of the C programming language.

You need to know this technique if you intend to write code for memory-constrained embedded systems, or operating-system kernels. It is useful if you are working with application data sets so large that your programs routinely hit memory limits. It is good to know in any application where you really, really care about minimizing cache-line misses.

Finally, knowing this technique is a gateway to other esoteric C topics. You are not an advanced C programmer until you have grasped it. You are not a master of C until you could have written this document yourself and can criticize it intelligently.

2. Why I wrote it

This webpage exists because in late 2013 I found myself heavily applying a C optimization technique that I had learned more than two decades previously and not used much since.

I needed to reduce the memory footprint of a program that used thousands - sometimes hundreds of thousands - of C struct instances. The program was cvs-fast-export and the problem was that it was dying with out-of-memory errors on large repositories.

There are ways to reduce memory usage significantly in situations like this, by rearranging the order of structure members in careful ways. This can lead to dramatic gains - in my case I was able to cut the working-set size by around 40%, enabling the program to handle much larger repositories without dying.

But as I worked, and thought about what I was doing, it began to dawn on me that the technique I was using has been more than half forgotten in these latter days. A little web research confirmed that C programmers don’t seem to talk about it much any more, at least not where a search engine can see them. A couple of Wikipedia entries touch the topic, but I found nobody who covered it comprehensively.

There are actually reasons for this that aren’t stupid. CS courses (rightly) steer people away from micro-optimization towards finding better algorithms. The plunging price of machine resources has made squeezing memory usage less necessary. And the way hackers used to learn how to do it back in the day was by bumping their noses on strange hardware architectures - a less common experience now.

But the technique still has value in important situations, and will as long as memory is finite. This document is intended to save C programmers from having to rediscover the technique, so they can concentrate effort on more important things.

3. Alignment requirements

The first thing to understand is that, on modern processors, the way your C compiler lays out basic C datatypes in memory is constrained in order to make memory accesses faster.

Storage for the basic C datatypes on an x86 or ARM processor doesn’t normally start at arbitrary byte addresses in memory. Rather, each type except char has an alignment requirement; chars can start on any byte address, but 2-byte shorts must start on an even address, 4-byte ints or floats must start on an address divisible by 4, and 8-byte longs or doubles must start on an address divisible by 8. Signed or unsigned makes no difference.

The jargon for this is that basic C types on x86 and ARM are self-aligned. Pointers, whether 32-bit (4-byte) or 64-bit (8-byte), are self-aligned too.

Self-alignment makes access faster because it facilitates generating single-instruction fetches and puts of the typed data. Without alignment constraints, on the other hand, the code might end up having to do two or more accesses spanning machine-word boundaries. Characters are a special case; they’re equally expensive from anywhere they live inside a single machine word. That’s why they don’t have a preferred alignment.

I said "on modern processors" because on some older ones forcing your C program to violate alignment rules (say, by casting an odd address into an int pointer and trying to use it) didn’t just slow your code down, it caused an illegal instruction fault. This was the behavior, for example, on Sun SPARC chips. In fact, with sufficient determination and the right (e18) hardware flag set on the processor, you can still trigger this on x86.

Also, self-alignment is not the only possible rule. Historically, some processors (especially those lacking barrel shifters) have had more restrictive ones. If you do embedded systems, you might trip over one of these lurking in the underbrush. Be aware this is possible.
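The self-alignment rule can be checked on your own platform. The following is a minimal sketch, assuming a C11 compiler that provides alignof in <stdalign.h>; the helper name print_alignments is hypothetical and not part of the original essay. What it prints depends on the compiler and target.

```c
#include <stdalign.h>   /* alignof (C11) */
#include <stdio.h>

/* Hypothetical helper: report size and alignment of the basic types.
 * On a platform with self-aligned types the two numbers match. */
void print_alignments(void)
{
    printf("char:   size %zu, alignment %zu\n", sizeof(char), alignof(char));
    printf("short:  size %zu, alignment %zu\n", sizeof(short), alignof(short));
    printf("int:    size %zu, alignment %zu\n", sizeof(int), alignof(int));
    printf("long:   size %zu, alignment %zu\n", sizeof(long), alignof(long));
    printf("double: size %zu, alignment %zu\n", sizeof(double), alignof(double));
    printf("char *: size %zu, alignment %zu\n", sizeof(char *), alignof(char *));
}
```

Call print_alignments() from main() and compare the two columns; any type whose alignment is smaller than its size is an exception to self-alignment on that platform.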

4. Padding

Now we’ll look at a simple example of variable layout in memory. Consider the following series of variable declarations in the top level of a C module:

char *p;
char c;
int x;

If you didn’t know anything about data alignment, you might assume that these three variables would occupy a continuous span of bytes in memory. That is, on a 32-bit machine 4 bytes of pointer would be immediately followed by 1 byte of char and that immediately followed by 4 bytes of int. And a 64-bit machine would be different only in that the pointer would be 8 bytes.

Here’s what actually happens (on an x86 or ARM or anything else with self-aligned types). The storage for p starts on a self-aligned 4- or 8-byte boundary depending on the machine word size. This is pointer alignment - the strictest possible.

The storage for c follows immediately. But the 4-byte alignment requirement of x forces a gap in the layout; it comes out as though there were a fourth intervening variable, like this:

char *p;      /* 4 or 8 bytes */
char c;       /* 1 byte */
char pad[3];  /* 3 bytes */
int x;        /* 4 bytes */

The pad[3] character array represents the fact that there are three bytes of waste space in the structure. The old-school term for this was "slop".

Compare what happens if x is a 2-byte short:

char *p;
char c;
short x;

In that case, the actual layout will be this:

char *p;      /* 4 or 8 bytes */
char c;       /* 1 byte */
char pad[1];  /* 1 byte */
short x;      /* 2 bytes */

On the other hand, if x is a long on a 64-bit machine

char *p;
char c;
long x;

we end up with this:

char *p;     /* 8 bytes */
char c;      /* 1 byte */
char pad[7]; /* 7 bytes */
long x;      /* 8 bytes */

If you have been following carefully, you are probably now wondering about the case where the shorter variable declaration comes first:

char c;
char *p;
int x;

If the actual memory layout were written like this

char c;
char pad1[M];
char *p;
char pad2[N];
int x;

what can we say about M and N?

First, in this case N will be zero. The address of x, coming right after p, is guaranteed to be pointer-aligned, which is never less strict than int-aligned.

The value of M is less predictable. If the compiler happened to map c to the last byte of a machine word, the next byte (the first of p) would be the first byte of the next one and properly pointer-aligned. M would be zero.

It is more likely that c will be mapped to the first byte of a machine word. In that case M will be whatever padding is needed to ensure that p has pointer alignment - 3 on a 32-bit machine, 7 on a 64-bit machine.

Intermediate cases are possible. M can be anything from 0 to 7 (0 to 3 on 32-bit) because a char can start on any byte boundary in a machine word.

If you wanted to make those variables take up less space, you could get that effect by swapping x with c in the original sequence.

char *p;     /* 8 bytes */
long x;      /* 8 bytes */
char c;      /* 1 byte */

Usually, for the small number of scalar variables in your C programs, bumming out the few bytes you can get by changing the order of declaration won’t save you enough to be significant. The technique becomes more interesting when applied to nonscalar variables - especially structs.

Before we get to those, let’s dispose of arrays of scalars. On a platform with self-aligned types, arrays of char/short/int/long/pointer have no internal padding; each member is automatically self-aligned at the end of the previous one.

In the next section we will see that the same is not necessarily true of structure arrays.

5. Structure alignment and padding

In general, a struct instance will have the alignment of its widest scalar member. Compilers do this as the easiest way to ensure that all the members are self-aligned for fast access.

Also, in C the address of a struct is the same as the address of its first member - there is no leading padding. Beware: in C++, classes that look like structs may break this rule! (Whether they do or not depends on how base classes and virtual member functions are implemented, and varies by compiler.)

(When you’re in doubt about this sort of thing, ANSI C provides an offsetof() macro which can be used to read out structure member offsets.)
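As an illustration of offsetof(), here is a small sketch. The structure offset_demo and the helper print_offsets are hypothetical names introduced only for this example; the numbers printed depend on your platform.

```c
#include <stddef.h>   /* offsetof, size_t */
#include <stdio.h>

/* Hypothetical structure, used only to demonstrate offsetof(). */
struct offset_demo {
    char *p;
    char c;
    long x;
};

/* In C the first member always sits at offset 0: no leading padding. */
_Static_assert(offsetof(struct offset_demo, p) == 0, "no leading padding");

/* Print where each member lives and how big the whole struct is. */
void print_offsets(void)
{
    printf("p at %zu, c at %zu, x at %zu, total size %zu\n",
           offsetof(struct offset_demo, p),
           offsetof(struct offset_demo, c),
           offsetof(struct offset_demo, x),
           sizeof(struct offset_demo));
}
```

Any gap between one member's end (offset plus size) and the next member's offset is padding.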

Consider this struct:

struct foo1 {
    char *p;
    char c;
    long x;
};

Assuming a 64-bit machine, any instance of struct foo1 will have 8-byte alignment. The memory layout of one of these looks unsurprising, like this:

struct foo1 {
    char *p;     /* 8 bytes */
    char c;      /* 1 byte */
    char pad[7]; /* 7 bytes */
    long x;      /* 8 bytes */
};

It’s laid out exactly as though variables of these types had been separately declared. But if we put c first, that’s no longer true.

struct foo2 {
    char c;      /* 1 byte */
    char pad[7]; /* 7 bytes */
    char *p;     /* 8 bytes */
    long x;      /* 8 bytes */
};

If the members were separate variables, c could start at any byte boundary and the size of pad might vary. Because struct foo2 has the pointer alignment of its widest member, that’s no longer possible. Now c has to be pointer-aligned, and following padding of 7 bytes is locked in.

Now let’s talk about trailing padding on structures. To explain this, I need to introduce a basic concept which I’ll call the stride address of a structure. It is the first address following the structure data that has the same alignment as the structure.

The general rule of trailing structure padding is this: the compiler will behave as though the structure has trailing padding out to its stride address. This rule controls what sizeof() will return.

Consider this example on a 64-bit x86 or ARM machine:

struct foo3 {
    char *p;     /* 8 bytes */
    char c;      /* 1 byte */
};

struct foo3 singleton;
struct foo3 quad[4];

You might think that sizeof(struct foo3) should be 9, but it’s actually 16. The stride address is that of (&p)[2]. Thus, in the quad array, each member has 7 bytes of trailing padding, because the first member of each following struct wants to be self-aligned on an 8-byte boundary. The memory layout is as though the structure had been declared like this:

struct foo3 {
    char *p;     /* 8 bytes */
    char c;      /* 1 byte */
    char pad[7];
};

For contrast, consider the following example:

struct foo4 {
    short s;     /* 2 bytes */
    char c;      /* 1 byte */
};

Because s only needs to be 2-byte aligned, the stride address is just one byte after c, and struct foo4 as a whole only needs one byte of trailing padding. It will be laid out like this:

struct foo4 {
    short s;     /* 2 bytes */
    char c;      /* 1 byte */
    char pad[1];
};

and sizeof(struct foo4) will return 4.
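The trailing-padding rule can be expressed as compile-time arithmetic. This is a sketch, assuming a C11 compiler; the macro names FOO3_TRAILING and FOO4_TRAILING are hypothetical. The two portable properties are asserted in the code; the specific values 7 and 1 hold only on a 64-bit platform with self-aligned types.

```c
#include <stddef.h>   /* offsetof */

struct foo3 {
    char *p;
    char c;
};

struct foo4 {
    short s;
    char c;
};

/* sizeof() includes trailing padding, so it equals the array stride. */
_Static_assert(sizeof(struct foo3[4]) == 4 * sizeof(struct foo3), "stride");

/* The padded size is always a multiple of the structure's alignment. */
_Static_assert(sizeof(struct foo3) % _Alignof(struct foo3) == 0, "foo3");
_Static_assert(sizeof(struct foo4) % _Alignof(struct foo4) == 0, "foo4");

/* Trailing padding: bytes between the end of the last member
 * and the stride address. */
#define FOO3_TRAILING \
    (sizeof(struct foo3) - (offsetof(struct foo3, c) + sizeof(char)))
#define FOO4_TRAILING \
    (sizeof(struct foo4) - (offsetof(struct foo4, c) + sizeof(char)))
```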

Now let’s consider bitfields. What they give you the ability to do is declare structure fields of smaller than character width, down to 1 bit, like this:

struct foo5 {
    short s;
    char c;
    int flip:1;
    int nybble:4;
    int septet:7;
};

The thing to know about bitfields is that they are implemented with word- and byte-level mask and rotate instructions operating on machine words, and cannot cross word boundaries.

From the compiler’s point of view, the bitfields in struct foo5 look like a two-byte, 16-bit character array with only 12 bits in use. That in turn is followed by padding to make the byte length of the structure a multiple of sizeof(short), the size of the longest member.

struct foo5 {
    short s;       /* 2 bytes */
    char c;        /* 1 byte */
    int flip:1;    /* total 1 bit */
    int nybble:4;  /* total 5 bits */
    int septet:7;  /* total 12 bits */
    int pad1:4;    /* total 16 bits = 2 bytes */
    char pad2;     /* 1 byte */
};

The restriction that bitfields cannot cross machine word boundaries means that, while the first two of the following structures pack into one and two 32-bit words as you’d expect, the third (struct foo8) takes up three 32-bit words, in the last of which only one bit is used.

On the other hand, struct foo8 would fit into a single 64-bit word.

struct foo6 {
    int bigfield:31;      /* 32-bit word 1 begins */
    int littlefield:1;
};

struct foo7 {
    int bigfield1:31;     /* 32-bit word 1 begins */
    int littlefield1:1;
    int bigfield2:31;     /* 32-bit word 2 begins */
    int littlefield2:1;
};

struct foo8 {
    int bigfield1:31;     /* 32-bit word 1 begins */
    int bigfield2:31;     /* 32-bit word 2 begins */
    int littlefield1:1;
    int littlefield2:1;   /* 32-bit word 3 begins */
};
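The word counts above can be checked with sizeof(). This is only a sketch: bitfield layout is implementation-defined, and the helper name print_bitfield_sizes is hypothetical. With GCC or clang and a 32-bit int, the three structures are expected to come out at one, two and three words, but verify on your own compiler.

```c
#include <stdio.h>

struct foo6 {
    int bigfield:31;
    int littlefield:1;
};

struct foo7 {
    int bigfield1:31;
    int littlefield1:1;
    int bigfield2:31;
    int littlefield2:1;
};

struct foo8 {
    int bigfield1:31;
    int bigfield2:31;     /* does not fit in the 1 bit left in word 1 */
    int littlefield1:1;
    int littlefield2:1;   /* word 2 is full, so this starts word 3 */
};

/* Hypothetical helper: report each size in units of int. */
void print_bitfield_sizes(void)
{
    printf("foo6: %zu words\n", sizeof(struct foo6) / sizeof(int));
    printf("foo7: %zu words\n", sizeof(struct foo7) / sizeof(int));
    printf("foo8: %zu words\n", sizeof(struct foo8) / sizeof(int));
}
```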

Here’s a last important detail: If your structure has structure members, the inner structs want to have the alignment of longest scalar too. Suppose you write this:

struct foo9 {
    char c;
    struct foo9_inner {
        char *p;
        short x;
    } inner;
};

The char *p member in the inner struct forces the outer struct to be pointer-aligned as well as the inner. Actual layout will be like this on a 64-bit machine:

struct foo9 {
    char c;           /* 1 byte*/
    char pad1[7];     /* 7 bytes */
    struct foo9_inner {
        char *p;      /* 8 bytes */
        short x;      /* 2 bytes */
        char pad2[6]; /* 6 bytes */
    } inner;
};

This structure gives us a hint of the savings that might be possible from repacking structures. Of 24 bytes, 13 of them are padding. That’s more than 50% waste space!
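The padding count can be derived rather than counted by hand: subtract the bytes that hold data from sizeof(). A sketch, with hypothetical macro names FOO9_PAYLOAD and FOO9_PADDING; the figure of 13 assumes a 64-bit platform with 8-byte pointers.

```c
#include <stddef.h>   /* offsetof */

struct foo9 {
    char c;
    struct foo9_inner {
        char *p;
        short x;
    } inner;
};

/* Payload: the bytes that actually hold data (c, p and x). */
#define FOO9_PAYLOAD (sizeof(char) + sizeof(char *) + sizeof(short))

/* Padding: everything else the compiler added. */
#define FOO9_PADDING (sizeof(struct foo9) - FOO9_PAYLOAD)

/* The inner struct starts on a pointer-aligned boundary. */
_Static_assert(offsetof(struct foo9, inner) % _Alignof(char *) == 0,
               "inner struct is pointer-aligned");
```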

6. Structure reordering

Now that you know how and why compilers insert padding in and after your structures we’ll examine what you can do to squeeze out the slop. This is the art of structure packing.

The first thing to notice is that slop only happens in two places. One is where storage bound to a larger data type (with stricter alignment requirements) follows storage bound to a smaller one. The other is where a struct naturally ends before its stride address, requiring padding so the next one will be properly aligned.

The simplest way to eliminate slop is to reorder the structure members by decreasing alignment. That is: make all the pointer-aligned subfields come first, because on a 64-bit machine they will be 8 bytes. Then the 4-byte ints; then the 2-byte shorts; then the character fields.

So, for example, consider this simple linked-list structure:

struct foo10 {
    char c;
    struct foo10 *p;
    short x;
};

With the implied slop made explicit, here it is:

struct foo10 {
    char c;          /* 1 byte */
    char pad1[7];    /* 7 bytes */
    struct foo10 *p; /* 8 bytes */
    short x;         /* 2 bytes */
    char pad2[6];    /* 6 bytes */
};

That’s 24 bytes. If we reorder by size, we get this:

struct foo11 {
    struct foo11 *p;
    short x;
    char c;
};

Considering self-alignment, we see that none of the data fields need padding. This is because the stride address for a (longer) field with stricter alignment is always a validly-aligned start address for a (shorter) field with less strict requirements. All the repacked struct actually requires is trailing padding:

struct foo11 {
    struct foo11 *p; /* 8 bytes */
    short x;         /* 2 bytes */
    char c;          /* 1 byte */
    char pad[5];     /* 5 bytes */
};

Our repack transformation drops the size from 24 to 16 bytes. This might not seem like a lot, but suppose you have a linked list of 200K of these? The savings add up fast - especially on memory-constrained embedded systems or in the core part of an OS kernel that has to stay resident.
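A repack like this is worth pinning down so that a later edit cannot silently undo it. The sketch below compares the two layouts at compile time; the macro SAVED_PER_NODE is a hypothetical name, and the 24 and 16 byte sizes assume a 64-bit platform.

```c
#include <stddef.h>

struct foo10 {             /* original member order */
    char c;
    struct foo10 *p;
    short x;
};

struct foo11 {             /* reordered by decreasing alignment */
    struct foo11 *p;
    short x;
    char c;
};

/* Reordering by decreasing alignment should never grow the struct. */
_Static_assert(sizeof(struct foo11) <= sizeof(struct foo10),
               "repacked struct must not be larger");

/* Bytes saved per list node by the repack. */
#define SAVED_PER_NODE (sizeof(struct foo10) - sizeof(struct foo11))
```

On a 64-bit machine SAVED_PER_NODE comes to 8 bytes, so a list of 200K nodes shrinks by roughly 1.6 MB.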

Note that reordering is not guaranteed to produce savings. Applying this technique to an earlier example, struct foo9, we get this:

struct foo12 {
    struct foo12_inner {
        char *p;      /* 8 bytes */
        int x;        /* 4 bytes */
    } inner;
    char c;           /* 1 byte*/
};

With padding written out, this is

struct foo12 {
    struct foo12_inner {
        char *p;      /* 8 bytes */
        int x;        /* 4 bytes */
        char pad[4];  /* 4 bytes */
    } inner;
    char c;           /* 1 byte*/
    char pad[7];      /* 7 bytes */
};

It’s still 24 bytes because c cannot back into the inner struct’s trailing padding. To collect that gain you would need to redesign your data structures.

Since shipping the first version of this guide I have been asked why, if reordering for minimal slop is so simple, C compilers don’t do it automatically. The answer: C is a language originally designed for writing operating systems and other code close to the hardware. Automatic reordering would interfere with a systems programmer’s ability to lay out structures that exactly match the byte and bit-level layout of memory-mapped device control blocks.

7. Awkward scalar cases

Using enumerated types instead of #defines is a good idea, if only because symbolic debuggers have those symbols available and can show them rather than raw integers. But, while enums are guaranteed to be compatible with an integral type, the C standard does not specify which underlying integral type is to be used for them.

Be aware when repacking your structs that while enumerated-type variables are usually ints, this is compiler-dependent; they could be shorts, longs, or even chars by default. Your compiler may have a pragma or command-line option to force the size.

The long double type is a similar trouble spot. Some C platforms implement this in 80 bits, some in 128, and some of the 80-bit platforms pad it to 96 or 128 bits.

In both cases it’s best to use sizeof() to check the storage size.

Finally, under x86 Linux doubles are sometimes an exception to the self-alignment rule; an 8-byte double may require only 4-byte alignment within a struct even though standalone double variables have 8-byte self-alignment. This depends on compiler and options.
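A quick way to settle these cases is to ask the compiler. The following sketch prints the relevant sizes and offsets; the enum color, the struct holds_double and the helper print_awkward_scalars are hypothetical names used only for illustration, and the output varies by compiler, options and target.

```c
#include <stdalign.h>   /* alignof (C11) */
#include <stddef.h>     /* offsetof */
#include <stdio.h>

enum color { RED, GREEN, BLUE };   /* hypothetical enum */

struct holds_double {              /* hypothetical struct */
    char c;
    double d;
};

/* Report the sizes that the text warns are compiler-dependent. */
void print_awkward_scalars(void)
{
    printf("enum color:  size %zu\n", sizeof(enum color));
    printf("long double: size %zu, alignment %zu\n",
           sizeof(long double), alignof(long double));
    /* 8 where double is self-aligned in structs, 4 where it is not. */
    printf("double inside a struct starts at offset %zu\n",
           offsetof(struct holds_double, d));
}
```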

8. Readability and cache locality

While reordering by size is the simplest way to eliminate slop, it’s not necessarily the right thing. There are two more issues: readability and cache locality.

Programs are not just communications to a computer, they are communications to other human beings. Code readability is important even (or especially!) when the audience of the communication is only your future self.

A clumsy, mechanical reordering of your structure can harm readability. When possible, it is better to reorder fields so they remain in coherent groups with semantically related pieces of data kept close together. Ideally, the design of your structure should communicate the design of your program.

When your program frequently accesses a structure, or parts of a structure, it is helpful for performance if the accesses tend to fit within a cache line - the memory block fetched by your processor when it is told to get any single address within the block. On 64-bit x86 a cache line is 64 bytes beginning on a self-aligned address; on other platforms it is often 32 bytes.

The things you should do to preserve readability - grouping related and co-accessed data in adjacent fields - also improve cache-line locality. These are both reasons to reorder intelligently, with awareness of your code’s data-access patterns.

If your code does concurrent access to a structure from multiple threads, there’s a third issue: cache line bouncing. To minimize expensive bus traffic, you should arrange your data so that reads come from one cache line and writes go to another in your tighter loops.

And yes, this sometimes contradicts the previous guidance about grouping related data in the same cache-line-sized block. Multithreading is hard. Cache-line bouncing and other multithread optimization issues are very advanced topics which deserve an entire tutorial of their own. The best I can do here is make you aware that these issues exist.
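One common way to separate read-mostly fields from a write-heavy one is to force the latter onto its own cache line with C11 alignas. This is a sketch under stated assumptions, not a recipe from the original essay: the 64-byte line size, the CACHE_LINE macro, and the struct shared_state with its members are all hypothetical.

```c
#include <stdalign.h>   /* alignas (C11) */
#include <stddef.h>     /* offsetof */

#define CACHE_LINE 64   /* assumption: 64-byte lines, as on 64-bit x86 */

struct shared_state {
    /* Read-mostly fields, grouped together on the first cache line. */
    const char *name;
    unsigned limit;

    /* Write-heavy counter, pushed onto a cache line of its own so that
     * updating it does not invalidate the line readers are using. */
    alignas(CACHE_LINE) unsigned long hits;
};

_Static_assert(offsetof(struct shared_state, hits) % CACHE_LINE == 0,
               "hits starts on a cache-line boundary");
```

The cost is deliberate slop: the structure grows to two full cache lines, which is the opposite of what the rest of this document is trying to achieve. It is a trade worth making only where measurements show bouncing.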

9. Other packing techniques

Reordering works best when combined with other techniques for slimming your structures. If you have several boolean flags in a struct, for example, consider reducing them to 1-bit bitfields and packing them into a place in the structure that would otherwise be slop.

You’ll take a small access-time penalty for this - but if it squeezes the working set enough smaller, that penalty will be swamped by your gains from avoided cache misses.
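Here is a sketch of the flag-packing idea. The structures item_bools and item_bits and their members are hypothetical. Bitfield layout is implementation-defined, so the sizes quoted afterwards are what GCC and clang are expected to produce on a 64-bit platform, not a guarantee.

```c
#include <stdbool.h>

/* Seven flags stored as one byte each: they overflow the slop
 * after id and push the struct to the next 8-byte multiple. */
struct item_bools {
    char *name;
    short id;
    bool dirty, locked, hidden, pinned, shared, stale, cached;
};

/* The same flags as 1-bit fields: all seven fit in the padding
 * that would follow id anyway. */
struct item_bits {
    char *name;
    short id;
    unsigned dirty:1, locked:1, hidden:1, pinned:1,
             shared:1, stale:1, cached:1;
};

_Static_assert(sizeof(struct item_bits) <= sizeof(struct item_bools),
               "bitfields must not make the struct larger");
```

Under those assumptions item_bools takes 24 bytes and item_bits takes 16.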

More generally, look for ways to shorten data field sizes. In cvs-fast-export, for example, one squeeze I applied was to use the knowledge that RCS and CVS repositories didn’t exist before 1982. I dropped a 64-bit Unix time_t (zero date at the beginning of 1970) for a 32-bit time offset from 1982-01-01T00:00:00; this will cover dates to 2118. (Note: if you pull a trick like this, do a bounds check whenever you set the field to prevent nasty bugs!)
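The bounds check might look like the following. This is a sketch and not the code from cvs-fast-export: the names EPOCH_1982, encode_date, decode_date and encode_date_or_die are hypothetical, and the offset is assumed to be an unsigned 32-bit count of seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 1982-01-01T00:00:00Z as Unix time: 4383 days * 86400 seconds. */
#define EPOCH_1982 INT64_C(378691200)

/* Store a Unix timestamp as a 32-bit offset from 1982.
 * Returns false, leaving *out untouched, if it does not fit. */
bool encode_date(int64_t unix_time, uint32_t *out)
{
    if (unix_time < EPOCH_1982 ||
        unix_time - EPOCH_1982 > (int64_t)UINT32_MAX)
        return false;   /* refuse rather than silently wrap around */
    *out = (uint32_t)(unix_time - EPOCH_1982);
    return true;
}

/* Recover the full 64-bit Unix timestamp. */
int64_t decode_date(uint32_t offset)
{
    return EPOCH_1982 + (int64_t)offset;
}

/* Variant for callers that treat an out-of-range date as a bug.
 * Note that the check disappears when NDEBUG is defined. */
uint32_t encode_date_or_die(int64_t unix_time)
{
    uint32_t out = 0;
    bool ok = encode_date(unix_time, &out);
    assert(ok);
    (void)ok;
    return out;
}
```

Routing every store through one function like encode_date keeps the range check in a single place.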

Each such field shortening not only decreases the explicit size of your structure, it may remove slop and/or create additional opportunities for gains from field reordering. Virtuous cascades of such effects are not very hard to trigger.

The riskiest form of packing is to use unions. If you know that certain fields in your structure are never used in combination with certain other fields, consider using a union to make them share storage. But be extra careful and verify your work with regression testing, because if your lifetime analysis is even slightly wrong you will get bugs ranging from crashes to (much worse) subtle data corruption.
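One way to reduce the risk is to keep a tag beside the union and check it on every access. The sketch below assumes two fields that are never live at the same time; the names value_kind, value_separate, value_shared and value_get_number are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum value_kind { KIND_NUMBER, KIND_TEXT };

/* Both fields always present, although only one is ever in use. */
struct value_separate {
    enum value_kind kind;
    double number;
    const char *text;
};

/* The two mutually exclusive fields share storage. */
struct value_shared {
    enum value_kind kind;
    union {
        double number;
        const char *text;
    } u;
};

_Static_assert(sizeof(struct value_shared) < sizeof(struct value_separate),
               "sharing storage should shrink the struct");

/* Read the number only if the tag says it is the live member. */
bool value_get_number(const struct value_shared *v, double *out)
{
    if (v == NULL || v->kind != KIND_NUMBER)
        return false;
    *out = v->u.number;
    return true;
}
```

On a 64-bit platform this takes the structure from 24 bytes to 16. The tag check does not make a wrong lifetime analysis right, but it turns silent corruption into a visible failure.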

10. Overriding alignment rules

Sometimes you can coerce your compiler into not using the processor’s normal alignment rules by using a pragma, usually #pragma pack. GCC and clang have an __attribute__((packed)) you can attach to individual structure declarations; GCC has an -fpack-struct option for entire compilations.

Do not do this casually, as it forces the generation of more expensive and slower code. Usually you can save as much memory, or almost as much, with the techniques I describe here.

The only good reason for #pragma pack is if you have to exactly match your C data layout to some kind of bit-level hardware or protocol requirement, like a memory-mapped hardware port, and violating normal alignment is required for that to work. If you’re in that situation, and you don’t already know everything else I’m writing about here, you’re in deep trouble and I wish you luck.
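For reference, this is what the GCC and clang attribute looks like in use. It is a sketch that will not compile on compilers lacking __attribute__((packed)); the structures natural_record and wire_record and the helper wire_value are hypothetical.

```c
#include <stdint.h>

/* Natural layout: three bytes of padding after tag. */
struct natural_record {
    char tag;
    uint32_t value;
};

/* Packed layout (GCC/clang extension): no padding at all,
 * so value sits at the odd offset 1. */
struct __attribute__((packed)) wire_record {
    char tag;
    uint32_t value;
};

_Static_assert(sizeof(struct wire_record) == sizeof(char) + sizeof(uint32_t),
               "packed struct has no padding");

/* Direct member access is fine: the compiler knows value may be
 * misaligned and emits suitable code. Copying &w->value into a
 * plain uint32_t pointer and dereferencing that is not. */
uint32_t wire_value(const struct wire_record *w)
{
    return w->value;
}
```

The extra code the compiler emits for w->value on strict-alignment targets is the "more expensive and slower code" mentioned above.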

11. Tools

The clang compiler has a -Wpadded option that causes it to generate messages about alignment holes and padding. Some versions also have an undocumented -fdump-record-layouts option that yields more information.

I have not used it myself, but several respondents speak well of a program called pahole. This tool cooperates with a compiler to produce reports on your structures that describe padding, alignment, and cache line boundaries.

I’ve received a report that a proprietary code auditing tool called "PVS Studio" can detect structure-packing opportunities.

12. Proof and exceptional cases

You can download source code for a little program that demonstrates the assertions about scalar and structure sizes made above. It is packtest.c.

If you look through enough strange combinations of compilers, options, and unusual hardware, you will find exceptions to some of the rules I have described. They get more common as you go back in time to older processor designs.

The next level beyond knowing these rules is knowing how and when to expect that they will be broken. In the years when I learned them (the early 1980s) we spoke of people who didn’t get this as victims of "all-the-world’s-a-VAX syndrome". Remember that not all the world is a PC.

13. Related Reading

This section exists to collect pointers to essays on other advanced C topics which I judge to be good companions to this one.

A Guide to Undefined Behavior in C and C++

Time, Clock, and Calendar Programming In C

14. Version history

1.11 @ 2015-07-23
Mention the clang -fdump-record-layouts option.
1.10 @ 2015-02-20
Mention attribute packed, -fpack-struct, and PVS Studio.
1.9 @ 2014-10-01
Added link to "Time, Clock, and Calendar Programming In C".
1.8 @ 2014-05-20
Improved explanation for the bitfield examples.
1.7 @ 2014-05-17
Correct a minor error in the description of the layout of struct foo8.
1.6 @ 2014-05-14
Emphasize that bitfields cannot cross word boundaries. Idea from Dale Gulledge.
1.5 @ 2014-01-13
Explain why structure member reordering is not done automatically.
1.4 @ 2014-01-04
A note about double under x86 Linux.
1.3 @ 2014-01-03
New sections on awkward scalar cases, readability and cache locality, and tools.
1.2 @ 2014-01-02
Correct an erroneous address calculation.
1.1 @ 2014-01-01
Explain why aligned accesses are faster. Mention offsetof. Various minor fixes, including the packtest.c download link.
1.0 @ 2014-01-01
Initial release.


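===================Reading notes===================

http://blog.jobbole.com/57822/

The way a C compiler lays out basic C datatypes in memory is constrained in order to make memory accesses faster.

On an x86 or ARM processor, storage for the basic C datatypes doesn’t normally start at arbitrary byte addresses in memory. Rather, each type except char has an alignment requirement; chars can start on any byte address, but 2-byte shorts must start on an even address, 4-byte ints or floats must start on an address divisible by 4, and 8-byte longs or doubles must start on an address divisible by 8. Signed or unsigned makes no difference.

The jargon for this is that basic C types on x86 and ARM are self-aligned. Pointers, whether 32-bit (4-byte) or 64-bit (8-byte), are self-aligned too.

Self-alignment makes access faster because it lets a single instruction fetch or store the typed data. Without alignment constraints, on the other hand, the code might end up having to do two or more accesses spanning machine-word boundaries. Characters are a special case; they cost the same to access wherever they live inside a single machine word. That is why chars have no preferred alignment.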

