Position Independent Code (PIC) in shared libraries

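A very good article explaining how position-independent code is implemented on x86. The translation is not great; criticism and corrections are welcome.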

I’ve described the need for special handling of shared libraries while loading them into the process’s address space in a previous article. Briefly, when the linker creates a shared library, it doesn’t know in advance where it might be loaded. This creates a problem for the data and code references within the library, which must somehow be made to point to the correct memory locations.

There are two main approaches to solve this problem in Linux ELF shared libraries:

  1. Load-time relocation
  2. Position independent code (PIC)

Load-time relocation was already covered. Here, I want to explain the second approach – PIC.

I originally planned to focus on both x86 and x64 (a.k.a. x86-64) in this article, but as it grew longer and longer I decided it won’t be practical. So, it will explain only how PIC works on x86, picking this older architecture specifically because (unlike x64) it wasn’t designed with PIC in mind, so implementing PIC on it is a bit trickier. A future (hopefully much shorter) article will build upon the foundation of this one to explain how PIC is implemented on x64.

Some problems of load-time relocation

As we’ve seen in the previous article, load-time relocation is a fairly straightforward method, and it works. PIC, however, is much more popular nowadays, and is usually the recommended method of building shared libraries. Why is this so?

Load-time relocation has a couple of problems: it takes time to perform, and it makes the text section of the library non-shareable.

First, the performance problem. If a shared library was linked with load-time relocation entries, it will take some time to actually perform these relocations when the application is loaded. You may think that the cost shouldn’t be too large – after all, the loader doesn’t have to scan through the whole text section – it should only look at the relocation entries. But if a complex piece of software loads multiple large shared libraries at start-up, and each shared library must first have its load-time relocations applied, these costs can build up and result in a noticeable delay in the start-up time of the application.

Second, the non-shareable text section problem, which is somewhat more serious. One of the main points of having shared libraries in the first place, is saving RAM. Some common shared libraries are used by multiple applications. If the text section (where the code is) of the shared library can only be loaded into memory once (and then mapped into the virtual memories of many processes), considerable amounts of RAM can be saved. But this is not possible with load-time relocation, since when using this technique the text section has to be modified at load-time to apply the relocations. Therefore, for each application that loaded this shared library, it will have to be wholly placed in RAM again[1]. Different applications won’t be able to really share it.

Moreover, having a writable text section (it must be kept writable, to allow the dynamic loader to perform the relocations) poses a security risk, making it easier to exploit the application.

As we’ll see in this article, PIC mostly mitigates these problems.

PIC – introduction

The idea behind PIC is simple – add an additional level of indirection to all global data and function references in the code. By cleverly utilizing some artifacts of the linking and loading processes, it’s possible to make the text section of the shared library truly position independent, in the sense that it can be easily mapped into different memory addresses without needing to change one bit. In the next few sections I will explain in detail how this feat is achieved.

Key insight #1 – offset between text and data sections

One of the key insights on which PIC relies is the offset between the text and data sections, known to the linker at link-time. When the linker combines several object files together, it collects their sections (for example, all text sections get unified into a single large text section). Therefore, the linker knows both the sizes of the sections and their relative locations.

For example, the text section may be immediately followed by the data section, so the offset from any given instruction in the text section to the beginning of the data section is just the size of the text section minus the offset of the instruction from the beginning of the text section – and both these quantities are known to the linker.

[Figure: the code section loaded at 0xXXXX0000, followed by the data section at 0xXXXXF000]

In the diagram above, the code section was loaded into some address (unknown at link-time) 0xXXXX0000 (the X-es literally mean "don’t care"), and the data section right after it at 0xXXXXF000. Then, if some instruction at offset 0x80 in the code section wants to reference stuff in the data section, the linker knows the relative offset (0xEF80 in this case) and can encode it in the instruction.

 

Note that it wouldn’t matter if another section was placed between the code and data sections, or if the data section preceded the code section. Since the linker knows the sizes of all sections and decides where to place them, the insight holds.

Key insight #2 – making an IP-relative offset work on x86

The above is only useful if we can actually put the relative offset to work. But data references (e.g. in the mov instruction) on x86 require absolute addresses. So, what can we do?

If we have a relative address and need an absolute address, what’s missing is the value of the instruction pointer (since, by definition, the relative address is relative to the instruction’s location). There’s no instruction to obtain the value of the instruction pointer on x86, but we can use a simple trick to get it. Here’s some assembly pseudo-code that demonstrates it:

    call TMPLABEL
TMPLABEL:
    pop ebx

What happens here is:

  1. The CPU executes call TMPLABEL, which causes it to save the address of the next instruction (the pop ebx) on the stack and jump to the label.
  2. Since the instruction at the label is pop ebx, it gets executed next. It pops a value from the stack into ebx. But this value is the address of the instruction itself, so ebx now effectively contains the value of the instruction pointer.

 

 

The Global Offset Table (GOT)

With this at hand, we can finally get to the implementation of position-independent data addressing on x86. It is accomplished by means of a "global offset table", or in short GOT.

A GOT is simply a table of addresses, residing in the data section. Suppose some instruction in the code section wants to refer to a variable. Instead of referring to it directly by absolute address (which would require a relocation), it refers to an entry in the GOT. Since the GOT is in a known place in the data section, this reference is relative and known to the linker. The GOT entry, in turn, will contain the absolute address of the variable:

[Figure: an instruction in the code section refers to a GOT entry in the data section, which holds the absolute address of the variable]

In pseudo-assembly, we replace an absolute addressing instruction:

; Place the value of the variable in edx
mov edx, [ADDR_OF_VAR]

With displacement addressing from a register, along with an extra indirection:

; 1. Somehow get the address of the GOT into ebx
lea ebx, ADDR_OF_GOT

; 2. Suppose ADDR_OF_VAR is stored at offset 0x10
;    in the GOT. Then this will place ADDR_OF_VAR
;    into edx.
mov edx, DWORD PTR [ebx + 0x10]

; 3. Finally, access the variable and place its
;    value into edx.
mov edx, DWORD PTR [edx]

So, we’ve gotten rid of a relocation in the code section by redirecting variable references through the GOT. But we’ve also created a relocation in the data section. Why? Because the GOT still has to contain the absolute address of the variable for the scheme described above to work. So what have we gained?


 

A lot, it turns out. A relocation in the data section is much less problematic than one in the code section, for two reasons (which directly address the two main problems of load-time relocation of code described in the beginning of the article):

  1. Relocations in the code section are required per variable reference, while in the GOT we only need to relocate once per variable. There are likely many more references to variables than variables, so this is more efficient.
  2. The data section is writable and not shared between processes anyway, so adding relocations to it does no harm. Moving relocations out of the code section, however, allows making it read-only and sharing it between processes.

PIC with data references through GOT – an example

I will now show a complete example that demonstrates the mechanics of PIC:

int myglob = 42;

int ml_func(int a, int b)
{
    return myglob + a + b;
}

This chunk of code will be compiled into a shared library (using the -fpic and -shared flags as appropriate) named libmlpic_dataonly.so.

Let’s take a look at its disassembly, focusing on the ml_func function:

0000043c <ml_func>:
 43c:   55                      push   ebp
 43d:   89 e5                   mov    ebp,esp
 43f:   e8 16 00 00 00          call   45a <__i686.get_pc_thunk.cx>
 444:   81 c1 b0 1b 00 00       add    ecx,0x1bb0
 44a:   8b 81 f0 ff ff ff       mov    eax,DWORD PTR [ecx-0x10]
 450:   8b 00                   mov    eax,DWORD PTR [eax]
 452:   03 45 08                add    eax,DWORD PTR [ebp+0x8]
 455:   03 45 0c                add    eax,DWORD PTR [ebp+0xc]
 458:   5d                      pop    ebp
 459:   c3                      ret

0000045a <__i686.get_pc_thunk.cx>:
 45a:   8b 0c 24                mov    ecx,DWORD PTR [esp]
 45d:   c3                      ret

I’m going to refer to instructions by their addresses (the left-most number in the disassembly). This address is the offset from the load address of the shared library.

  • At 43f, the address of the next instruction is placed into ecx, by means of the technique described in the "key insight #2" section above.
  • At 444, a known constant offset from the instruction to the place where the GOT is located is added to ecx. So ecx now serves as a base pointer to the GOT.
  • At 44a, a value is taken from [ecx - 0x10], which is a GOT entry, and placed into eax. This is the address of myglob.
  • At 450, the indirection is done, and the value of myglob is placed into eax.
  • Later, the parameters a and b are added to myglob and the value is returned (by keeping it in eax).

We can also query the shared library with readelf -S to see where the GOT section was placed:

Section Headers:
  [Nr] Name     Type            Addr     Off    Size   ES Flg Lk Inf Al
  <snip>
  [19] .got     PROGBITS        00001fe4 000fe4 000010 04  WA  0   0  4
  [20] .got.plt PROGBITS        00001ff4 000ff4 000014 04  WA  0   0  4
  <snip>

Let’s do some math to check the computation done by the compiler to find myglob. As I mentioned above, the call to __i686.get_pc_thunk.cx places the address of the next instruction into ecx. That address is 0x444[2]. The next instruction then adds 0x1bb0 to it, and the result in ecx is going to be 0x1ff4. Finally, to actually obtain the GOT entry holding the address of myglob, displacement addressing is used – [ecx - 0x10], so the entry is at 0x1fe4, which is the first entry in the GOT according to the section header.

 

Why there’s another section whose name starts with .got will be explained later in the article [3]. Note that the compiler chooses to point ecx just past the end of the GOT and then uses negative offsets to obtain entries. This is fine, as long as the math works out. And so far it does.

There’s something we’re still missing, however. How does the address of myglob actually get into the GOT slot at 0x1fe4? Recall that I mentioned a relocation, so let’s find it:

> readelf -r libmlpic_dataonly.so

Relocation section '.rel.dyn' at offset 0x2dc contains 5 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
00002008  00000008 R_386_RELATIVE
00001fe4  00000406 R_386_GLOB_DAT    0000200c   myglob
<snip>

Note the relocation entry for myglob, pointing to address 0x1fe4, as expected. The relocation is of type R_386_GLOB_DAT, which simply tells the dynamic loader: "put the actual value of the symbol (i.e. its address) into that offset". So everything works out nicely. All that’s left is to check how it actually looks when the library is loaded. We can do this by writing a simple "driver" executable that links to libmlpic_dataonly.so and calls ml_func, and then running it through GDB.

> gdb driver
[...] skipping output
(gdb) set environment LD_LIBRARY_PATH=.
(gdb) break ml_func
[...]
(gdb) run
Starting program: [...]pic_tests/driver

Breakpoint 1, ml_func (a=1, b=1) at ml_reloc_dataonly.c:5
5         return myglob + a + b;
(gdb) set disassembly-flavor intel
(gdb) disas ml_func
Dump of assembler code for function ml_func:
   0x0013143c <+0>:   push   ebp
   0x0013143d <+1>:   mov    ebp,esp
   0x0013143f <+3>:   call   0x13145a <__i686.get_pc_thunk.cx>
   0x00131444 <+8>:   add    ecx,0x1bb0
=> 0x0013144a <+14>:  mov    eax,DWORD PTR [ecx-0x10]
   0x00131450 <+20>:  mov    eax,DWORD PTR [eax]
   0x00131452 <+22>:  add    eax,DWORD PTR [ebp+0x8]
   0x00131455 <+25>:  add    eax,DWORD PTR [ebp+0xc]
   0x00131458 <+28>:  pop    ebp
   0x00131459 <+29>:  ret
End of assembler dump.
(gdb) i registers
eax            0x1    1
ecx            0x132ff4       1257460
[...] skipping output

The debugger has entered ml_func, and stopped at IP 0x0013144a[4]. We see that ecx holds the value 0x132ff4 (which is the address of the instruction following the call, 0x00131444, plus 0x1bb0, as explained before). Note that at this point, at runtime, these are absolute addresses – the shared library has already been loaded into the address space of the process.

So, the GOT entry for myglob is at [ecx - 0x10]. Let’s check what’s there:

(gdb) x 0x132fe4
0x132fe4:     0x0013300c

So, we’d expect 0x0013300c to be the address of myglob. Let’s verify:

(gdb) p &myglob
$1 = (int *) 0x13300c

Indeed, it is!

Function calls in PIC

Alright, so this is how data addressing works in position independent code. But what about function calls? Theoretically, the exact same approach could work for function calls as well. Instead of call actually containing the address of the function to call, let it contain the address of a known GOT entry, and fill in that entry during loading.

But this is not how function calls work in PIC. What actually happens is a bit more complicated. Before I explain how it’s done, a few words about the motivation for such a mechanism.

The lazy binding optimization

When a shared library refers to some function, the real address of that function is not known until load time. Resolving this address is called binding, and it’s something the dynamic loader does when it loads the shared library into the process’s memory space. This binding process is non-trivial, since the loader has to actually look up the function symbol in special tables[5].

So, resolving each function takes time. Not a lot of time, but it adds up since the number of functions in libraries is typically much larger than the number of global variables. Moreover, most of these resolutions are done in vain, because in a typical run of a program only a fraction of functions actually get called (think about various functions handling error and special conditions, which typically don’t get called at all).

So, to speed up this process, a clever lazy binding scheme was devised. "Lazy" is a generic name for a family of optimizations in computer programming, where work is delayed until the last moment when it’s actually needed, with the intention of avoiding doing this work if its results are never required during a specific run of a program. Good examples of laziness are copy-on-write and lazy evaluation.

This lazy binding scheme is attained by adding yet another level of indirection – the PLT.

The Procedure Linkage Table (PLT)

The PLT is part of the executable text section, consisting of a set of entries (one for each external function the shared library calls). Each PLT entry is a short chunk of executable code. Instead of calling the function directly, the code calls an entry in the PLT, which then takes care to call the actual function. This arrangement is sometimes called a "trampoline". Each PLT entry also has a corresponding entry in the GOT which contains the actual address of the function, but only once the dynamic loader has resolved it. I know this is confusing, but hopefully it will become clearer once I explain the details in the next few paragraphs and diagrams.

As the previous section mentioned, PLTs allow lazy resolution of functions. When the shared library is first loaded, the function calls have not been resolved yet:

[Figure: before the first call, the code calls func@plt (PLT[n]); GOT[n] points back into PLT[n], just after the jump, and PLT[0] calls the resolver]

Explanation:

  • In the code, a function func is called. The compiler translates it to a call to func@plt, which is some N-th entry in the PLT.
  • The PLT consists of a special first entry, followed by a bunch of identically structured entries, one for each function needing resolution.
  • Each PLT entry but the first consists of these parts:
    • A jump to a location which is specified in a corresponding GOT entry
    • Preparation of arguments for a "resolver" routine
    • Call to the resolver routine, which resides in the first entry of the PLT
  • The first PLT entry is a call to a resolver routine, which is located in the dynamic loader itself[6]. This routine resolves the actual address of the function. More on its action a bit later.
  • Before the function’s actual address has been resolved, the Nth GOT entry just points to after the jump. This is why this arrow in the diagram is colored differently – it’s not an actual jump, just a pointer.

What happens when func is called for the first time is this:

  • PLT[n] is called and jumps to the address pointed to in GOT[n].
  • This address points into PLT[n] itself, to the preparation of arguments for the resolver.
  • The resolver is then called.
  • The resolver performs resolution of the actual address of func, places its actual address into GOT[n] and calls func.

After the first call, the diagram looks a bit different:

[Figure: after the first call, GOT[n] points directly to func]

Note that GOT[n] now points to the actual func [7] instead of back into the PLT. So, when func is called again:

  • PLT[n] is called and jumps to the address pointed to in GOT[n].
  • GOT[n] points to func, so this just transfers control to func.

In other words, now func is being actually called, without going through the resolver, at the cost of one additional jump. That’s all there is to it, really. This mechanism allows lazy resolution of functions, and no resolution at all for functions that aren’t actually called.


 

It also leaves the code/text section of the library completely position independent, since the only place where an absolute address is used is the GOT, which resides in the data section and will be relocated by the dynamic loader. Even the PLT itself is PIC, so it can live in the read-only text section.

I didn’t get into much detail regarding the resolver, but it’s really not important for our purpose here. The resolver is simply a chunk of low-level code in the loader that does symbol resolution. The arguments prepared for it in each PLT entry, along with a suitable relocation entry, help it know about the symbol that needs resolution and about the GOT entry to update.

PIC with function calls through PLT and GOT – an example

Once again, to fortify the hard-learned theory with a practical demonstration, here’s a complete example showing function call resolution using the mechanism described above. I’ll be moving forward a bit faster this time.

Here’s the code for the shared library:

int myglob = 42;

int ml_util_func(int a)
{
    return a + 1;
}

int ml_func(int a, int b)
{
    int c = b + ml_util_func(a);
    myglob += c;
    return b + myglob;
}

This code will be compiled into libmlpic.so, and the focus is going to be on the call to ml_util_func from ml_func. Let’s first disassemble ml_func:

00000477 <ml_func>:
 477:   55                      push   ebp
 478:   89 e5                   mov    ebp,esp
 47a:   53                      push   ebx
 47b:   83 ec 24                sub    esp,0x24
 47e:   e8 e4 ff ff ff          call   467 <__i686.get_pc_thunk.bx>
 483:   81 c3 71 1b 00 00       add    ebx,0x1b71
 489:   8b 45 08                mov    eax,DWORD PTR [ebp+0x8]
 48c:   89 04 24                mov    DWORD PTR [esp],eax
 48f:   e8 0c ff ff ff          call   3a0 <ml_util_func@plt>
 <... snip more code>

The interesting part is the call to ml_util_func@plt. Note also that the address of the GOT is in ebx. Here’s what ml_util_func@plt looks like (it’s in an executable section called .plt):

000003a0 <ml_util_func@plt>:
 3a0:   ff a3 14 00 00 00       jmp    DWORD PTR [ebx+0x14]
 3a6:   68 10 00 00 00          push   0x10
 3ab:   e9 c0 ff ff ff          jmp    370 <_init+0x30>

Recall that each PLT entry consists of three parts:

  • A jump to an address specified in GOT (this is the jump to [ebx+0x14])
  • Preparation of arguments for the resolver
  • Call to the resolver

The resolver (PLT entry 0) resides at address 0x370, but it’s of no interest to us here. What’s more interesting is to see what the GOT contains. For that, we first have to do some math.

The "get IP" trick in ml_func was done on address 0x483, to which 0x1b71 is added. So the base of the GOT is at 0x1ff4. We can take a peek at the GOT contents with readelf[8]:

> readelf -x .got.plt libmlpic.so

Hex dump of section '.got.plt':
  0x00001ff4 241f0000 00000000 00000000 86030000 $...............
  0x00002004 96030000 a6030000                   ........

The GOT entry ml_util_func@plt looks at is at offset +0x14, or 0x2008. From above, the word at that location is 0x3a6, which is the address of the push instruction in ml_util_func@plt.

 

To help the dynamic loader do its job, a relocation entry is also added and specifies which place in the GOT to relocate for ml_util_func:

> readelf -r libmlpic.so
[...] snip output

Relocation section '.rel.plt' at offset 0x328 contains 3 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
00002000  00000107 R_386_JUMP_SLOT   00000000   __cxa_finalize
00002004  00000207 R_386_JUMP_SLOT   00000000   __gmon_start__
00002008  00000707 R_386_JUMP_SLOT   0000046c   ml_util_func

The last line means that the dynamic loader should place the value (address) of symbol ml_util_func into 0x2008 (which, recall, is the GOT entry for this function).

 

It would be interesting to see this GOT entry modification actually happen after the first call. Let’s once again use GDB for the inspection.

> gdb driver
[...] skipping output
(gdb) set environment LD_LIBRARY_PATH=.
(gdb) break ml_func
Breakpoint 1 at 0x80483c0
(gdb) run
Starting program: /pic_tests/driver

Breakpoint 1, ml_func (a=1, b=1) at ml_main.c:10
10        int c = b + ml_util_func(a);
(gdb)

We’re now before the first call to ml_util_func. Recall that the GOT is pointed to by ebx in this code. Let’s see what’s in it:

(gdb) i registers ebx
ebx            0x132ff4

And the entry we need is at [ebx+0x14]:

(gdb) x/w 0x133008
0x133008:     0x001313a6

Yep, the 0x3a6 ending looks right. Now, let’s step until after the call to ml_util_func and check again:

(gdb) step
ml_util_func (a=1) at ml_main.c:5
5         return a + 1;
(gdb) x/w 0x133008
0x133008:     0x0013146c


The value at 0x133008 was changed. Hence, 0x0013146c should be the real address of ml_util_func, placed in there by the dynamic loader:

(gdb) p &ml_util_func
$1 = (int (*)(int)) 0x13146c <ml_util_func>

Just as expected.


 

Controlling if and when the resolution is done by the loader

This would be a good place to mention that the process of lazy symbol resolution performed by the dynamic loader can be configured with some environment variables (and with corresponding flags passed when linking the shared library). This is sometimes useful for special performance requirements or debugging.

The LD_BIND_NOW env var, when defined, tells the dynamic loader to resolve all symbols at start-up time rather than lazily. You can easily verify this by setting the env var and re-running the previous sample under GDB: the GOT entry for ml_util_func will contain its real address even before the first call to the function.

 

Conversely, the LD_BIND_NOT env var tells the dynamic loader not to update the GOT entry at all. Each call to an external function will then go through the dynamic loader and be resolved anew.

 

The dynamic loader is configurable by other flags as well. I encourage you to go over man ld.so: it contains some interesting information.

 

The costs of PIC

This article started by stating the problems of load-time relocation and how the PIC approach fixes them. But PIC is not without problems of its own. One immediately apparent cost is the extra indirection required for all external references to data and code in PIC. That’s an extra memory load for each reference to a global variable, and for each call to a function. How problematic this is in practice depends on the compiler, the CPU architecture and the particular application.

 

Another, less apparent cost is the increased register usage required to implement PIC. In order to avoid locating the GOT too frequently, it makes sense for the compiler to generate code that keeps its address in a register (usually ebx). But that ties down a whole register just for the sake of the GOT. While not a big problem for RISC architectures, which tend to have a lot of general-purpose registers, it presents a performance problem for architectures like x86, which has only a small number of registers. PIC means having one general-purpose register less, which adds indirect costs, since more memory references now have to be made.

 

Conclusion

This article explained what position independent code is, and how it helps create shared libraries with shareable read-only text sections. There are some tradeoffs when choosing between PIC and its alternative (load-time relocation), and the eventual outcome depends on a lot of factors, such as the CPU architecture on which the program is going to run.

 

That said, PIC is becoming more and more popular. Some non-Intel architectures like SPARC64 force PIC-only code for shared libraries, and many others (for example, ARM) include IP-relative addressing modes to make PIC more efficient. Both are true for the successor of x86, the x64 architecture. I will discuss PIC on x64 in a future article.

 

The focus of this article, however, has not been on performance considerations or architectural decisions. My aim was to explain, given that PIC is used, how it works. If the explanation wasn’t clear enough, please let me know in the comments and I will try to provide more information.

 

 

[1] Unless all applications load this library into the exact same virtual memory address. But this usually isn’t done on Linux.
[2] 0x444 (and all other addresses mentioned in this computation) is relative to the load address of the shared library, which is unknown until an executable actually loads it at runtime. Note how it doesn’t matter in the code since it only juggles relative addresses.
[3] The astute reader may wonder why .got is a separate section at all. Didn’t I just show in the diagrams that it’s located in the data section? In practice, it is. I don’t want to get into the distinction between ELF sections and segments here, since that would take us too far away from the point. But briefly, any number of "data" sections can be defined for a library and mapped into a read-write segment. This doesn’t really matter, as long as the ELF file is organized correctly. Separating the data segment into different logical sections provides modularity and makes the linker’s job easier.
[4] Note that gdb skipped the part where ecx is assigned. That’s because it’s kind-of considered to be part of the function’s prolog (the real reason is in the way gcc structures its debug info, of course). Several references to global data and functions are made inside a function, and a register pointing to the GOT can serve all of them.
[5] Shared library ELF objects actually come with special hash table sections for this purpose.
[6] The dynamic loader on Linux is just another shared library which gets loaded into the address space of all running processes.
[7] I placed func in a separate code section, although in theory this could be the same one where the call to func is made (i.e. in the same shared library). The "extra credit" section of this article has information about why a call to an external function in the same shared library needs PIC (or relocation) as well.
[8] Recall that in the data reference example I promised to explain why there are apparently two GOT sections in the object: .got and .got.plt. Now it should become obvious that this is just to conveniently split the GOT entries required for global data from the GOT entries required for the PLT. This is also why, when the GOT offset is computed in functions, it points to .got.plt, which comes right after .got. This way, negative offsets lead us to .got, while positive offsets lead us to .got.plt. While convenient, such an arrangement is by no means compulsory. Both parts could be placed into a single .got section.
