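Author: 张华 (Zhang Hua)  Published: 2016-03-22
Copyright notice: this article may be freely reposted, provided the repost links to the original source and includes the author information and this copyright notice
( http://blog.csdn.net/quqi99 )
This was my first piece of kernel-related work: a crash that occurred while running OpenStack on an HP DL360p Gen8 server.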
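First, rule out APIC problems
The APIC (Advanced Programmable Interrupt Controller) manages interrupts. Do not confuse it with ACPI (Advanced Configuration and Power Interface), which manages power.
I/O devices can access memory directly through DMA (Direct Memory Access), using what is called a DMA address (also known as a bus address). Originally a DMA address was simply a physical memory address; the buffer had to be one contiguous block and had to sit in low memory, which was inflexible and unsuitable for virtual machines. DMA remapping (DMAR) was introduced to solve this: the IOMMU hardware translates the DMA addresses used by I/O devices into actual physical memory addresses and checks access permissions along the way, which provides isolation.
An access to a DMA address that has no entry in the translation table triggers a fault, much as a CPU access to an unmapped virtual address triggers a page fault. In the page-fault case the kernel decides whether the address is legal: if it is, the kernel allocates physical memory and installs the virtual-to-physical translation; if it is not, the kernel sends the process a signal, which produces a core dump. Faults are therefore either recoverable or non-recoverable, and the IOMMU reports them to the kernel through an interrupt (the interrupt handler is dmar_fault).
The kernel initializes DMAR with the alloc_iommu function, based on the DMAR table in ACPI. Each table entry corresponds to one DMAR device, named dmar0, dmar1, and so on.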
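The translate-and-check step described above can be sketched as a toy model in C. This is only an illustration, not how the Intel IOMMU driver is written: the single-level table, the names toy_iommu_entry and toy_dma_translate, and the 16-entry table size are all assumptions made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_TABLE_SIZE 16   /* hypothetical: 16 I/O pages per device */

enum toy_dma_result {
    TOY_DMA_OK,
    TOY_DMA_FAULT_NOT_PRESENT,   /* no translation entry for this DMA address */
    TOY_DMA_FAULT_PERMISSION     /* entry exists but forbids this access */
};

/* One entry of a hypothetical single-level remapping table. */
struct toy_iommu_entry {
    bool     present;
    bool     writable;
    uint64_t phys_page;          /* physical page frame number */
};

/* Translate a device-visible DMA address and check the access permission. */
static enum toy_dma_result toy_dma_translate(const struct toy_iommu_entry *table,
                                             uint64_t dma_addr, bool is_write,
                                             uint64_t *phys_out)
{
    uint64_t index = dma_addr >> TOY_PAGE_SHIFT;
    uint64_t offset = dma_addr & ((1ULL << TOY_PAGE_SHIFT) - 1);

    if (index >= TOY_TABLE_SIZE || !table[index].present)
        return TOY_DMA_FAULT_NOT_PRESENT;
    if (is_write && !table[index].writable)
        return TOY_DMA_FAULT_PERMISSION;

    *phys_out = (table[index].phys_page << TOY_PAGE_SHIFT) | offset;
    return TOY_DMA_OK;
}
```

In this model a read through a present, read-only entry succeeds, a write through the same entry reports a permission fault, and any access to an unmapped page reports a not-present fault. Real hardware uses multi-level tables selected per device, but the decision structure is the same.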
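Each CPU may have its own interrupt controller, the local APIC (one per CPU). It handles interrupts from local I/O devices; interrupts from external I/O devices, which normally pass through the I/O APIC before reaching the local APIC but can also go to the local APIC directly without the I/O APIC (this is MSI); inter-processor interrupts (IPIs); APIC timer interrupts; performance-monitoring counter interrupts; thermal sensor interrupts; and APIC internal interrupts. There is also a shared controller, the I/O APIC (usually one per machine).
APIC hardware has gone through three generations: APIC, xAPIC, and x2APIC. In xAPIC mode the APIC registers are memory-mapped into a range of physical addresses. x2APIC mode drops the memory-mapped interface and accesses the APIC registers through MSRs instead. MSR stands for model-specific register. The benefit is that memory address conflicts are no longer a concern.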
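As a small illustration of the xAPIC-to-x2APIC change, the sketch below maps an xAPIC memory-mapped register offset to the corresponding x2APIC MSR number. The rule used (MSR = 0x800 + offset / 16) is the one I remember from the Intel SDM; the function name is my own, and the rule does not cover every register (for example, the two 32-bit ICR halves become a single 64-bit MSR in x2APIC mode).

```c
#include <stdint.h>

#define X2APIC_MSR_BASE 0x800u   /* first MSR of the x2APIC register range */

/*
 * Map an xAPIC MMIO register offset (relative to the APIC base address)
 * to the x2APIC MSR number that replaces it.
 */
static uint32_t x2apic_msr_from_mmio_offset(uint32_t mmio_offset)
{
    return X2APIC_MSR_BASE + (mmio_offset >> 4);
}
```

For example, the APIC ID register at offset 0x20 maps to MSR 0x802 and the EOI register at offset 0xB0 maps to MSR 0x80B.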
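In practice the kernel reads the DMAR table from the BIOS, and the BIOS on HP Gen8 is buggy: the DMAR table it exposes through ACPI is broken (the system log usually reports "Your BIOS is broken; DMAR reported at address").
A driver is part of the OS and runs on the host CPU. Firmware is part of the device: it is an image for the device, which the kernel loads into memory and the driver then transfers to the device, where it runs on the device's own processor. During initialization a driver can load firmware through kernel interfaces such as request_firmware.
The kernel started supporting x2APIC before some BIOSes did, which caused problems. The BIOS can therefore set the x2APIC opt-out flag in the DMAR table to keep the kernel out of x2APIC mode, and the kernel needs code to honor that opt-out (see git log 41750d3; intremap=no_x2apic_optout tells the kernel to ignore the opt-out).
- Older than Gen8 (<Gen8): the hardware does not support x2APIC, so the kernel does not use x2APIC. This is correct.
- Newer than Gen8 (>Gen8): the hardware supports x2APIC, and the kernel uses x2APIC.
- Gen8 (=Gen8): the hardware supports x2APIC, but the BIOS does not pass a correct DMAR table to the kernel, so the kernel needs the settings below to work around it. The firmware still uses x2APIC, however, so the kernel may crash or show other strange problems.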
# let X2APIC enabled with IRQ remapping, so it only differs from xapic in IRQ remapping
"intel_idle.max_cstate=0 intremap=no_x2apic_optout"
OR
# disable X2APIC AND IRQ remapping, will use xapic instead of x2apic
"intel_idle.max_cstate=0 nox2apic intremap=off"
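Disabling C-states
The idle states a processor enters to save power while it has nothing to run are called C-states. C0 is the normal running state; higher-numbered states save more power, but scheduling work onto a CPU that sits in such a state adds a little latency. Applications that do not need power saving can therefore disable C-states (at the cost of a warmer data center). The kernel ships its own C-state driver for Intel processors, intel_idle. Because it is part of the kernel, it stays in effect even if C-states are disabled in the BIOS setup. To disable the intel_idle driver, use the kernel parameter intel_idle.max_cstate=0; the kernel then falls back to the C-state control described by the ACPI tables that the BIOS provides.
General Protection Fault
The oops is reported as a general protection fault. The kernel is tainted (make Tainted: G D). Among the taint flags, P means a proprietary module was loaded, F a module was force-loaded, R a module was force-unloaded, M a machine check occurred, and B a bad page was detected. In 'G D', G means all loaded modules have a GPL-compatible license and D means an oops or BUG() has occurred. For details see: http://lxr.free-electrons.com/source/Documentation/oops-tracing.txt. The faulting process was running on CPU 35, and the faulting code is at RIP = kmem_cache_alloc_trace+0x5e/0x140 or RIP = __kmalloc+0x7b/0x190.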
(gdb) list *(__kmalloc+0x7b)
0xffffffff8116611b is in __kmalloc (/build/buildd/linux-3.2.0/mm/slub.c:2325).
2320     * 3. If they were not changed replace tid and freelist
2321     *
2322     * Since this is without lock semantics the protection is only against
2323     * code executing on this cpu *not* from access by other cpus.
2324     */
2325    if (unlikely(!irqsafe_cpu_cmpxchg_double(
2326            s->cpu_slab->freelist, s->cpu_slab->tid,
2327            object, tid,
2328            get_freepointer_safe(s, object), next_tid(tid)))) {
(gdb) list *(kmem_cache_alloc_trace+0x5e)
0xffffffff8116657e is in kmem_cache_alloc_trace (/build/buildd/linux-3.2.0/mm/slub.c:2325).
2320     * 3. If they were not changed replace tid and freelist
2321     *
2322     * Since this is without lock semantics the protection is only against
2323     * code executing on this cpu *not* from access by other cpus.
2324     */
2325    if (unlikely(!irqsafe_cpu_cmpxchg_double(
2326            s->cpu_slab->freelist, s->cpu_slab->tid,
2327            object, tid,
2328            get_freepointer_safe(s, object), next_tid(tid)))) {
Both addresses above resolve to the same irqsafe_cpu_cmpxchg_double call. Disassembling it (for example with objdump) gives the following machine code:
0xffffffff81166576 <+86>: mov (%r12),%rsi
0xffffffff8116657e <+94>: mov 0x0(%r13,%rax,1),%rbx ### R13 = UPPER HALF OF BASE POINTER
0xffffffff81166583 <+99>: mov %r13,%rax
0xffffffff81166586 <+102>: callq 0xffffffff8131cb20 ### CALL
Next, look at the machine code of the function at 0xffffffff8131cb20:
(gdb) x/30i 0xffffffff8131cb20 ### CALLED
0xffffffff8131cb20: pushfq ### PUSH RFLAGS into stack
0xffffffff8131cb21: cli ### **** CLEAR INTERRUPT FLAG ****
0xffffffff8131cb22: cmp %gs:(%rsi),%rax
So the call chain should be as follows. My guess is that the general protection fault most likely happened while local_irq_save was executing on CPU 35.
irqsafe_cpu_cmpxchg_double (#define) ->
irqsafe_generic_cpu_cmpxchg_double (#define) ->
local_irq_save(#define)->...
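Finding the source with crash
The source lines above were located with gdb.
Running gdb directly on vmlinux, as below, shows that the source is /build/buildd/linux-lts-trusty-3.13.0/mm/slub.c:2412. To read further in the source, the kernel source has to be installed at the path gdb reports, /build/buildd/linux-lts-trusty-3.13.0.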
gdb /mnt/ddeb-3.13.0-34.60/usr/lib/debug/boot/vmlinux-3.13.0-45-generic
Reading symbols from /mnt/ddeb-3.13.0-34.60/usr/lib/debug/boot/vmlinux-3.13.0-45-generic...done.
(gdb) list *(kmem_cache_alloc_trace+0x5e)
0xffffffff811af2de is in kmem_cache_alloc_trace (/build/buildd/linux-lts-trusty-3.13.0/mm/slub.c:2412).
2407 /build/buildd/linux-lts-trusty-3.13.0/mm/slub.c: No such file or directory
That is too much trouble, so let's see how to find the faulting source line with crash:
crash> dis kmem_cache_alloc_trace+0x7c
0xffffffff811af2fc <kmem_cache_alloc_trace+0x7c>: mov 0x0(%r13,%rax,1),%rbx
ubuntu@server-2e5cbffc-ed54-465b-9934-f8681dba8c21:/mnt/73670$ addr2line 0xffffffff811af2fc -e /mnt/ddeb-3.13.0-34.60/usr/lib/debug/boot/vmlinux-3.13.0-45-generic -f -i
get_freepointer
/build/buildd/linux-lts-trusty-3.13.0/mm/slub.c:260
get_freepointer_safe
/build/buildd/linux-lts-trusty-3.13.0/mm/slub.c:275
slab_alloc_node
/build/buildd/linux-lts-trusty-3.13.0/mm/slub.c:2416
slab_alloc
/build/buildd/linux-lts-trusty-3.13.0/mm/slub.c:2455
kmem_cache_alloc_trace
/build/buildd/linux-lts-trusty-3.13.0/mm/slub.c:2472
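The cmpxchg atomic operation
Let's look at what the cmpxchg in irqsafe_generic_cpu_cmpxchg_double above is. When several threads insert at the head of the same linked list, synchronization problems are likely. Pessimistic locking is too expensive, so optimistic locking can be used instead, retrying until no conflicting update is observed: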
do{
old_head = queue->head;
new_head->next = old_head;
if (old_head == queue->head){
queue->head = new_head;
}
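}while(queue->head != new_head);
The problem is that lines 4 and 5 above (the comparison and the assignment) are not guaranteed to execute atomically. The CPU's cas/cmpxchg instruction exists precisely to make this step atomic. For details see: http://blog.chinaunix.net/uid-24830931-id-3487817.html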
R = cmpxchg(A, C, B)
- Assign A = B if A == C
- Return A at the time of the call, unconditionally
Rewritten with the cmpxchg instruction, the code becomes:
do{
old_head = queue->head;
new_head->next = old_head;
val=cmpxchg(&queue->head, old_head, new_head);
}while(val!=old_head);
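For readers who want to try the retry loop outside the kernel, here is a minimal user-space sketch using C11 atomics. It is not the kernel's cmpxchg macro: the names node, lockfree_stack, and stack_push are assumptions for illustration, and atomic_compare_exchange_weak stands in for cmpxchg.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

struct lockfree_stack {
    _Atomic(struct node *) head;
};

/* Push new_head onto the list with an optimistic compare-and-swap loop. */
static void stack_push(struct lockfree_stack *s, struct node *new_head)
{
    struct node *old_head = atomic_load(&s->head);

    do {
        /* Link the new node in front of the head we last observed. */
        new_head->next = old_head;
        /*
         * Publish it only if head is still old_head. On failure the call
         * refreshes old_head with the current head, so the loop retries
         * against up-to-date state.
         */
    } while (!atomic_compare_exchange_weak(&s->head, &old_head, new_head));
}
```

Pushing node a and then node b leaves b at the head with b.next pointing to a. A pop operation built the same way needs extra care because of the ABA problem, which is what the transaction id (tid) in the slub fast path addresses.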
cmpxchg_double
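Searching the git log for the cmpxchg_double used in irqsafe_cpu_cmpxchg_double turns up a bug (git show cdcd62986). It is worth trying the parameters "intel_idle.max_cstate=0 nox2apic intremap=off" on a Linux 3.8 kernel.
Differences between slab and slub
See: http://events.linuxfoundation.org/images/stories/pdf/klf2012_kim.pdf
slab is used on traditional machines, slob on embedded systems, and slub on large systems.
fast-path: allocating from the per-cpu freelist.
slow-path: allocating from the per-cpu cpu slab (which crosses CPUs) or from the partial slabs list.
very slow-path: allocating from the per-node lists (which crosses NUMA nodes) or allocating under a lock.
this_cpu_cmpxchg_double is used to avoid disabling interrupts, and cmpxchg_double is used to avoid taking locks.
The improvement of slub over slab is this: the kernel's slab_alloc uses the cmpxchg principle described above, in optimistic-locking style, to update the freelist of slub objects (the list can be modified concurrently, because several CPUs may be scheduling processes that touch it).
Ruling out a General Protection Fault caused by privilege levels in the interrupt path
If the handler runs at the same privilege level as the interrupted task, the processor saves the current values of the EFLAGS, CS, and EIP registers on the current stack.
If the exception produces an error code, the error code is pushed onto the stack last. To return from an interrupt handler, the handler must use the IRET instruction. IRET is similar to RET, but it also restores the saved flags into EFLAGS. The IOPL field of EFLAGS is restored only when CPL is 0, and the IF flag is changed only when the current privilege level CPL is not greater than IOPL (the I/O Privilege Level in the FLAGS register). If a stack switch took place when the handler was invoked, IRET switches back to the original stack on return. An instruction such as cli raises a General Protection Fault when CPL > IOPL. In the log below, CS: 0010 has its low two bits clear, so CPL = 0 and the code was running in kernel mode. EFLAGS = 00010282 (hexadecimal) == 0001 0000 0010 1000 0010 (binary), so IOPL, bits 12 and 13 of EFLAGS, is 00. CPL is therefore not greater than IOPL, which rules out a General Protection Fault caused by privilege levels.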
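The bit arithmetic on the register values from this section's log (CS 0010, EFLAGS 00010282) can be double-checked with a few helper functions. This is a sketch with helper names of my own choosing; it encodes only the rule that cli faults when CPL > IOPL and ignores the virtual-8086 and VME special cases.

```c
#include <stdint.h>

#define EFLAGS_IF_SHIFT   9    /* interrupt enable flag */
#define EFLAGS_IOPL_SHIFT 12   /* I/O privilege level, bits 12-13 */

/* I/O privilege level stored in EFLAGS. */
static unsigned eflags_iopl(uint64_t eflags)
{
    return (unsigned)((eflags >> EFLAGS_IOPL_SHIFT) & 0x3);
}

/* Interrupt flag: 1 if interrupts were enabled. */
static unsigned eflags_if(uint64_t eflags)
{
    return (unsigned)((eflags >> EFLAGS_IF_SHIFT) & 0x1);
}

/* Current privilege level: the low two bits of the CS selector. */
static unsigned selector_cpl(uint16_t cs)
{
    return cs & 0x3u;
}

/* cli/sti raise #GP in protected mode when CPL is greater than IOPL. */
static int cli_would_fault(uint64_t eflags, uint16_t cs)
{
    return selector_cpl(cs) > eflags_iopl(eflags);
}
```

With EFLAGS 0x00010282 and CS 0x0010 this gives IOPL 0, IF 1, and CPL 0, so cli_would_fault returns 0. With the user-mode values from the bottom of the backtrace (CS 0x0033, RFLAGS 0x246) it returns 1, as expected for ring 3 code.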
0xffffffff8131cb20: pushfq ### PUSH RFLAGS into stack
0xffffffff8131cb21: cli ### **** CLEAR INTERRUPT FLAG ****
...
Nov 27 18:34:52 sgsxeris001 kernel: [521055.548034] RIP: 0010:[<ffffffff8116616e>] [<ffffffff8116616e>] kmem_cache_alloc_trace+0x5e/0x140
Nov 27 18:34:52 sgsxeris001 kernel: [521055.548191] RSP: 0018:ffff883f6f035d98 EFLAGS: 00010282
...
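Nov 27 18:34:52 sgsxeris001 kernel: [521055.548820] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
Is the "per-cpu" memory corrupted?
The kernel uses per-cpu variables to give every CPU its own copy of a variable, which avoids locking and makes good use of the CPU's hardware caches. Static per-cpu variables for all CPUs are defined with DEFINE_PER_CPU and placed in a special section. Dynamic per-cpu variables are still backed by the normal page mechanism. The output below suggests that a free list pointer in a per-cpu kmem cache is corrupted (3.13.0-45-generic shows the problem, 3.13.0-46 does not). The -S option of kmem displays slab cache information: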
crash> kmem -S | grep -i invalid
kmem: invalid kernel virtual address: ffff00088fec6500 type: "get_freepointer"
kmem: invalid kernel virtual address: ffff0008569645a0 type: "get_freepointer"
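One property of these two addresses is worth checking: with 48-bit virtual addresses (which I assume here, since this kernel uses 4-level paging), bits 63 to 47 of a valid address must be all zeros or all ones. An access through a non-canonical address raises a general protection fault rather than a page fault, which would be consistent with the oops type seen in this crash. The helper name below is my own.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if addr is canonical for 48-bit virtual addressing,
 * i.e. bits 63..47 are either all 0 or all 1.
 */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;   /* the 17 most significant bits */

    return top == 0 || top == 0x1ffff;
}
```

Both addresses reported by kmem (ffff00088fec6500 and ffff0008569645a0) fail this check, while an ordinary kernel pointer from the same dump such as ffff880fff403c00 passes it.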
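Next, continue checking with the kernel parameters slub_debug=PU,kmalloc-32 and apparmor=0. (With CONFIG_DEBUG_SLAB, to ease debugging, each object can get a red zone (SLAB_RED_ZONE), a record of the last user of the memory (SLAB_STORE_USER), and poisoned initial contents (SLAB_POISON). When a slab error is found without red zones, the system log shows the cache name and the object start address; with red zones it shows the cache name, the object start address, the last user, the object contents, and the data before and after the object.) Once this option detects a slab error, the kernel discards the old slab page and allocates a new one, so a crash becomes harder to trigger. Without the option, the kernel panics only when the invalid address is eventually used. The parameter slub_debug=PUC can also be used, where C makes the kernel panic as soon as the next allocation or free detects the invalid address. Either way, the dmesg output contains useful information for further analysis.
Other memory checking tools similar to slub_debug include kmemcheck and kasan (v4.0-rc1).
What is KASan
User space has AddressSanitizer (ASan), ThreadSanitizer, and MemorySanitizer, which are now part of the gcc and clang compilers; kernel space has KASan.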
SLUB_DEBUG/DEBUG_SLAB, Can detect some out-of-bounds and use-after-free accesses, Can’t detect out-of-bounds reads, Detects bugs only on allocation / freeing in some cases
DEBUG_PAGEALLOC, Unmaps freed pages from address space, Can detect some use-after-free accesses, Detects use-after-free only when the whole page is unused
kmemcheck, Detects use-after-free accesses and uninitialized-memory-reads, Causes page fault on each memory access (slow)
KASan (Kernel Address Sanitizer, a dynamic memory error detector for the Linux kernel), can detect both out-of-bounds and use-after-free errors, Based on compiler instrumentation, Detects out-of-bounds for both writes and reads, Has strong use-after-free detection, Detects bugs at the point of occurrence, Prints informative reports, may well replace kmemcheck, but is unlikely to replace debug slab and pagealloc, which can be enabled at runtime. kmemcheck could not work on several CPUs, but KASan has no such limitation.
https://gitlab.com/veo-labs/linux/commit/0b24becc810dc3be6e3f94103a866f214c282394
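KASan uses 1/8 of kernel memory as shadow memory: every 8 bytes of memory map to 1 byte of shadow memory. A shadow byte of 0 means all 8 bytes are accessible; a value N from 1 to 7 means only the first N bytes are accessible; different negative values mark the whole 8 bytes as inaccessible for different reasons, such as redzones or freed memory. Shadow memory is tied to normal memory through the direct mapping below (KASAN_SHADOW_SCALE_SHIFT=3).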
unsigned long kasan_mem_to_shadow(unsigned long addr)
{
return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}
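A special compiler is required that inserts a call (__asan_load*(addr) or __asan_store*(addr)) before every memory access of size 1, 2, 4, 8, or 16 bytes to check whether the access is valid. The performance cost therefore has two parts, the inline compiler instrumentation and the simple linear shadow memory: user-space ASan is roughly 2x slower, and KASan in kernel space is roughly 10~30% slower.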
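To make the shadow check concrete, here is a small user-space model of what an instrumented load or store has to verify. It is a sketch, not the kernel's implementation: the addresses are offsets into a toy region, the shadow is a plain array (so the KASAN_SHADOW_OFFSET term becomes the array base), and the model_* names are my own. It also assumes the access is at most 8 bytes and does not cross an 8-byte granule.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SHADOW_SCALE_SHIFT 3   /* 8 bytes of memory per shadow byte */
#define MODEL_GRANULE_MASK       7u

/* Look up the shadow byte that covers the given offset. */
static int8_t model_shadow_byte(const int8_t *shadow, size_t addr)
{
    return shadow[addr >> MODEL_SHADOW_SCALE_SHIFT];
}

/* Check an access of "size" bytes that stays within one 8-byte granule. */
static bool model_access_ok(const int8_t *shadow, size_t addr, size_t size)
{
    int8_t s = model_shadow_byte(shadow, addr);

    if (s == 0)        /* all 8 bytes of the granule are accessible */
        return true;
    if (s < 0)         /* redzone, freed memory, ...: nothing is accessible */
        return false;
    /* Only the first s bytes are accessible: the access must end before s. */
    return (addr & MODEL_GRANULE_MASK) + size <= (size_t)s;
}
```

With a shadow array of {0, 5, -1}, an 8-byte access at offset 0 is valid, a 4-byte access at offset 8 is valid, a 4-byte access at offset 12 runs past the 5 accessible bytes and is rejected, and any access at offset 16 is rejected.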
Setup: build the kernel with CONFIG_KASAN, then insert the test_kasan module (sudo insmod /lib/modules/4.4.0-9-generic/kernel/lib/test_kasan.ko). For how to analyze the output see: https://www.kernel.org/doc/Documentation/kasan.txt
Analyzing the core dump file with crash
$ cat crash-start.sh
exec crash --mod /mnt/ddeb-3.13.0-34.60/usr/lib/debug/lib/modules/3.13.0-45-generic/ /mnt/ddeb-3.13.0-34.60/usr/lib/debug/boot/vmlinux-3.13.0-45-generic dump.201502181928
crash> sys |grep KERNEL
KERNEL: /mnt/ddeb-3.13.0-34.60/usr/lib/debug/boot/vmlinux-3.13.0-45-generic
crash> set
PID: 47656
COMMAND: "make"
TASK: ffff880115fa3000 [THREAD_INFO: ffff881f136ec000]
CPU: 14
STATE: TASK_RUNNING (PANIC)
crash> ps -a |grep 47656
PID: 47656 TASK: ffff880115fa3000 CPU: 14 COMMAND: "make"
ps: cannot access user stack address: 7ffff54eead3
ps: no user stack
crash> vtop -c 47656 7ffff54eead3
VIRTUAL PHYSICAL
7ffff54eead3 5d6e73ad3
PML: f82e7c7f8 => 14fb42067
PUD: 14fb42ff8 => fda9b9067
PMD: fda9b9d50 => fdd61d067
PTE: fdd61d770 => 80000005d6e73865
PAGE: 5d6e73000
PTE PHYSICAL FLAGS
80000005d6e73865 5d6e73000 (PRESENT|USER|ACCESSED|DIRTY|NX)
VMA START END FLAGS FILE
ffff880fe7c34240 7ffff54cd000 7ffff54f2000 100173
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea00175b9cc0 5d6e73000 ffff8813d0b6e341 7fffffffb 1 2ffff000008006c referenced,uptodate,lru,active,swapbacked
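The vtop output above can be reproduced by hand. The sketch below splits a virtual address into the four 9-bit table indices used by x86-64 4-level paging and computes the address of each table entry; the helper names are my own, and the table base addresses passed in the example are taken from the vtop output (each entry value with its low flag bits masked off).

```c
#include <stdint.h>

#define PT_INDEX_MASK 0x1ffULL               /* 9 index bits per level */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL  /* physical frame bits of an entry */
#define PAGE_OFFSET_MASK 0xfffULL

static unsigned pml4_index(uint64_t va) { return (unsigned)((va >> 39) & PT_INDEX_MASK); }
static unsigned pud_index(uint64_t va)  { return (unsigned)((va >> 30) & PT_INDEX_MASK); }
static unsigned pmd_index(uint64_t va)  { return (unsigned)((va >> 21) & PT_INDEX_MASK); }
static unsigned pte_index(uint64_t va)  { return (unsigned)((va >> 12) & PT_INDEX_MASK); }

/* Physical address of entry "index" in a table; entries are 8 bytes wide. */
static uint64_t entry_address(uint64_t table_phys, unsigned index)
{
    return table_phys + 8ULL * index;
}

/* Combine the frame address from a PTE with the page offset of the address. */
static uint64_t pte_to_phys(uint64_t pte, uint64_t va)
{
    return (pte & PTE_ADDR_MASK) | (va & PAGE_OFFSET_MASK);
}
```

For the address 7ffff54eead3 the indices are 255, 511, 0x1aa, and 0xee, which give the entry addresses f82e7c7f8, 14fb42ff8, fda9b9d50, and fdd61d770 shown by vtop, and the PTE value 80000005d6e73865 yields the physical address 5d6e73ad3.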
crash> task |grep sp
sp0 = 0xffff881f136ee000,
sp = 0xffff881f136edf58,
usersp = 0x7ffff54ea970,
crash> rd 0xffff881f136edf58 -e 0xffff881f136ee000
ffff881f136edf58: 00007ffff54ea80c 0000000002474696 ..N......FG.....
ffff881f136edf68: 0000000002474692 00000000026029c0 .FG......)`.....
ffff881f136edf78: 00007ffff54ea8b0 0000000000000000 ..N.............
ffff881f136edf88: 0000000000000246 756265642f343431 F.......144/debu
ffff881f136edf98: 2e322e3170735f33 6e69622f37327970 3_sp1.2.py27/bin
ffff881f136edfa8: 000000000000003b ffffffffffffffff ;...............
ffff881f136edfb8: 00000000026029c0 0000000002603e80 .)`......>`.....
ffff881f136edfc8: 00007ffff54ea80c 000000000000003b ..N.....;.......
ffff881f136edfd8: 00002b8eeccad427 0000000000000033 '....+..3.......
ffff881f136edfe8: 0000000000000246 00007ffff54ea138 F.......8.N.....
ffff881f136edff8: 000000000000002b +.......
crash> files
PID: 47656 TASK: ffff880115fa3000 CPU: 14 COMMAND: "make"
ROOT: / CWD: /...../development@2/secure_vm
FD FILE DENTRY INODE TYPE PATH
0 ffff880fde784d00 ffff880daee9c900 ffff880fd3614470 FIFO
1 ffff881223615400 ffff881293e6c180 ffff881fe5ea7d78 FIFO
2 ffff881223615400 ffff881293e6c180 ffff881fe5ea7d78 FIFO
3 ffff881f13757a00 ffff880191c16d80 ffff880fd3612da0 FIFO
crash> kmem -S |grep invalid
kmem: invalid kernel virtual address: ffff00088fec6500 type: "get_freepointer"
kmem: invalid kernel virtual address: ffff00088fec6500 type: "get_freepointer"
crash> struct kmem_cache
struct kmem_cache {
struct kmem_cache_cpu *cpu_slab;
unsigned long flags;
unsigned long min_partial;
int size;
int object_size;
int offset;
int cpu_partial;
struct kmem_cache_order_objects oo;
struct kmem_cache_order_objects max;
struct kmem_cache_order_objects min;
gfp_t allocflags;
int refcount;
void (*ctor)(void *);
int inuse;
int align;
int reserved;
const char *name;
struct list_head list;
struct kobject kobj;
struct memcg_cache_params *memcg_params;
int max_attr_size;
int remote_node_defrag_ratio;
struct kmem_cache_node *node[64];
}
SIZE: 0x2c8
crash> struct kmem_cache.offset
struct kmem_cache {
[0x20] int offset;
}
# disable apparmor, GRUB_CMDLINE_LINUX_DEFAULT="quiet splash apparmor=0"
crash> bt -l
PID: 47656 TASK: ffff880115fa3000 CPU: 14 COMMAND: "make"
#0 [ffff881f136ed960] machine_kexec at ffffffff8104d9a1
/build/buildd/linux-lts-trusty-3.13.0/arch/x86/kernel/machine_kexec_64.c: 266
#1 [ffff881f136ed9d0] crash_kexec at ffffffff810ec218
/build/buildd/linux-lts-trusty-3.13.0/kernel/kexec.c: 1099
#2 [ffff881f136edaa0] oops_end at ffffffff81765ec8
/build/buildd/linux-lts-trusty-3.13.0/arch/x86/kernel/dumpstack.c: 230
#3 [ffff881f136edad0] die at ffffffff81018648
/build/buildd/linux-lts-trusty-3.13.0/arch/x86/kernel/dumpstack.c: 310
#4 [ffff881f136edb00] do_general_protection at ffffffff817657c0
/build/buildd/linux-lts-trusty-3.13.0/arch/x86/kernel/traps.c: 304
#5 [ffff881f136edb30] general_protection at ffffffff817650c8
/build/buildd/linux-lts-trusty-3.13.0/arch/x86/kernel/entry_64.S: 1514
[exception RIP: kmem_cache_alloc_trace+0x7c]
RIP: ffffffff811af2fc RSP: ffff881f136edbe8 RFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff880fff410830 RCX: 0000000000b735a1
RDX: 0000000000b735a0 RSI: 00000000000080d0 RDI: 0000000000016240
RBP: ffff881f136edc38 R8: ffff88203f896240 R9: ffffffff8132906a
R10: 8080808080808080 R11: 0000000000000000 R12: ffff880fff403c00
R13: ffff00088fec6500 R14: 00000000000080d0 R15: ffff880fff403c00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#6 [ffff881f136edc40] apparmor_file_alloc_security at ffffffff8132906a
/build/buildd/linux-lts-trusty-3.13.0/security/apparmor/include/file.h: 61
#7 [ffff881f136edc70] security_file_alloc at ffffffff812e9c56
/build/buildd/linux-lts-trusty-3.13.0/security/security.c: 691
#8 [ffff881f136edc80] get_empty_filp at ffffffff811cc8c0
/build/buildd/linux-lts-trusty-3.13.0/fs/file_table.c: 130
#9 [ffff881f136edcb0] path_openat at ffffffff811da2b8
/build/buildd/linux-lts-trusty-3.13.0/fs/namei.c: 3169
#10 [ffff881f136edd70] do_filp_open at ffffffff811db5c3
/build/buildd/linux-lts-trusty-3.13.0/fs/namei.c: 3238
#11 [ffff881f136ede40] open_exec at ffffffff811d1ec5
/build/buildd/linux-lts-trusty-3.13.0/fs/exec.c: 767
#12 [ffff881f136edea0] do_execve_common at ffffffff811d242b
/build/buildd/linux-lts-trusty-3.13.0/fs/exec.c: 1500
#13 [ffff881f136edf10] do_execve at ffffffff811d26b8
/build/buildd/linux-lts-trusty-3.13.0/fs/exec.c: 1586
#14 [ffff881f136edf20] sys_execve at ffffffff811d293d
/build/buildd/linux-lts-trusty-3.13.0/fs/exec.c: 1688
#15 [ffff881f136edf50] stub_execve at ffffffff8176ded9
/build/buildd/linux-lts-trusty-3.13.0/arch/x86/kernel/entry_64.S: 870
RIP: 00002b8eeccad427 RSP: 00007ffff54ea138 RFLAGS: 00000246
RAX: 000000000000003b RBX: 0000000000000000 RCX: ffffffffffffffff
RDX: 00000000026029c0 RSI: 0000000002603e80 RDI: 00007ffff54ea80c
RBP: 00007ffff54ea8b0 R8: 6e69622f37327970 R9: 2e322e3170735f33
R10: 756265642f343431 R11: 0000000000000246 R12: 00000000026029c0
R13: 0000000002474692 R14: 0000000002474696 R15: 00007ffff54ea80c
ORIG_RAX: 000000000000003b CS: 0033 SS: 002b
crash> dis kmem_cache_alloc_trace+0x7c
0xffffffff811af2fc <kmem_cache_alloc_trace+0x7c>: mov 0x0(%r13,%rax,1),%rbx
crash> dis kmem_cache_alloc_trace
0xffffffff811af280 <kmem_cache_alloc_trace>: nopl 0x0(%rax,%rax,1)
0xffffffff811af285 <kmem_cache_alloc_trace+0x5>: push %rbp
0xffffffff811af286 <kmem_cache_alloc_trace+0x6>: mov %rsp,%rbp
0xffffffff811af289 <kmem_cache_alloc_trace+0x9>: push %r15
0xffffffff811af28b <kmem_cache_alloc_trace+0xb>: mov %rdi,%r15
0xffffffff811af28e <kmem_cache_alloc_trace+0xe>: push %r14
0xffffffff811af290 <kmem_cache_alloc_trace+0x10>: mov %esi,%r14d
0xffffffff811af293 <kmem_cache_alloc_trace+0x13>: push %r13
0xffffffff811af295 <kmem_cache_alloc_trace+0x15>: push %r12
0xffffffff811af297 <kmem_cache_alloc_trace+0x17>: push %rbx
0xffffffff811af298 <kmem_cache_alloc_trace+0x18>: sub $0x28,%rsp
0xffffffff811af29c <kmem_cache_alloc_trace+0x1c>: mov 0xb65ece(%rip),%eax # 0xffffffff81d15170 <gfp_allowed_mask>
0xffffffff811af2a2 <kmem_cache_alloc_trace+0x22>: mov %rdx,-0x40(%rbp)
0xffffffff811af2a6 <kmem_cache_alloc_trace+0x26>: mov 0x8(%rbp),%r9
0xffffffff811af2aa <kmem_cache_alloc_trace+0x2a>: and %esi,%eax
0xffffffff811af2ac <kmem_cache_alloc_trace+0x2c>: test $0x10,%al
0xffffffff811af2ae <kmem_cache_alloc_trace+0x2e>: je 0xffffffff811af2bd <kmem_cache_alloc_trace+0x3d>
0xffffffff811af2b0 <kmem_cache_alloc_trace+0x30>: mov %r9,-0x50(%rbp)
0xffffffff811af2b4 <kmem_cache_alloc_trace+0x34>: callq 0xffffffff81760d30 <_cond_resched>
0xffffffff811af2b9 <kmem_cache_alloc_trace+0x39>: mov -0x50(%rbp),%r9
0xffffffff811af2bd <kmem_cache_alloc_trace+0x3d>: nopl 0x0(%rax,%rax,1)
0xffffffff811af2c2 <kmem_cache_alloc_trace+0x42>: mov %r15,%r12
0xffffffff811af2c5 <kmem_cache_alloc_trace+0x45>: mov (%r12),%r8
0xffffffff811af2c9 <kmem_cache_alloc_trace+0x49>: add %gs:0xcce8,%r8
0xffffffff811af2d2 <kmem_cache_alloc_trace+0x52>: mov 0x8(%r8),%rdx
0xffffffff811af2d6 <kmem_cache_alloc_trace+0x56>: mov (%r8),%r13
0xffffffff811af2d9 <kmem_cache_alloc_trace+0x59>: mov 0x10(%r8),%rax
0xffffffff811af2dd <kmem_cache_alloc_trace+0x5d>: test %r13,%r13
0xffffffff811af2e0 <kmem_cache_alloc_trace+0x60>: je 0xffffffff811af440 <kmem_cache_alloc_trace+0x1c0>
0xffffffff811af2e6 <kmem_cache_alloc_trace+0x66>: test %rax,%rax
0xffffffff811af2e9 <kmem_cache_alloc_trace+0x69>: je 0xffffffff811af440 <kmem_cache_alloc_trace+0x1c0>
0xffffffff811af2ef <kmem_cache_alloc_trace+0x6f>: movslq 0x20(%r12),%rax
0xffffffff811af2f4 <kmem_cache_alloc_trace+0x74>: mov (%r12),%rdi
0xffffffff811af2f8 <kmem_cache_alloc_trace+0x78>: lea 0x1(%rdx),%rcx
0xffffffff811af2fc <kmem_cache_alloc_trace+0x7c>: mov 0x0(%r13,%rax,1),%rbx
0xffffffff811af301 <kmem_cache_alloc_trace+0x81>: mov %r13,%rax
0xffffffff811af304 <kmem_cache_alloc_trace+0x84>: cmpxchg16b %gs:(%rdi)
0xffffffff811af309 <kmem_cache_alloc_trace+0x89>: sete %al
0xffffffff811af30c <kmem_cache_alloc_trace+0x8c>: test %al,%al
0xffffffff811af30e <kmem_cache_alloc_trace+0x8e>: je 0xffffffff811af2c5 <kmem_cache_alloc_trace+0x45>
0xffffffff811af310 <kmem_cache_alloc_trace+0x90>: movslq 0x20(%r12),%rax
0xffffffff811af315 <kmem_cache_alloc_trace+0x95>: prefetcht0 (%rbx,%rax,1)
0xffffffff811af319 <kmem_cache_alloc_trace+0x99>: test %r13,%r13
0xffffffff811af31c <kmem_cache_alloc_trace+0x9c>: jne 0xffffffff811af418 <kmem_cache_alloc_trace+0x198>
0xffffffff811af322 <kmem_cache_alloc_trace+0xa2>: movslq 0x18(%r15),%r15
0xffffffff811af326 <kmem_cache_alloc_trace+0xa6>: mov 0x8(%rbp),%rax
0xffffffff811af32a <kmem_cache_alloc_trace+0xaa>: mov %rax,-0x48(%rbp)
0xffffffff811af32e <kmem_cache_alloc_trace+0xae>: mov %r15,-0x38(%rbp)
0xffffffff811af332 <kmem_cache_alloc_trace+0xb2>: nopl 0x0(%rax,%rax,1)
0xffffffff811af337 <kmem_cache_alloc_trace+0xb7>: add $0x28,%rsp
0xffffffff811af33b <kmem_cache_alloc_trace+0xbb>: mov %r13,%rax
0xffffffff811af33e <kmem_cache_alloc_trace+0xbe>: pop %rbx
0xffffffff811af33f <kmem_cache_alloc_trace+0xbf>: pop %r12
0xffffffff811af341 <kmem_cache_alloc_trace+0xc1>: pop %r13
0xffffffff811af343 <kmem_cache_alloc_trace+0xc3>: pop %r14
0xffffffff811af345 <kmem_cache_alloc_trace+0xc5>: pop %r15
0xffffffff811af347 <kmem_cache_alloc_trace+0xc7>: pop %rbp
0xffffffff811af348 <kmem_cache_alloc_trace+0xc8>: retq
0xffffffff811af349 <kmem_cache_alloc_trace+0xc9>: nopl 0x0(%rax)
0xffffffff811af350 <kmem_cache_alloc_trace+0xd0>: test $0x800,%r14d
0xffffffff811af357 <kmem_cache_alloc_trace+0xd7>: mov %r15,%r12
0xffffffff811af35a <kmem_cache_alloc_trace+0xda>: jne 0xffffffff811af2c5 <kmem_cache_alloc_trace+0x45>
0xffffffff811af360 <kmem_cache_alloc_trace+0xe0>: mov %gs:0xb7e0,%eax
0xffffffff811af368 <kmem_cache_alloc_trace+0xe8>: test $0x1fff00,%eax
0xffffffff811af36d <kmem_cache_alloc_trace+0xed>: jne 0xffffffff811af2c5 <kmem_cache_alloc_trace+0x45>
0xffffffff811af373 <kmem_cache_alloc_trace+0xf3>: mov %gs:0xb800,%rax
0xffffffff811af37c <kmem_cache_alloc_trace+0xfc>: cmpq $0x0,0x2a8(%rax)
0xffffffff811af384 <kmem_cache_alloc_trace+0x104>: je 0xffffffff811af2c5 <kmem_cache_alloc_trace+0x45>
0xffffffff811af38a <kmem_cache_alloc_trace+0x10a>: testb $0x20,0x16(%rax)
0xffffffff811af38e <kmem_cache_alloc_trace+0x10e>: jne 0xffffffff811af2c5 <kmem_cache_alloc_trace+0x45>
0xffffffff811af394 <kmem_cache_alloc_trace+0x114>: mov 0x8(%rax),%rdx
0xffffffff811af398 <kmem_cache_alloc_trace+0x118>: mov 0x10(%rdx),%rdx
0xffffffff811af39c <kmem_cache_alloc_trace+0x11c>: and $0x4,%edx
0xffffffff811af39f <kmem_cache_alloc_trace+0x11f>: jne 0xffffffff811af45b <kmem_cache_alloc_trace+0x1db>
0xffffffff811af3a5 <kmem_cache_alloc_trace+0x125>: mov %r14d,%esi
0xffffffff811af3a8 <kmem_cache_alloc_trace+0x128>: mov %r15,%rdi
0xffffffff811af3ab <kmem_cache_alloc_trace+0x12b>: mov %r9,-0x50(%rbp)
0xffffffff811af3af <kmem_cache_alloc_trace+0x12f>: callq 0xffffffff811bc810 <__memcg_kmem_get_cache>
0xffffffff811af3b4 <kmem_cache_alloc_trace+0x134>: mov -0x50(%rbp),%r9
0xffffffff811af3b8 <kmem_cache_alloc_trace+0x138>: mov %rax,%r12
0xffffffff811af3bb <kmem_cache_alloc_trace+0x13b>: jmpq 0xffffffff811af2c5 <kmem_cache_alloc_trace+0x45>
0xffffffff811af3c0 <kmem_cache_alloc_trace+0x140>: mov 0xb43989(%rip),%rbx # 0xffffffff81cf2d50 <__tracepoint_kmalloc+0x30>
0xffffffff811af3c7 <kmem_cache_alloc_trace+0x147>: test %rbx,%rbx
0xffffffff811af3ca <kmem_cache_alloc_trace+0x14a>: je 0xffffffff811af40d <kmem_cache_alloc_trace+0x18d>
0xffffffff811af3cc <kmem_cache_alloc_trace+0x14c>: mov (%rbx),%rax
0xffffffff811af3cf <kmem_cache_alloc_trace+0x14f>: lea 0x10(%rbx),%r15
0xffffffff811af3d3 <kmem_cache_alloc_trace+0x153>: mov %rbx,%r12
0xffffffff811af3d6 <kmem_cache_alloc_trace+0x156>: nopw %cs:0x0(%rax,%rax,1)
0xffffffff811af3e0 <kmem_cache_alloc_trace+0x160>: mov 0x8(%r12),%rdi
0xffffffff811af3e5 <kmem_cache_alloc_trace+0x165>: add $0x10,%r12
0xffffffff811af3e9 <kmem_cache_alloc_trace+0x169>: mov %r14d,%r9d
0xffffffff811af3ec <kmem_cache_alloc_trace+0x16c>: mov -0x38(%rbp),%r8
0xffffffff811af3f0 <kmem_cache_alloc_trace+0x170>: mov -0x40(%rbp),%rcx
0xffffffff811af3f4 <kmem_cache_alloc_trace+0x174>: mov %r13,%rdx
0xffffffff811af3f7 <kmem_cache_alloc_trace+0x177>: mov -0x48(%rbp),%rsi
0xffffffff811af3fb <kmem_cache_alloc_trace+0x17b>: callq *%rax
0xffffffff811af3fd <kmem_cache_alloc_trace+0x17d>: mov %r12,%rax
0xffffffff811af400 <kmem_cache_alloc_trace+0x180>: sub %rbx,%rax
0xffffffff811af403 <kmem_cache_alloc_trace+0x183>: mov -0x10(%r15,%rax,1),%rax
0xffffffff811af408 <kmem_cache_alloc_trace+0x188>: test %rax,%rax
0xffffffff811af40b <kmem_cache_alloc_trace+0x18b>: jne 0xffffffff811af3e0 <kmem_cache_alloc_trace+0x160>
0xffffffff811af40d <kmem_cache_alloc_trace+0x18d>: jmpq 0xffffffff811af337 <kmem_cache_alloc_trace+0xb7>
0xffffffff811af412 <kmem_cache_alloc_trace+0x192>: nopw 0x0(%rax,%rax,1)
0xffffffff811af418 <kmem_cache_alloc_trace+0x198>: test $0x8000,%r14d
0xffffffff811af41f <kmem_cache_alloc_trace+0x19f>: je 0xffffffff811af322 <kmem_cache_alloc_trace+0xa2>
0xffffffff811af425 <kmem_cache_alloc_trace+0x1a5>: movslq 0x1c(%r12),%rdx
0xffffffff811af42a <kmem_cache_alloc_trace+0x1aa>: xor %esi,%esi
0xffffffff811af42c <kmem_cache_alloc_trace+0x1ac>: mov %r13,%rdi
0xffffffff811af42f <kmem_cache_alloc_trace+0x1af>: callq 0xffffffff81387bf0 <__memset>
0xffffffff811af434 <kmem_cache_alloc_trace+0x1b4>: jmpq 0xffffffff811af322 <kmem_cache_alloc_trace+0xa2>
0xffffffff811af439 <kmem_cache_alloc_trace+0x1b9>: nopl 0x0(%rax)
0xffffffff811af440 <kmem_cache_alloc_trace+0x1c0>: mov %r9,%rcx
0xffffffff811af443 <kmem_cache_alloc_trace+0x1c3>: mov $0xffffffff,%edx
0xffffffff811af448 <kmem_cache_alloc_trace+0x1c8>: mov %r14d,%esi
0xffffffff811af44b <kmem_cache_alloc_trace+0x1cb>: mov %r12,%rdi
0xffffffff811af44e <kmem_cache_alloc_trace+0x1ce>: callq 0xffffffff81750117 <__slab_alloc>
0xffffffff811af453 <kmem_cache_alloc_trace+0x1d3>: mov %rax,%r13
0xffffffff811af456 <kmem_cache_alloc_trace+0x1d6>: jmpq 0xffffffff811af319 <kmem_cache_alloc_trace+0x99>
0xffffffff811af45b <kmem_cache_alloc_trace+0x1db>: testb $0x1,0x5f1(%rax)
0xffffffff811af462 <kmem_cache_alloc_trace+0x1e2>: je 0xffffffff811af3a5 <kmem_cache_alloc_trace+0x125>
0xffffffff811af468 <kmem_cache_alloc_trace+0x1e8>: jmpq 0xffffffff811af2c5 <kmem_cache_alloc_trace+0x45>
0xffffffff811af46d <kmem_cache_alloc_trace+0x1ed>: nopl (%rax)
The slab_alloc code
/*
* Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
* have the fastpath folded into their functions. So no function call
* overhead for requests that can be satisfied on the fastpath.
*
* The fastpath works by first checking if the lockless freelist can be used.
* If not then __slab_alloc is called for slow processing.
*
* Otherwise we can simply pick the next object from the lockless free list.
*/
static __always_inline void *slab_alloc_node(struct kmem_cache *s,
gfp_t gfpflags, int node, unsigned long addr)
{
void **object;
struct kmem_cache_cpu *c;
struct page *page;
unsigned long tid;
if (slab_pre_alloc_hook(s, gfpflags))
return NULL;
s = memcg_kmem_get_cache(s, gfpflags);
redo:
/*
* Must read kmem_cache cpu data via this cpu ptr. Preemption is
* enabled. We may switch back and forth between cpus while
* reading from one cpu area. That does not matter as long
* as we end up on the original cpu again when doing the cmpxchg.
*
* Preemption is disabled for the retrieval of the tid because that
* must occur from the current processor. We cannot allow rescheduling
* on a different processor between the determination of the pointer
* and the retrieval of the tid.
*/
preempt_disable();
c = __this_cpu_ptr(s->cpu_slab);
/*
* The transaction ids are globally unique per cpu and per operation on
* a per cpu queue. Thus they can be guarantee that the cmpxchg_double
* occurs on the right processor and that there was no operation on the
* linked list in between.
*/
tid = c->tid;
preempt_enable();
object = c->freelist;
page = c->page;
if (unlikely(!object || !node_match(page, node)))
object = __slab_alloc(s, gfpflags, node, addr, c);
else {
void *next_object = get_freepointer_safe(s, object); /* this reads per-cpu data under the matching NUMA node, so it does not conflict with the this_cpu_cmpxchg_double below */
/*
* The cmpxchg will only match if there was no additional
* operation and if we are on the right processor.
*
* The cmpxchg does the following atomically (without lock
* semantics!)
* 1. Relocate first pointer to the current per cpu area.
* 2. Verify that tid and freelist have not been changed
* 3. If they were not changed replace tid and freelist
*
* Since this is without lock semantics the protection is only
* against code executing on this cpu *not* from access by
* other cpus.
*/
if (unlikely(!this_cpu_cmpxchg_double(
s->cpu_slab->freelist, s->cpu_slab->tid,
object, tid,
next_object, next_tid(tid)))) {
note_cmpxchg_failure("slab_alloc", s, tid);
goto redo;
}
prefetch_freepointer(s, next_object);
stat(s, ALLOC_FASTPATH);
}
if (unlikely(gfpflags & __GFP_ZERO) && object)
memset(object, 0, s->object_size);
slab_post_alloc_hook(s, gfpflags, object);
return object;
}
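For an explanation of the code see: http://blog.chinaunix.net/uid-26859697-id-5498373.html
this_cpu_cmpxchg_double() is an atomic operation that does three things: 1) it relocates the first pointer to the current CPU's per-cpu area; 2) it verifies that tid and freelist have not been modified; 3) if they are unmodified, that is, equal to the expected values, it is certain that this slab allocation was not migrated to another CPU, and the new tid and freelist values are written over the old ones.
Expanded into C, what this_cpu_cmpxchg_double() does is: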
if ((__this_cpu_ptr(s->cpu_slab->freelist) == object) && (__this_cpu_ptr(s->cpu_slab->tid) == tid))
{
__this_cpu_ptr(s->cpu_slab->freelist) = next_object;
__this_cpu_ptr(s->cpu_slab->tid) = next_tid(tid);
return true;
}
else
{
return false;
}
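The role of tid is easiest to see when the freelist head looks unchanged but an operation happened in between. The following single-threaded model is a sketch under my own assumptions: the struct and function names are invented, the comparison is not actually atomic here, and tid + 1 stands in for the kernel's next_tid().

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the freelist/tid pair in struct kmem_cache_cpu. */
struct cpu_slab_model {
    void *freelist;
    unsigned long tid;
};

/*
 * Model of the double compare-and-exchange: update both fields only if
 * both still hold the values observed earlier.
 */
static bool cmpxchg_double_model(struct cpu_slab_model *c,
                                 void *old_freelist, unsigned long old_tid,
                                 void *new_freelist, unsigned long new_tid)
{
    if (c->freelist != old_freelist || c->tid != old_tid)
        return false;
    c->freelist = new_freelist;
    c->tid = new_tid;
    return true;
}
```

If an allocation and a free slip in between reading the pair and attempting the exchange, the freelist head can end up pointing at the same object again while tid has advanced. The model then returns false because of the tid mismatch, and the caller goes back to redo instead of installing a stale next pointer.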
The CONFIG_DEBUG_PAGEALLOC option
The get_freepointer_safe function below, where the fault was found, contains CONFIG_DEBUG_PAGEALLOC (debug_pagealloc=1), so let's look at this option.
static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
{
void *p;
#ifdef CONFIG_DEBUG_PAGEALLOC
probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
#else
p = get_freepointer(s, object);
#endif
return p;
}
See: https://lwn.net/Articles/321595/. It cannot detect reads of invalid memory, only writes, and the error shows up only much later, when something finally writes there.
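To see why a corrupted object pointer makes get_freepointer fault, it helps to remember that slub stores the pointer to the next free object inside the free object itself, at s->offset. The user-space model below is a sketch of that layout; the *_model names and the two-field cache struct are my own simplifications of struct kmem_cache.

```c
#include <stddef.h>

/* The two struct kmem_cache fields that matter for walking the free list. */
struct cache_model {
    size_t size;     /* distance between consecutive objects in a slab */
    size_t offset;   /* where the free pointer lives inside a free object */
};

/* Read the "next free object" pointer stored inside a free object. */
static void *get_freepointer_model(const struct cache_model *s, void *object)
{
    return *(void **)((char *)object + s->offset);
}

/* Store the "next free object" pointer inside a free object. */
static void set_freepointer_model(const struct cache_model *s, void *object, void *fp)
{
    *(void **)((char *)object + s->offset) = fp;
}

/* Thread all objects of a slab buffer into a free list and return its head. */
static void *init_freelist_model(const struct cache_model *s, void *slab, size_t nr_objects)
{
    char *base = slab;

    for (size_t i = 0; i < nr_objects; i++) {
        void *next = (i + 1 < nr_objects) ? base + (i + 1) * s->size : NULL;
        set_freepointer_model(s, base + i * s->size, next);
    }
    return nr_objects ? base : NULL;
}

/* Walk the free list and count its objects. */
static size_t count_free_model(const struct cache_model *s, void *head)
{
    size_t n = 0;

    for (void *p = head; p != NULL; p = get_freepointer_model(s, p))
        n++;
    return n;
}
```

In the crash, the faulting instruction mov 0x0(%r13,%rax,1),%rbx is exactly this read: R13 holds the object pointer taken from the per-cpu freelist (ffff00088fec6500) and RAX holds s->offset (0). Once any link in the chain is overwritten, the next allocation dereferences garbage.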
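Kernel preemption and per-cpu data
Synchronization has two levels: between different CPUs, and on the same CPU. (Although the per-cpu data read by get_freepointer_safe is never touched by other CPUs, this_cpu_cmpxchg_double is an optimistic lock and may decline to update freelist/tid, for example when the scheduler ran a slab object free on the same CPU in between; in that case the optimistic get_freepointer_safe would have accessed a wrong memory address.) The first level, between different CPUs, is handled automatically by the per-cpu mechanism. The second level, on the same CPU, can still run into synchronization problems when preemption is not disabled, so how does the kernel deal with it?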
static __always_inline void slab_free(struct kmem_cache *s,
struct page *page, void *x, unsigned long addr){
...
redo:
preempt_disable();
c = __this_cpu_ptr(s->cpu_slab);
tid = c->tid;
preempt_enable();
if (likely(page == c->page)) {
set_freepointer(s, object, c->freelist);
static __always_inline void *slab_alloc_node(struct kmem_cache *s,
gfp_t gfpflags, int node, unsigned long addr){
...
redo:
preempt_disable();
c = __this_cpu_ptr(s->cpu_slab);
tid = c->tid;
preempt_enable();
object = c->freelist;
page = c->page;
if (unlikely(!object || !node_match(page, node)))
object = __slab_alloc(s, gfpflags, node, addr, c);
else {
void *next_object = get_freepointer_safe(s, object);
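Both slab_alloc_node and slab_free disable preemption while they read the per-cpu slab pointer and tid, and they do so in matching pairs, so the kernel should have no synchronization problem here.
Next steps
The kernel itself appears to be fine. The cause may be an application-level user of slab that frees an object and then keeps referencing it, but the crash dump alone cannot tell which user is responsible, so the investigation has to continue with kasan. (Options such as DEBUG_PAGEALLOC are ineffective for reads and only catch writes, so the actual crash happens long after the memory error and is of little diagnostic value.) kasan reports errors much closer to the point where they occur.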