GitHub - aliyun/surftrace: surftrace is a tool that allows you to surf the linux kernelsurftrace is a tool that allows you to surf the linux kernel - GitHub - aliyun/surftrace: surftrace is a tool that allows you to surf the linux kernelhttps://github.com/aliyun/surftrace
surftrace是在ftrace和libbpf基础上封装的一系列工具集,用于trace内核信息,当前发行版主要包含 surftrace、surfGuide、pylcc三大个工具
ftrace是一个内核中的追踪器,用于帮助系统开发者或设计者查看内核运行情况,它可以被用来调试或者分析延迟/性能等常见问题。早期 ftrace 是一个 function tracer,仅能够记录内核的函数调用流程。如今ftrace已经成为一个开发框架,从2.6内核开始引入,是一套公认安全、可靠、高效的内核数据获取方式。
ftrace对使用者的要求比较高,以对内核符号 wake_up_new_task 进行trace,同时要获取入参(struct task_struct *)->comm 成员信息为例,启动配置需要经历三个步骤:
echo 'p:f0 wake_up_new_task comm=+0x678(%di):string' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
要想停止需要继续配置如下:
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo -:f0 >> /sys/kernel/debug/tracing/kprobe_events echo 0 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
一共需要六个步骤。其中,最困难的是第一个参数解析步骤。通常情况下,需要使用gdb 加载对应内核vmlinux, 对 struct task_struct 结构体中 comm成员进行偏移计算。上述方法如果不经常使用,重新手工操作的时间成本非常高,导致真正直接采用ftrace对内核信息进行采集的案例非常少,相关资料文献也匮乏。
surftrace的主要目标是为了降低内核trace难度,达到快速高效获取内核信息目标。综合来说要达到以下效果:
使用surftrace,需要满足以下条件:
surftrace支持 remote(默认),local和gdb三种表达式解析器,要求分别如下:
我们以龙蜥 4.19.91-24.8.an8.x86_64内核为例,需要root用户,执行以下命令进行安装:
pip3 install surftrace
Collecting surftrace
Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/b9/a2/f7e04bb8ebb12e6517162a70886e3ffe8d466437b15624590c9301fdcc52/surftrace-0.2.tar.gz
Building wheels for collected packages: surftrace
Running setup.py bdist_wheel for surftrace ... done
Stored in directory: /root/.cache/pip/wheels/cf/28/93/187f359be189bf0bf4a70197c53519c6ca54ffb957bcbebf5a
Successfully built surftrace
Installing collected packages: surftrace
Successfully installed surftrace-0.2
检查安装是否成功
surftrace --help
usage: surftrace [-h] [-v VMLINUX] [-m MODE] [-d DB] [-r RIP] [-f FILE]
[-g GDB] [-F FUNC] [-o OUTPUT] [-l LINE] [-a ARCH] [-s] [-S]
[traces [traces ...]]
Trace ftrace kprobe events.
positional arguments:
traces set trace args.
optional arguments:
-h, --help show this help message and exit
-v VMLINUX, --vmlinux VMLINUX
set vmlinux path.
-m MODE, --mode MODE set arg parser, fro
-d DB, --db DB set local db path.
-r RIP, --rip RIP set remote server ip, remote mode only.
-f FILE, --file FILE set input args path.
-g GDB, --gdb GDB set gdb exe file path.
-F FUNC, --func FUNC disasassemble function.
-o OUTPUT, --output OUTPUT
set output bash file
-l LINE, --line LINE get file disasemble info
-a ARCH, --arch ARCH set architecture.
-s, --stack show call stacks.
-S, --show only show expressions.
examples:
接下来我们以 以下两个常用内核符号为例,它的原型定义如下:
void wake_up_new_task(struct task_struct *p); struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op);
surftrace 'p wake_up_new_task' 'r wake_up_new_task' echo 'p:f0 wake_up_new_task' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 'r:f1 wake_up_new_task' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f1/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on surftrace-2336 [001] .... 1447.877666: f0: (wake_up_new_task+0x0/0x280) surftrace-2336 [001] d... 1447.877670: f1: (_do_fork+0x153/0x3d0 <- wake_up_new_task)
示例中入参有两个表达式,所有表达式要用单引号括起来。
后面的wake_up_new_task是要trace的函数符号,这个符号必须要在tracing/available_filter_functions 中可以找到的。
要获取 do_filp_open 函数 第一个入参dfd,它的数据类型是:int。
- 命令:surftrace 'p do_filp_open dfd=%0'
surftrace 'p do_filp_open dfd=%0' echo 'p:f0 do_filp_open dfd=%di:u32' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on surftrace-2435 [001] .... 2717.606277: f0: (do_filp_open+0x0/0x100) dfd=4294967196 AliYunDun-1812 [000] .... 2717.655955: f0: (do_filp_open+0x0/0x100) dfd=4294967196 AliYunDun-1812 [000] .... 2717.856227: f0: (do_filp_open+0x0/0x100) dfd=4294967196
前面打印中,dfd是按照十进制显示的,可能没有十六进制那么直观,指定十六进制的方法:
命令:surftrace 'p do_filp_open dfd=X%0'
surftrace 'p do_filp_open dfd=X%0'
echo 'p:f0 do_filp_open dfd=%di:x32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
surftrace-2459 [000] .... 3137.167885: f0: (do_filp_open+0x0/0x100) dfd=0xffffff9c
AliYunDun-1812 [001] .... 3137.171997: f0: (do_filp_open+0x0/0x100) dfd=0xffffff9c
AliYunDun-1826 [001] .... 3137.201401: f0: (do_filp_open+0x0/0x100) dfd=0xffffff9c
传参编号%前面使用了X进制类型标识符,共有SUX三种类型,分别对应有符号十进制、无符号十进制和十六进制,不指定默认为U类型。
wake_up_new_task入参类型为struct task_struct *,如果要获取入参中comm成员,即任务名,
- 命令:surftrace 'p wake_up_new_task comm=%0->comm'
surftrace 'p wake_up_new_task comm=%0->comm' echo 'p:f0 wake_up_new_task comm=+0xae0(%di):string' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on surftrace-2421 [000] .... 2368.261019: f0: (wake_up_new_task+0x0/0x280) comm="surftrace" bash-2392 [001] .... 2375.809655: f0: (wake_up_new_task+0x0/0x280) comm="bash" bash-2392 [001] .... 2379.038534: f0: (wake_up_new_task+0x0/0x280) comm="bash" bash-2392 [000] .... 2381.237443: f0: (wake_up_new_task+0x0/0x280) comm="bash"
方法和C语言获取结构体成员方法一样。
结构体类型可以级联访问:
surftrace 'p wake_up_new_task uesrs=S%0->mm->mm_users' echo 'p:f0 wake_up_new_task uesrs=+0x58(+0x850(%di)):s32' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on surftrace-2471 [001] .... 3965.234680: f0: (wake_up_new_task+0x0/0x280) uesrs=2 bash-2392 [000] .... 3970.094475: f0: (wake_up_new_task+0x0/0x280) uesrs=1 bash-2392 [000] .... 3971.954463: f0: (wake_up_new_task+0x0/0x280) uesrs=1
surftrace 'p wake_up_new_task node=%0->se.run_node.rb_left' echo 'p:f0 wake_up_new_task node=+0xa8(%di):u64' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on surftrace-2543 [001] .... 5926.605145: f0: (wake_up_new_task+0x0/0x280) node=0 bash-2392 [001] .... 5940.292293: f0: (wake_up_new_task+0x0/0x280) node=0 bash-2392 [001] .... 5945.207106: f0: (wake_up_new_task+0x0/0x280) node=0 systemd-journal-553 [000] .... 5953.211998: f0: (wake_up_new_task+0x0/0x280) node=0
过滤器需要放在表达式最后,以f:开头,可以使用括号和&& ||逻辑表达式进行组合,具体写法可以参考ftrace文档说明
命令行 surftrace 'p wake_up_new_task comm=%0->comm f:comm=="python3"'
surftrace 'p wake_up_new_task comm=%0->comm f:comm=="python3"' echo 'p:f0 wake_up_new_task comm=+0xb28(%di):string' >> /sys/kernel/debug/tracing/kprobe_events echo 'comm=="python3"' > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/filter echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on <...>-2640781 [002] .... 6305734.444913: f0: (wake_up_new_task+0x0/0x250) comm="python3" <...>-2640781 [002] .... 6305734.447806: f0: (wake_up_new_task+0x0/0x250) comm="python3" <...>-2640781 [002] .... 6305734.450897: f0: (wake_up_new_task+0x0/0x250) comm="python3"
系统会默认提供 'common_pid', 'common_preempt_count', 'common_flags', 'common_type' 这5个变量作为过滤器,该变量由系统提供,无需额外定义。
函数内部追踪需要结合函数内部汇编代码进行推导,该方法并不通用,该内容操作进供参考。反汇编do_filp_open函数
3699 in fs/namei.c 0xffffffff812adb65 <+85>: mov %r13d,%edx 0xffffffff812adb70 <+96>: or $0x40,%edx 0xffffffff812adb73 <+99>: mov %r12,%rsi 0xffffffff812adb76 <+102>: mov %rsp,%rdi 0xffffffff812adb89 <+121>: callq 0xffffffff812ac7600xffffffff812adb92 <+130>: mov %rax,%rbx 3700 in fs/namei.c 0xffffffff812adb8e <+126>: cmp $0xfffffffffffffff6,%rax 0xffffffff812adb95 <+133>: je 0xffffffff812adbb4 3701 in fs/namei.c 0xffffffff812adbb4 <+164>: mov %r13d,%edx 0xffffffff812adbb7 <+167>: mov %r12,%rsi 0xffffffff812adbba <+170>: mov %rsp,%rdi 0xffffffff812adbbd <+173>: callq 0xffffffff812ac760 0xffffffff812adbc2 <+178>: mov %rax,%rbx 0xffffffff812adbc5 <+181>: jmp 0xffffffff812adb97 3702 in fs/namei.c 0xffffffff812adb97 <+135>: cmp $0xffffffffffffff8c,%rbx 0xffffffff812adb9b <+139>: je 0xffffffff812adbc7
对应源码
struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op) { struct nameidata nd; int flags = op->lookup_flags; struct file *filp; set_nameidata(&nd, dfd, pathname); filp = path_openat(&nd, op, flags | LOOKUP_RCU); if (unlikely(filp == ERR_PTR(-ECHILD))) filp = path_openat(&nd, op, flags); if (unlikely(filp == ERR_PTR(-ESTALE))) filp = path_openat(&nd, op, flags | LOOKUP_REVAL); restore_nameidata(); return filp; }
要获取 3699行 filp = path_openat(&nd, op, flags | LOOKUP_RCU) 对应的filp的值
surftrace 'p do_filp_open+121 filp=X!(u64)%ax' echo 'p:f0 do_filp_open+121 filp=%ax:x64' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on <...>-1315799 [006] d.Z. 6314249.201847: f0: (do_filp_open+0x79/0xd0) filp=0xffff929db2819840 <...>-4006158 [014] d.Z. 6314249.326736: f0: (do_filp_open+0x79/0xd0) filp=0xffff929daeac48c0
变量表达式:filp=X!(u64)%ax 中,使用!对寄存器类型进行数据类型强制转换,括号当中的是是数据类型定义。
展开 struct file 结构体定义:
struct file { union { struct llist_node fu_llist; struct callback_head fu_rcuhead; } f_u; struct path f_path; struct inode *f_inode; const struct file_operations *f_op; spinlock_t f_lock; enum rw_hint f_write_hint; atomic_long_t f_count; unsigned int f_flags; fmode_t f_mode; struct mutex f_pos_lock; loff_t f_pos; struct fown_struct f_owner; const struct cred *f_cred; struct file_ra_state f_ra; u64 f_version; void *f_security; void *private_data; struct list_head f_ep_links; struct list_head f_tfile_llink; struct address_space *f_mapping; errseq_t f_wb_err; }
如果要获取此时的f_pos值,可以这样获取
surftrace 'p do_filp_open+121 pos=X!(struct file*)%ax->f_pos' echo 'p:f0 do_filp_open+121 pos=+0x68(%ax):x64' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on <...>-1334277 [010] d.Z. 6314645.646230: f0: (do_filp_open+0x79/0xd0) pos=0x0 <...>-2916553 [002] d.Z. 6314645.653164: f0: (do_filp_open+0x79/0xd0) pos=0x0 <...>-2916553 [002] d.Z. 6314645.653253: f0: (do_filp_open+0x79/0xd0) pos=0x0
获取方法和前面保持一致。
前文已经描述采用r 对事件类型进行标识,返回寄存器统一用$retval标识,与ftrace保持一致,以获取do_filp_open函数返回值为例:
surftrace 'r do_filp_open filp=$retval' echo 'r:f0 do_filp_open filp=$retval:u64' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on <...>-1362926 [010] d... 6315264.198718: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) filp=18446623804769722880 <...>-4006154 [008] d... 6315264.256749: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) filp=18446623804770426624 <...>-4006154 [008] d... 6315264.256776: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) filp=18446623804770425344
获取 struct file 中f_pos成员
surftrace 'r do_filp_open pos=$retval->f_pos' echo 'r:f0 do_filp_open pos=+0x68($retval):u64' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on <...>-1371049 [008] d... 6315439.568814: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) pos=0 systemd-journal-3665 [012] d... 6315439.568962: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) pos=0 systemd-journal-3665 [012] d... 6315439.571519: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) pos=0
sk_buff 是linux网络协议栈重要的结构体,通过前面的方法,并不能直接解析到我们关注的报文内容,需要进行特殊处理。以追踪icmp接收ping报文为例,我们在__netif_receive_skb_core 函数中进行probe和过滤:
surftrace 'p __netif_receive_skb_core proto=@(struct iphdr *)l3%0->protocol ip_src=@(struct iphdr *)%0->saddr ip_dst=@(struct iphdr *)l3%0->daddr data=X@(struct iphdr *)l3%0->sdata[1] f:proto==1&&ip_src==127.0.0.1' echo 'p:f0 __netif_receive_skb_core proto=+0x9(+0xe8(%di)):u8 ip_src=+0xc(+0xe8(%di)):u32 ip_dst=+0x10(+0xe8(%di)):u32 data=+0x16(+0xe8(%di)):x16' >> /sys/kernel/debug/tracing/kprobe_events echo 'proto==1&&ip_src==0x100007f' > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/filter echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on <...>-1420827 [013] ..s1 6316511.011244: f0: (__netif_receive_skb_core+0x0/0xc10) proto=1 ip_src=127.0.0.1 ip_dst=127.0.0.1 data=0x4a0d <...>-1420827 [013] ..s1 6316511.011264: f0: (__netif_receive_skb_core+0x0/0xc10) proto=1 ip_src=127.0.0.1 ip_dst=127.0.0.1 data=0x4a15
协议的获取表达式为 @(struct iphdr *)l3%0->protocol,和之前不一样的是,寄存器的结构体名左括号加了@符号进行特殊标记,表示需要用该结构体来解析skb->data指针数据,结构体名和右括号后加了l3标记(命名为右标记),表示当前skb->data指向了TCP/IP 层3位置。
trace event 信息参考 /sys/kernel/debug/tracing/events目录下的事件 描述,以追踪wakeup等待超过10ms任务为例
命令行 surftrace 'e sched/sched_stat_wait f:delay>1000000'
surftrace 'e sched/sched_stat_wait f:delay>1000000' echo 'delay>1000000' > /sys/kernel/debug/tracing/instances/surftrace/events/sched/sched_stat_wait/filter echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/sched/sched_stat_wait/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on-0 [001] dN.. 11868700.419049: sched_stat_wait: comm=h2o pid=3046552 delay=87023763 [ns] -0 [005] dN.. 11868700.419049: sched_stat_wait: comm=h2o pid=3046617 delay=87360020 [ns]
以访问 task_group_cache 这个全局符号为例,它的定义如下:
static struct kmem_cache *task_group_cache __read_mostly;
获取指针信息
surftrace 'p wake_up_new_task point=@task_group_cache' echo 'p:f0 wake_up_new_task point=@task_group_cache' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on <...>-3626383 [000] .... 12192156.289170: f0: (wake_up_new_task+0x0/0x250) point=0xffff929dc0405500 <...>-2282088 [006] .... 12192156.294148: f0: (wake_up_new_task+0x0/0x250) point=0xffff929dc0405500 <...>-3626558 [001] .... 12192156.305044: f0: (wake_up_new_task+0x0/0x250) point=0xffff929dc0405500 <...>-3626558 [001] .... 12192156.305133: f0: (wake_up_new_task+0x0/0x250) point=0xffff929dc0405500
解析变量结构体内容:
surftrace 'p wake_up_new_task name=!(struct kmem_cache*)@task_group_cache->name size=!(struct kmem_cache*)@task_group_cache->size' echo 'p:f0 wake_up_new_task name=+0x0(+0x58(@task_group_cache)):string size=+0x18(@task_group_cache):u32' >> /sys/kernel/debug/tracing/kprobe_events echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on <...>-3736660 [014] .... 12192459.242704: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704 <...>-2282088 [008] .... 12192459.266579: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704 <...>-3736816 [001] .... 12192459.278101: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704 <...>-3736816 [001] .... 12192459.278169: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704
根据ftrace要求,访问地址必须要在内核地址范围内。继续以以访问 task_group_cache 这个全局符号为例,首先获取符号地址
cat /proc/kallsyms |grep task_group_cache ffffffff8647bc30 d task_group_cache
查询命令(注意,不同内核地址不一致,不能简单复制):
surftrace 'p wake_up_new_task name=!(struct kmem_cache*)@0xffffffff8647bc30->name size=!(struct kmem_cache*)@0xffffffff8647bc30->size'
echo 'p:f0 wake_up_new_task name=+0x0(+0x58(@0xffffffff8647bc30)):string size=+0x18(@0xffffffff8647bc30):u32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
<...>-3910607 [012] .... 12193362.784157: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704
<...>-3386586 [012] .... 12193362.960034: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704
<...>-3386586 [012] .... 12193362.963222: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704
surfGuide可以直接运行,命令行已经有一些使用帮助提示。现在手头任务紧张,等有空了再补充完善吧。
安装:pip install surfGuide
然后运行 surfGuide 就可以使用了。
同上
同上上
pylcc在libbpf基础上进行封装,将复杂的编译工程交由容器执行
基本要求
执行pip install pylcc安装
git clone [email protected]:aliyun/surftrace.git
示例代码 在目录 tool/pylcc/guide下
hello.py 代码
import time from pylcc.lbcBase import ClbcBase bpfPog = r""" #include "lbc.h" SEC("kprobe/wake_up_new_task") int j_wake_up_new_task(struct pt_regs *ctx) { struct task_struct* parent = (struct task_struct *)PT_REGS_PARM1(ctx); bpf_printk("hello lcc, parent: %d\n", _(parent->tgid)); return 0; } char _license[] SEC("license") = "GPL"; """ class Chello(ClbcBase): def __init__(self): super(Chello, self).__init__("hello", bpf_str=bpfPog) while True: time.sleep(1) if __name__ == "__main__": hello = Chello() pass
6.3.1.1 bpf代码说明:
#include "vmlinux.h" #include#include #include #include
6.3.1.2、python代码实现部分说明:
python 部分代码从ClbcBase 类继承,__init__函数中,第一入参必须要指定,用于指定生成so的文件名。在执行完__init__函数后,bfp模块就已经注入到内核当中去执行了。
6.3.1.3、执行效果:
执行 python2 hello.py 运行,并查看编译结果:
#cat /sys/kernel/debug/tracing/trace_pipe <...>-1091294 [005] d... 17658161.425644: : hello lcc, parent: 106880 <...>-4142485 [003] d... 17658161.428568: : hello lcc, parent: 4142485 <...>-4142486 [002] d... 17658161.430972: : hello lcc, parent: 4142486 <...>-4142486 [002] d... 17658161.431228: : hello lcc, parent: 4142486 <...>-4142486 [002] d... 17658161.431557: : hello lcc, parent: 4142486 <...>-4142485 [003] d... 17658161.435385: : hello lcc, parent: 4142485 <...>-4142490 [000] d... 17658161.437562: : hello lcc, parent: 4142490
此时可以看到目录下新增了hello.so 文件,如果文件时间戳有更新,只要bpfProg部分内容不发生改变,就不会触发重编动作。如果bpfProg 发生变换,就会触发重新编译动作,生成新的so
代码参考 eventOut.py
import ctypes as ct from pylcc.lbcBase import ClbcBase bpfPog = r""" #include "lbc.h" #define TASK_COMM_LEN 16 struct data_t { u32 c_pid; u32 p_pid; char c_comm[TASK_COMM_LEN]; char p_comm[TASK_COMM_LEN]; }; LBC_PERF_OUTPUT(e_out, struct data_t, 128); SEC("kprobe/wake_up_new_task") int j_wake_up_new_task(struct pt_regs *ctx) { struct task_struct* parent = (struct task_struct *)PT_REGS_PARM1(ctx); struct data_t data = {}; data.c_pid = bpf_get_current_pid_tgid() >> 32; bpf_get_current_comm(&data.c_comm, TASK_COMM_LEN); data.p_pid = BPF_CORE_READ(parent, pid); bpf_core_read(&data.p_comm[0], TASK_COMM_LEN, &parent->comm[0]); bpf_perf_event_output(ctx, &e_out, BPF_F_CURRENT_CPU, &data, sizeof(data)); return 0; } char _license[] SEC("license") = "GPL"; """ class CeventOut(ClbcBase): def __init__(self): super(CeventOut, self).__init__("eventOut", bpf_str=bpfPog) def _cb(self, cpu, data, size): stream = ct.string_at(data, size) e = self.maps['e_out'].event(stream) print("current pid:%d, comm:%s. wake_up_new_task pid: %d, comm: %s" % ( e.c_pid, e.c_comm, e.p_pid, e.p_comm )) def loop(self): self.maps['e_out'].open_perf_buffer(self._cb) try: self.maps['e_out'].perf_buffer_poll() except KeyboardInterrupt: print("key interrupt.") exit() if __name__ == "__main__": e = CeventOut() e.loop()
6.3.2.1 bpf部分代码说明:
6.3.2.2 python部分代码说明
以loop函数为入口:
接下来看_cb 回调函数:
6.3.2.3 执行结果
python2 eventOut.py current pid:241808, comm:python. wake_up_new_task parent pid: 241871, comm: python current pid:1, comm:systemd. wake_up_new_task parent pid: 1, comm: systemd ……
在3.2的基础上,参考dynamicVar.py,如果只想动态过滤parent进程id为 241871,可以借鉴bcc的思路进行替换,大部分代码与eventOut.py一致,首先在bpfProg代码添加了过滤动作:
…… u32 pid = BPF_CORE_READ(parent, pid); if (pid != FILTER_PID) { return 0; } ……
然后在main入口处进行替换
if __name__ == "__main__": bpfPog = bpfPog.replace("FILTER_PID", sys.argv[1]) e = CdynamicVar() e.loop
将要过滤的参数传入,执行效果
python2 dynamicVar.py 241871 current pid:241808, comm:python. wake_up_new_task pid: 241871, comm: python current pid:241808, comm:python. wake_up_new_task pid: 241871, comm: python current pid:241808, comm:python. wake_up_new_task pid: 241871, comm: python
代码参考 hashMap.py,大部分代码与eventOut.py一致。
6.3.4.1 bpf 部分代码
定义hashmap
LBC_HASH(pid_cnt, u32, u32, 1024);
使用方法和libbfp一致
u32 *pcnt, cnt; pcnt = bpf_map_lookup_elem(&pid_cnt, &pid); cnt = pcnt ? *pcnt + 1 : 1; bpf_map_update_elem(&pid_cnt, &pid, &cnt, BPF_ANY);
6.3.4.2 python部分代码
查询maps的位置在exit退出之前打印所有信息
…… dMap = self.maps['pid_cnt'] print(dMap.get()) exit()
哈希表对象可以直接由 self.maps['pid_cnt'] 方法获取到,可以调用get函数,获取到dict对象。
除了BPF_MAP_TYPE_HASH,lcc当前还支持BPF_MAP_TYPE_LRU_HASH、BPF_MAP_TYPE_PERCPU_HASH、 BPF_MAP_TYPE_LRU_PERCPU_HASH等类型,更多类型支持在完善中,敬请期待。
获取内核调用栈是bpf一项非常重要的调试功能,参考 callStack.py,大部分代码与eventOut.py一致。
6.3.5.1、bpf部分代码说明
外传的数据结构体中增加stack_id成员,接下来定义一个call stack成员
struct data_t { u32 c_pid; u32 p_pid; char c_comm[TASK_COMM_LEN]; char p_comm[TASK_COMM_LEN]; u32 stack_id; }; LBC_PERF_OUTPUT(e_out, struct data_t, 128); LBC_STACK(call_stack,32);
在处理函数中记录call stack
data.stack_id = bpf_get_stackid(ctx, &call_stack, KERN_STACKID_FLAGS);
6.3.5.2、python部分代码
通过getStacks传入stack_id,即可获取调用栈符号数组,然后列出来即可
stacks = self.maps['call_stack'].getStacks(e.stack_id) print("call trace:") for s in stacks: print(s)
6.3.5.3、执行结果
python callStack.py remote server compile success. current pid:1, comm:systemd. wake_up_new_task pid: 1, common: systemd call trace: startup_64 do_syscall_64 entry_SYSCALL_64_after_swapgs
参考 codeSeparate.py 和 independ.bpf.c,它的功能实现和eventOut.py 完全一致,不一样的是将python和bpf.c的功能拆分到了两个文件中去实现。 我们只需要关注下__init__函数
def __init__(self): super(codeSeparate, self).__init__("independ")
它没有了 bpf_str 入参,此时lcc会尝试从当前目录上下,去找independ.bpf.c并提请编译加载。
由于bcc 库内部集成了庞大的 LLVM/Clang 库,使其在使用过程中会遇到一些问题:
以tools/pylcc/pytool/filelife.py工具为例,与bcc/tools/filelife功能一致的前提下,性能差异对比:
pylcc | bcc | |
---|---|---|
rss | 10352 | 92288 |
vmpeak | 207444 | 369672 |
vmdata | 201284 | 363484 |
汇总对比如下表,同样的python应用,pylcc在cpu和mem等资源消耗均比bcc有较明显的优势
pylcc | bcc | |
---|---|---|
启动阶段 cpu占用 |
0% | 50%+ |
运行阶段 rss占用 |
1 | 9 |
#ifndef LBC_LBC_H #define LBC_LBC_H #define _LINUX_POSIX_TYPES_H #define __ASM_GENERIC_POSIX_TYPES_H #define PERF_MAX_STACK_DEPTH 127 #define BPF_F_FAST_STACK_CMP (1ULL << 9) #define KERN_STACKID_FLAGS (0 | BPF_F_FAST_STACK_CMP) #define USER_STACKID_FLAGS (0 | BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK) typedef unsigned long long u64; typedef signed long long s64; typedef unsigned int u32; typedef signed int s32; typedef unsigned short u16; typedef signed short s16; typedef unsigned char u8; typedef signed char s8; enum { BPF_ANY = 0, /* create new element or update existing */ BPF_NOEXIST = 1, /* create new element if it didn't exist */ BPF_EXIST = 2, /* update existing element */ BPF_F_LOCK = 4, /* spin_lock-ed map_lookup/map_update */ }; #define LBC_PERF_OUTPUT(MAPS, CELL, ENTRIES) \ struct bpf_map_def SEC("maps") MAPS = { \ .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY, \ .key_size = sizeof(int), \ .value_size = sizeof(s32), \ .max_entries = ENTRIES, \ } #define LBC_HASH(MAPS, KEY_T, VALUE_T, ENTRIES) \ struct bpf_map_def SEC("maps") MAPS = { \ .type = BPF_MAP_TYPE_HASH, \ .key_size = sizeof(KEY_T), \ .value_size = sizeof(VALUE_T), \ .max_entries = ENTRIES, \ } #define LBC_LRU_HASH(MAPS, KEY_T, VALUE_T, ENTRIES) \ struct bpf_map_def SEC("maps") MAPS = { \ .type = BPF_MAP_TYPE_LRU_HASH, \ .key_size = sizeof(KEY_T), \ .value_size = sizeof(VALUE_T), \ .max_entries = ENTRIES, \ } #define LBC_PERCPU_HASH(MAPS, KEY_T, VALUE_T, ENTRIES) \ struct bpf_map_def SEC("maps") MAPS = { \ .type = BPF_MAP_TYPE_PERCPU_HASH, \ .key_size = sizeof(KEY_T), \ .value_size = sizeof(VALUE_T), \ .max_entries = ENTRIES, \ } #define LBC_LRU_PERCPU_HASH(MAPS, KEY_T, VALUE_T, ENTRIES) \ struct bpf_map_def SEC("maps") MAPS = { \ .type = BPF_MAP_TYPE_LRU_PERCPU_HASH, \ .key_size = sizeof(KEY_T), \ .value_size = sizeof(VALUE_T), \ .max_entries = ENTRIES, \ } #define LBC_STACK(MAPS, ENTRIES) \ struct bpf_map_def SEC("maps") MAPS = { \ .type = BPF_MAP_TYPE_STACK_TRACE, \ .key_size = sizeof(u32), \ .value_size = PERF_MAX_STACK_DEPTH * sizeof(u64), \ .max_entries = ENTRIES, \ } #define _(P) ({typeof(P) val = 0; bpf_probe_read((void*)&val, sizeof(val), (const void*)&P); val;}) #include "vmlinux.h" #include#include #include #include #ifndef NULL #define NULL ((void*)0) #endif #ifndef ntohs #define ntohs(x) (0xff00 & x << 8) \ |(0x00ff & x >> 8) #endif #ifndef ntohl #define ntohl(x) (0xff000000 & x << 24) \ |(0x00ff0000 & x << 8) \ |(0x0000ff00 & x >> 8) \ |(0x000000ff & x >> 24) #endif #ifndef ntohll #define ntohll(x) ((((long long)ntohl(x))<<32) + (ntohl((x)>>32))) #endif #define BPF_F_CURRENT_CPU 0xffffffffULL #endif //LBC_LBC_H
以解析anolis发行版,rpm包名:kernel-debug-debuginfo-4.19.91-23.4.an8.x86_64.rpm为例。需要准备好一台x86_64实例,确保该实例可以访问上面的url。
下载容器镜像
docker pull liaozhaoyan/dbhive
# tree tree . └── x86_64 ├── btf │ └── anolis ├── db │ └── anolis ├── funcs │ └── anolis ├── head │ └── anolis ├── pack │ └── anolis └── vmlinux └── anolis
1.顶级目录可以定义,本例定义为dbhive;
2.一级目录为arch名,当前容器仅支持x86_64和aarch64
3.二级目录为各个功能组目录,可以执行以下命令批量创建:
export RELEASE=anolis mkdir -p btf/$RELEASE db/$RELEASE funcs/$RELEASE head/$RELEASE pack/$RELEASE vmlinux/$RELEASE
4.三级目录是发行版的名字,已经在步骤3中创建好了
docker run --net=host --privileged=true -v /root/1ext/vmhive:/home/vmhive/ --name dbhived -itd liaozhaoyan/dbhive /usr/sbin/init
docker exec -it dbhived bash cd /home/dbhive/ python3 getVmlinux.py proc kernel-debug-debuginfo-4.19.91-23.4.an8.x86_64.rpm, x86_64 4728267 blocks strip: /home/vmhive/x86_64/btf/anolis/stlpkyQL: warning: allocated section `.BTF' not in segment gen /home/vmhive/x86_64/db/anolis/info-debuginfo-4.19.91-23.4.an8.x86_64.db No symbol "__int128" in current context. failed to parse type __int128 This context has class, struct or enum irte, not a union. ……
此时开始解析所有的内核符号,解析完毕以后,会在host侧的vmhive/x86_64/db/anolis 目录下生成用于surftrace使用的db文件。