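Solving my company's worthwhile problems well has always been my professional rule, and I consider it a very fortunate thing to be able to do.
After enabling network-tracer in the datadog system-probe module, the following problem appeared: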
ERROR | (cmd/system-probe/api/module/loader.go:60 in Register) | new module `network_tracer` error: error guessing offsets: could not load bpf module for offset guessing: load BTF maps: section .maps: map tracer_status already exists
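offset-guess.o is the module that guesses offsets. My guess is that datadog does this for compatibility across Linux kernel versions, but why would compatibility code like this fail here?
Searching the code leads to the place where the error is raised: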
if err := ec.loadBTFMaps(maps); err != nil {
	return nil, fmt.Errorf("load BTF maps: %w", err)
}
The relevant part of loadBTFMaps:
v, ok := vs.Type.(*btf.Var)
if !ok {
	return fmt.Errorf("section %v: unexpected type %s", sec.Name, vs.Type)
}
name := string(v.Name)
// The BTF metadata for each Var contains the full length of the map
// declaration, so read the corresponding amount of bytes from the ELF.
// This way, we can pinpoint which map declaration contains unexpected
// (and therefore unsupported) data.
_, err := io.Copy(internal.DiscardZeroes{}, io.LimitReader(rs, int64(vs.Size)))
if err != nil {
	return fmt.Errorf("section %v: map %s: initializing BTF map definitions: %w", sec.Name, name, internal.ErrNotSupported)
}
if maps[name] != nil {
	return fmt.Errorf("section %v: map %s already exists", sec.Name, name)
}
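The duplicate check at the end of that excerpt can be restated as a small Python sketch. It is not the loader's code: the function name, the dictionary layout, and the assumption that every `.maps` section is recorded into one shared table are illustrative only.

```python
def load_btf_maps(maps, section_name, var_names):
    """Record each map name found in a section; reject a name seen before.

    `maps` is the shared table of maps collected so far (hypothetical layout:
    map name -> name of the section it came from).
    """
    for name in var_names:
        if name in maps:
            # Same wording as the message in the excerpt above.
            raise ValueError(f"section {section_name}: map {name} already exists")
        maps[name] = section_name
    return maps


maps = {}
load_btf_maps(maps, ".maps", ["tracer_status"])
try:
    # A second .maps section declaring the same map triggers the error.
    load_btf_maps(maps, ".maps", ["tracer_status"])
except ValueError as err:
    print(err)
```

The point of the sketch is only that the message appears as soon as the same map name is seen a second time, whatever caused it to be declared twice.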
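This shows that the error occurs because the eBPF .maps section shows up twice in the BTF data. To narrow it down further, I looked at the .maps section in the ELF file itself: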
readelf -x .maps -r /opt/datadog-agent/embedded/share/system-probe/ebpf/offset-guess.o
.maps appears twice, which should not happen under normal circumstances:
Hex dump of section '.maps':
0x00000000 00000000 00000000 00000000 00000000 ................
0x00000010 00000000 00000000 00000000 00000000 ................
0x00000020 00000000 00000000 ........
Hex dump of section '.maps':
0x00000000 00000000 00000000 00000000 00000000 ................
0x00000010 00000000 00000000 00000000 00000000 ................
0x00000020 00000000 00000000 ........
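To make this check repeatable, the dump headers in captured readelf output can be counted with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and it assumes the header lines have exactly the `Hex dump of section '...':` form shown above.

```python
import re


def count_section_dumps(readelf_output, section):
    """Count the 'Hex dump of section' headers for one section name."""
    header = re.compile(r"^Hex dump of section '" + re.escape(section) + r"':$")
    return sum(
        1 for line in readelf_output.splitlines() if header.match(line.strip())
    )


# Sample text copied from the dump above (hex rows shortened to one each).
sample = """\
Hex dump of section '.maps':
0x00000000 00000000 00000000 00000000 00000000 ................
Hex dump of section '.maps':
0x00000000 00000000 00000000 00000000 00000000 ................
"""

print(count_section_dumps(sample, ".maps"))
```

A count greater than one for `.maps` is the symptom described here; a healthy object file would be expected to report one.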
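So the file at fault is /opt/datadog-agent/embedded/share/system-probe/ebpf/offset-guess.o
The file most likely went wrong at compile time, so I went back over the whole build of the eBPF modules and found that an error had been reported: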
[19/33] clang -MD -MF pkg/ebpf/bytecode/build/offset-guess-debug.bc.d -emit-llvm -D__KERNEL__ -DCONFIG_64BIT -D__BPF_TRACING__ -DKBUILD_MODNAME=\"ddsysprobe\" -Wno-unused-value -Wno-pointer-sign -Wno-compare-distinct-pointer-types -Wunused -Wall -Werror -include pkg/ebpf/c/asm_goto_workaround.h -O2 -fno-stack-protector -fno-color-diagnostics -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-jump-tables -fmerge-all-constants -Ipkg/ebpf/c -isystem/usr/src/linux-headers-5.15.0-52-generic/include -isystem/usr/src/linux-headers-5.15.0-52-generic/include/uapi -isystem/usr/src/linux-headers-5.15.0-52-generic/include/generated/uapi -isystem/usr/src/linux-headers-5.15.0-52-generic/arch/x86/include -isystem/usr/src/linux-headers-5.15.0-52-generic/arch/x86/include/uapi -isystem/usr/src/linux-headers-5.15.0-52-generic/arch/x86/include/generated -Ipkg/network/ebpf/c -g -DDEBUG=1 -c pkg/network/ebpf/c/prebuilt/offset-guess.c -o pkg/ebpf/bytecode/build/offset-guess-debug.bc
[20/33] cd pkg/network/http && CC=clang go tool cgo -godefs -- -fsigned-char http_types.go | go run /home/zhanglei/data/datadog-agent/pkg/ebpf/cgo/genpost.go > http_types_linux.go
cgo-builtin-prolog:1:10: fatal error: 'stddef.h' file not found
#include <stddef.h> /* for ptrdiff_t and size_t below */
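Once I saw this error, everything became simple: it means the clang version is too old and not compatible with the operating system. This matters because datadog relies on genpost to generate the cgo files, which are used for runtime compilation.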
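A quick way to confirm whether a given compiler can find stddef.h is to ask it to check a one-line program. The sketch below is an assumption-laden helper rather than part of the datadog build: the function name is hypothetical, and it relies only on clang's standard `-fsyntax-only` and `-x c` options with the source read from stdin.

```python
import subprocess


def can_include(header, compiler="clang"):
    """Return True if `compiler` can resolve `#include <header>` in a C file."""
    source = f"#include <{header}>\nint main(void) {{ return 0; }}\n"
    try:
        result = subprocess.run(
            [compiler, "-fsyntax-only", "-x", "c", "-"],
            input=source,
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Compiler missing, not executable, or hung: treat as "cannot include".
        return False
    return result.returncode == 0


print(can_include("stddef.h"))
```

Passing a different `compiler` value, for example the path of the clang-bpf binary, would show whether that specific binary has the same problem.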
The system-probe in datadog uses clang version 12, which is clearly incompatible with my host. The key piece of the build script is:
if clang_version_str != CLANG_VERSION:
    # download correct version from dd-agent-omnibus S3 bucket
    clang_url = f"https://dd-agent-omnibus.s3.amazonaws.com/llvm/clang-{CLANG_VERSION}.{arch}"
    ctx.run(f"{sudo} wget -q {clang_url} -O /opt/datadog-agent/embedded/bin/clang-bpf")
    ctx.run(f"{sudo} chmod 0755 /opt/datadog-agent/embedded/bin/clang-bpf")

if llc_version_str != CLANG_VERSION:
    llc_url = f"https://dd-agent-omnibus.s3.amazonaws.com/llvm/llc-{CLANG_VERSION}.{arch}"
    ctx.run(f"{sudo} wget -q {llc_url} -O /opt/datadog-agent/embedded/bin/llc-bpf")
    ctx.run(f"{sudo} chmod 0755 /opt/datadog-agent/embedded/bin/llc-bpf")
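The excerpt does not show how clang_version_str is obtained. One plausible way, sketched here under that assumption, is to pull the version number out of the text printed by `clang --version`. The value of EXPECTED_VERSION and both function names are hypothetical.

```python
import re

# Hypothetical pinned version; the real CLANG_VERSION value is not shown above.
EXPECTED_VERSION = "12.0.1"


def parse_clang_version(version_output):
    """Extract 'X.Y.Z' from the text printed by `clang --version`."""
    match = re.search(r"clang version (\d+\.\d+\.\d+)", version_output)
    return match.group(1) if match else None


def needs_download(version_output, expected=EXPECTED_VERSION):
    """Mirror the comparison above: any mismatch means the pinned build is fetched."""
    return parse_clang_version(version_output) != expected


host_output = "Ubuntu clang version 14.0.0-1ubuntu1\nTarget: x86_64-pc-linux-gnu"
print(parse_clang_version(host_output), needs_download(host_output))
```

With a comparison like this, a host compiler at version 14 never counts as a match, so the pinned version 12 binaries are the ones fetched into /opt/datadog-agent/embedded/bin.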
Since the clang 12 that datadog runs does not work on my host, which is an Ubuntu system, all I need to do is install clang and llvm versions that are compatible with it:
apt install clang
What got installed is clang-14, so I need to replace clang under /usr/bin with the one that is compatible with my host:
cp /usr/bin/clang-14 /usr/bin/clang
Rebuild datadog's system-probe module:
invoke system-probe.build
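The build finished without any errors. Running system-probe again also produced no errors, and network data from the eBPF layer is now collected normally.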