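The previous article, 《Impala查询卡顿分析案例》 (a case study of diagnosing a hung Impala query), described how to print the thread stacks of an Impala process. The JVM side is straightforward with jstack, but the C++ side needs gdb or breakpad and also requires building the source, which is tedious. This article shows how to print the thread stacks of an Impala process in a CDH cluster without building the source. A few tools still have to be downloaded the first time, so it helps to pick one fixed machine in the cluster for the setup and reuse it afterwards.
Log in to the machine running impalad and find the impalad process ID.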
$ ps aux | grep impalad
root 4374 0.0 0.0 12944 972 pts/0 S+ 16:49 0:00 grep --color=auto impalad
impala 29645 1.0 3.0 2999416 231972 ? Sl 16:17 0:20 /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad --flagfile=/run/cloudera-scm-agent/process/55-impala-IMPALAD/impala-conf/impalad_flags
impala 29652 0.0 0.1 197888 13556 ? Sl 16:17 0:00 python2.7 /usr/lib/cmf/agent/build/env/bin/cmf-redactor /usr/lib/cmf/service/impala/impala.sh impalad impalad_flags false
The process with PID 29645 is the impalad process. Send it SIGUSR1 to trigger a minidump:
$ kill -s SIGUSR1 29645
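The two commands above can also be scripted. The following is a minimal sketch, not part of Impala or Cloudera Manager: `find_impalad_pid` and `request_minidump` are hypothetical helper names, and the parsing assumes the default `ps aux` column layout shown above.

```python
import os
import signal


def find_impalad_pid(ps_output):
    """Return the PID of the impalad daemon found in `ps aux` output, or None."""
    for line in ps_output.splitlines():
        # `ps aux` prints 10 fixed columns followed by the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10].split()
        # Match on the executable itself, so that `grep impalad` and the
        # cmf-redactor wrapper (which only mention impalad as an argument)
        # are skipped.
        if command and os.path.basename(command[0]) == "impalad":
            return int(fields[1])
    return None


def request_minidump(pid):
    """Send SIGUSR1, the signal the article uses to trigger a minidump."""
    os.kill(pid, signal.SIGUSR1)
```

Feeding the `ps aux` output shown above into `find_impalad_pid` should select 29645. `request_minidump` only works when the caller is allowed to signal the impalad process (for example as root or the impala user).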
The following line then appears in /var/log/impalad/impalad.INFO:
Wrote minidump to /var/log/impala-minidumps/impalad/3745e5d7-9281-4548-2fd5b4b1-adc7f7eb.dmp
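When several dumps have been taken, it can be handy to pull the minidump paths out of the log instead of searching by eye. This is a small sketch that assumes the log line has exactly the "Wrote minidump to" form shown above; `minidump_paths` is a hypothetical helper name.

```python
import re

# Matches the log message shown above and captures the .dmp path.
MINIDUMP_RE = re.compile(r"Wrote minidump to (\S+\.dmp)")


def minidump_paths(log_text):
    """Return every minidump path reported in the log text, in log order."""
    return MINIDUMP_RE.findall(log_text)
```

The last element of the returned list is the most recent dump, provided the log is read from top to bottom.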
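The Impala source tree contains a script (bin/dump_breakpad_symbols.py) that generates symbol files in breakpad format. Download the Impala source that matches your version; the releases are listed on the cloudera GitHub release page: https://github.com/cloudera/Impala/releases
The CDH version in this example is 5.16.2, so download and extract https://github.com/cloudera/Impala/archive/cdh5.16.2-release.tar.gz (692MB)
Note: the cloudera Impala repo is large (15GB). If you only need the code of one version, there is no need to git clone.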
$ wget https://github.com/cloudera/Impala/archive/cdh5.16.2-release.tar.gz
$ tar zxf cdh5.16.2-release.tar.gz
$ cd Impala-cdh5.16.2-release
For bin/dump_breakpad_symbols.py to run, the environment has to be set up first. Make sure JAVA_HOME points to the right directory, then run
# Make sure JAVA_HOME is set and points to the right directory
$ export JAVA_HOME=/usr/java/jdk1.8.0_162-cloudera
$ source bin/impala-config.sh
# Users in mainland China can use the Aliyun PyPI mirror, but remember to apply the IMPALA-10994 patch
$ export PYPI_MIRROR="http://mirrors.aliyun.com/pypi"
$ $IMPALA_HOME/infra/python/deps/download_requirements
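Note: if you hit an error saying python_dateutil cannot be found, apply the IMPALA-10994 patch to fix it: https://gerrit.cloudera.org/c/17987/7/infra/python/deps/pip_download.py
Without git, you can edit the get_package_info function in infra/python/deps/pip_download.py by hand and change the “url = …” line as follows: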
 def get_package_info(pkg_name, pkg_version):
   # to sort them and return the first value in alphabetical order. This ensures that the
   # same result is always returned even if the ordering changed on the server.
   candidates = []
-  url = '{0}/simple/{1}/'.format(PYPI_MIRROR, pkg_name)
+  normalized_name = re.sub(r"[-_.]+", "-", pkg_name).lower()
+  url = '{0}/simple/{1}/'.format(PYPI_MIRROR, normalized_name)
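To see why the patch helps, the normalization can be tried in isolation. The sketch below reuses the regular expression from the diff; `normalize` and `simple_index_url` are hypothetical names for illustration, and the mirror URL is the one exported earlier. The rule (collapse runs of `-`, `_`, `.` into a single `-` and lowercase) is the package-name normalization described in PEP 503.

```python
import re

PYPI_MIRROR = "http://mirrors.aliyun.com/pypi"  # the mirror exported earlier


def normalize(pkg_name):
    """Collapse runs of '-', '_' and '.' into one '-' and lowercase the name."""
    return re.sub(r"[-_.]+", "-", pkg_name).lower()


def simple_index_url(pkg_name):
    """Build the simple-index URL the same way the patched line does."""
    return "{0}/simple/{1}/".format(PYPI_MIRROR, normalize(pkg_name))


print(simple_index_url("python_dateutil"))
```

With the unpatched code the URL would end in `python_dateutil/`, which a mirror that only serves normalized names may not resolve; after normalization it ends in `python-dateutil/`.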
Next, initialize breakpad in the toolchain with bin/bootstrap_toolchain.py
Normally this script downloads the whole toolchain, which takes a long time. Only breakpad is needed here, so modify bin/bootstrap_toolchain.py as follows:
   # LLVM and Kudu are the largest packages. Sort them first so that
   # their download starts as soon as possible.
-  packages = map(Package, ["llvm", "kudu",
-      "avro", "binutils", "boost", "breakpad", "bzip2", "cmake", "crcutil",
-      "flatbuffers", "gcc", "gflags", "glog", "gperftools", "gtest", "libev",
-      "lz4", "openldap", "openssl", "protobuf",
-      "rapidjson", "re2", "snappy", "thrift", "tpc-h", "tpc-ds", "zlib"])
-  packages.insert(0, Package("llvm", "3.9.1-asserts"))
+  packages = map(Package, ["breakpad"])
   bootstrap(toolchain_root, packages)
That is, in the last part of bootstrap_toolchain.py, remove all the other packages and keep only breakpad. Then run the script:
$ bin/bootstrap_toolchain.py
INFO:bootstrap_virtualenv:Creating python virtualenv
INFO:bootstrap_virtualenv:Installing packages into the virtualenv
INFO:bootstrap_virtualenv:Installing stage 2 packages into the virtualenv
2019-11-10 01:31:23,683 Thread-3 INFO: Downloading https://native-toolchain.s3.amazonaws.com/build/257-0847514126/breakpad/97a98836768f8f0154f8f86e5e14c2bb7e74132e-p2-gcc-4.9.2/breakpad-97a98836768f8f0154f8f86e5e14c2bb7e74132e-p2-gcc-4.9.2-ec2-package-ubuntu-16-04.tar.gz to /root/Impala-cdh5.16.2-release/toolchain/breakpad-97a98836768f8f0154f8f86e5e14c2bb7e74132e-p2-gcc-4.9.2-ec2-package-ubuntu-16-04.tar.gz (attempt 1)
2019-11-10 01:31:24,452 Thread-3 INFO: Extracting breakpad-97a98836768f8f0154f8f86e5e14c2bb7e74132e-p2-gcc-4.9.2-ec2-package-ubuntu-16-04.tar.gz
dump_breakpad_symbols.py is now ready to use. The earlier ps output showed that the executable is /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad, so generate symbol files for it and put them under /tmp/syms:
$ bin/dump_breakpad_symbols.py -f /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad -d /tmp/syms
INFO:root:Processing binary file: /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad
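Breakpad symbol files start with a header line of the form `MODULE <os> <arch> <id> <name>`, and the stack walker looks them up under `<symbol dir>/<name>/<id>/<name>.sym`. The sketch below maps a header line to that path so a generated file can be checked by hand; `expected_symbol_path` is a hypothetical helper name and the ID in the usage example is only an illustration.

```python
import posixpath


def expected_symbol_path(sym_root, module_line):
    """Map a breakpad MODULE header line to the path the stack walker looks up."""
    fields = module_line.split(None, 4)
    if len(fields) != 5 or fields[0] != "MODULE":
        raise ValueError("not a MODULE line: %r" % module_line)
    # Fields are: MODULE, operating system, architecture, debug ID, module name.
    debug_id = fields[3]
    name = fields[4].strip()
    return posixpath.join(sym_root, name, debug_id, name + ".sym")


print(expected_symbol_path(
    "/tmp/syms",
    "MODULE Linux x86_64 DD8351C4C1817BE1D142C187FA70CCAC0 impalad"))
```

The header line of a real file can be read with `head -1` on any `.sym` file under /tmp/syms.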
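Symbol files generated this way carry no file names or line numbers. To tie the stacks to the code as closely as possible, download and process the rpm/deb packages for your system. These packages are available at http://archive.cloudera.com; for example, the cdh5 packages for ubuntu are all under http://archive.cloudera.com/cdh5/ubuntu. This example uses ubuntu16.04, and the impala cdh packages of every version can be found under http://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/pool/contrib/i/impala. Download the following two files:
impala_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb
impala-dbg_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb
Then run dump_breakpad_symbols.py again: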
$ bin/dump_breakpad_symbols.py -r ~/Downloads/impala_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb -s ~/Downloads/impala-dbg_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb -d /tmp/syms
INFO:root:Extracting to /tmp/tmpBDEwFI: /home/quanlong/Downloads/impala_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb
INFO:root:Extracting to /tmp/tmpBDEwFI: /home/quanlong/Downloads/impala-dbg_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/libstdc++.so.6.0.20
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/libgcc_s.so.1
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/libkudu_client.so.0.1.0
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/libstdc++.so.6
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/libkudu_client.so.0
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/openssl/libssl.so
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/openssl/libcrypto.so
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/openssl/libcrypto.so.1.0.0
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/openssl/libssl.so.1.0.0
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/sbin-debug/libfesupport.so
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/sbin-debug/impalad
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/sbin-retail/libfesupport.so
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/sbin-retail/impalad
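The symbol information in /tmp/syms now includes file names and line numbers.
The minidump_stackwalk tool in the toolchain's breakpad directory resolves a minidump against the symbol files. Assuming the resolved output goes to /tmp/resolved.txt and the breakpad log goes to /tmp/breakpad.log, the command is: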
$ toolchain/breakpad-97a98836768f8f0154f8f86e5e14c2bb7e74132e-p2/bin/minidump_stackwalk /var/log/impala-minidumps/impalad/3745e5d7-9281-4548-2fd5b4b1-adc7f7eb.dmp /tmp/syms > /tmp/resolved.txt 2>/tmp/breakpad.log
The generated resolved.txt looks like this:
Operating system: Linux
0.0.0 Linux 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64
CPU: amd64
family 6 model 63 stepping 0
2 CPUs
GPU: UNKNOWN
Crash reason: DUMP_REQUESTED
Crash address: 0x217a097
Process uptime: not available
Thread 0 (crashed)
0 impalad!google_breakpad::ExceptionHandler::WriteMinidump() + 0x57
rax = 0x0000000002149a7e rdx = 0x0000000000000000
rcx = 0x000000000217a07f rbx = 0x0000000000000000
rsi = 0x0000000000000001 rdi = 0x00007ffed049f068
rbp = 0x00007ffed049f770 rsp = 0x00007ffed049efd0
r8 = 0x0000000000000000 r9 = 0x0000000000000024
r10 = 0x0000000002288a89 r11 = 0x0000000000000000
r12 = 0x00007ffed049f630 r13 = 0x0000000000d5cff0
r14 = 0x0000000000000000 r15 = 0x00007ffed049f690
rip = 0x000000000217a097
Found by: given as instruction pointer in context
1 impalad!google_breakpad::ExceptionHandler::WriteMinidump(std::string const&, bool (*)(google_breakpad::MinidumpDescriptor const&, void*, bool), void*) + 0xf0
rbx = 0x00007f92561325a0 rbp = 0x00007ffed049f770
rsp = 0x00007ffed049f620 r12 = 0x00007ffed049f630
r13 = 0x0000000000d5cff0 r14 = 0x0000000000000000
r15 = 0x00007ffed049f690 rip = 0x000000000217a960
Found by: call frame info
2 libpthread-2.23.so + 0x11390
rbx = 0x0000000000000000 rbp = 0x00007ffed049fdd0
rsp = 0x00007ffed049f780 r12 = 0x0000000007ada458
r13 = 0x0000000007ada480 r14 = 0x0000000000000000
r15 = 0x00007ffed049fdf0 rip = 0x00007f92556fe390
Found by: call frame info
3 impalad!boost::thread::join_noexcept() + 0x5c
rbp = 0x00007ffed049fdf0 rsp = 0x00007ffed049fde0
rip = 0x0000000001334cec
Found by: previous frame's frame pointer
4 impalad!impala::ThriftServer::Join() [thread.hpp : 767 + 0x8]
rbx = 0x000000000648b420 rbp = 0x00007ffed049fe80
rsp = 0x00007ffed049fe40 r12 = 0x00007f91fef44700
r13 = 0x00007ffed049ff20 r14 = 0x0000000006acbae0
r15 = 0x0000000000000002 rip = 0x0000000000b34f4f
Found by: call frame info
5 impalad!impala::ImpalaServer::Join() [impala-server.cc : 2151 + 0xc]
rbx = 0x0000000006621800 rbp = 0x00007ffed049feb0
rsp = 0x00007ffed049fe90 r12 = 0x00007ffed049ffb0
r13 = 0x00007ffed049ff20 r14 = 0x0000000006acbae0
r15 = 0x0000000000000002 rip = 0x0000000000c28f8a
Found by: call frame info
6 impalad!ImpaladMain(int, char**) [impalad-main.cc : 98 + 0xc]
rbx = 0x00007ffed049ff90 rbp = 0x00007ffed04a0130
rsp = 0x00007ffed049fec0 r12 = 0x00007ffed049ffb0
r13 = 0x00007ffed049ff20 r14 = 0x0000000006acbae0
r15 = 0x0000000000000002 rip = 0x0000000000c238e1
Found by: call frame info
......
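The first thread (Thread 0) is marked as crashed, but it is actually the thread that wrote the minidump; the Crash reason above already says DUMP_REQUESTED. When the process really crashes, a specific reason is reported instead.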
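A quick way to tell an on-demand dump from a real crash is to read the header of the resolved output. The sketch below assumes the `Crash reason:` and `Thread N` lines look like the sample above; `summarize` is a hypothetical helper name.

```python
import re


def summarize(resolved_text):
    """Extract the crash reason and the number of threads from resolved.txt."""
    reason = re.search(r"^Crash reason:[ \t]*(.+)$", resolved_text, re.M)
    threads = re.findall(r"^Thread (\d+)", resolved_text, re.M)
    text = reason.group(1).strip() if reason else None
    return {
        "crash_reason": text,
        "thread_count": len(threads),
        # DUMP_REQUESTED means the dump was requested, not caused by a crash.
        "on_demand": text == "DUMP_REQUESTED",
    }
```

For the sample above this would report `DUMP_REQUESTED` with `on_demand` set, so the "crashed" label on Thread 0 can be ignored.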
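The resolved output contains many register values, which get in the way when reading. They can be stripped out:
$ grep -v = /tmp/resolved.txt | grep -v 'Found by' | less
This gives a much more readable stack: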
Thread 119
0 libpthread-2.23.so + 0xd360
1 impalad!impala::io::DiskIoMgr::WorkLoop(impala::io::DiskIoMgr::DiskQueue*) [disk-io-mgr.cc : 977 + 0x5]
2 impalad!impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function, impala::ThreadDebugInfo const*, impala::Promise*) [function_template.hpp : 767 + 0x7]
3 impalad!boost::detail::thread_data, impala::ThreadDebugInfo const*, impala::Promise*), boost::_bi::list5, boost::_bi::value, boost::_bi::value >, boost::_bi::value, boost::_bi::value*> > > >::run() [bind.hpp : 525 + 0x6]
4 impalad!thread_proxy + 0xda
5 libpthread-2.23.so + 0x76ba
6 libc-2.23.so + 0x10741d
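The same filtering can be done in Python, for example as a first step before grouping identical stacks. This is a sketch that mirrors the two `grep -v` calls; `strip_registers` is a hypothetical helper name.

```python
def strip_registers(lines):
    """Yield only lines without '=' and without 'Found by', like the greps above."""
    for line in lines:
        if "=" in line or "Found by" in line:
            continue
        yield line
```

It can be applied to an open file, e.g. `for line in strip_registers(open("/tmp/resolved.txt")): ...`. Like the grep version, it also drops any frame whose symbol name happens to contain `=` (such as an `operator=` frame), which is a known limitation of this simple filter.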
If the resolved file has no function names, the symbol files and the minidump did not match. breakpad.log will probably contain lines like these:
2019-11-09 23:57:23: minidump_processor.cc:201: INFO: Looking at thread /var/log/impala-minidumps/impalad/9e41139b-a5b1-4f94-df3da8b6-c0c66040.dmp:0/155 id 0x73cd
2019-11-09 23:57:23: minidump.cc:473: INFO: MinidumpContext: looks like AMD64 context
2019-11-09 23:57:23: minidump.cc:473: INFO: MinidumpContext: looks like AMD64 context
2019-11-09 23:57:23: simple_symbol_supplier.cc:196: INFO: No symbol file at /tmp/syms/impalad/DD8351C4C1817BE1D142C187FA70CCAC0/impalad.sym
2019-11-09 23:57:23: stackwalker.cc:103: INFO: Couldn't load symbols for: /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad|DD8351C4C1817BE1D142C187FA70CCAC0
2019-11-09 23:57:23: simple_symbol_supplier.cc:196: INFO: No symbol file at /tmp/syms/libpthread-2.23.so/23E017CE2254FC6511D9BC8F534BB4F00/libpthread-2.23.so.sym
2019-11-09 23:57:23: stackwalker.cc:103: INFO: Couldn't load symbols for: /lib/x86_64-linux-gnu/libpthread-2.23.so|23E017CE2254FC6511D9BC8F534BB4F00
The key line is “No symbol file at /tmp/syms/impalad/DD…C0/impalad.sym”, which means the required symbol file was not found. Listing /tmp/syms/impalad confirms that the ID does not match:
$ ls /tmp/syms/impalad/
7F9EC4C10024BDC531665853311E9CCE0
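This happened because I picked the wrong impalad file to generate symbols from. The file to use is the one the impalad process is actually running, i.e. /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad
The CDH parcel directory contains several impalad files, so be careful not to pick the wrong one: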
$ find /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8 -name impalad
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-debug/impalad
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/debug/usr/lib/impala/sbin-retail/impalad
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/debug/usr/lib/impala/sbin-debug/impalad
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/bin/impalad
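The mismatch can be checked mechanically by comparing the IDs that breakpad.log asks for with the IDs present in the symbol directory. The sketch below assumes the log lines have the "No symbol file at" form shown above; `missing_symbols` and `unmatched_ids` are hypothetical helper names.

```python
import re

# Captures <symbol root>, <module name> and <debug ID> from the log message.
NO_SYMBOL_RE = re.compile(
    r"No symbol file at (\S+)/([^/\s]+)/([0-9A-F]+)/[^/\s]+\.sym")


def missing_symbols(log_text):
    """Return {module name: set of debug IDs} that could not be loaded."""
    missing = {}
    for _root, module, debug_id in NO_SYMBOL_RE.findall(log_text):
        missing.setdefault(module, set()).add(debug_id)
    return missing


def unmatched_ids(log_text, module, available_ids):
    """IDs the minidump needs for `module` that the symbol directory lacks."""
    wanted = missing_symbols(log_text).get(module, set())
    return sorted(wanted - set(available_ids))
```

Here `available_ids` would be the directory names under /tmp/syms/impalad (for example from `os.listdir`). A non-empty result for `impalad` means the symbols were generated from a different binary than the one that produced the minidump.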
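If possible, dump the symbols from the deb packages instead, since that yields more complete information; see section 2.2.2 (generating symbols from the deb packages).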
Steps:
For the environment setup steps, see the body of the article.