——————————————————————————————
Updated 2020-04-22: added sample code for enabling core dumps from the daemon process
——————————————————————————————
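In an earlier article (https://blog.csdn.net/xmcy001122/article/details/103743249) I implemented graceful shutdown in golang.
Recently the program started crashing in production, so after some searching on Baidu I added a defer block to print the error stack. After running for half a day the process had disappeared, yet the log contained no error at all. After repeated checking I finally tracked the problem down.
The problematic code: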
func main() {
	// ... business logic
	// This deferred function did not run when the process crashed
	defer func() {
		// If a crash triggered this, log the error
		err := recover()
		if err != nil {
			logger.Sugar.Error(err)
		}
	}()
	// Graceful shutdown
	// signal.Notify never blocks when sending, so the channel should be buffered
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)
	waitExit(c)
}
func waitExit(c chan os.Signal) {
	for i := range c {
		switch i {
		case syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT:
			logger.Sugar.Info("exit...")
			// Flush after the last log line so that it reaches the file
			_ = logger.Sugar.Sync()
			os.Exit(1)
		}
	}
}
When I ran the program and simulated a crash, the GoLand IDE printed the error, but the log still showed nothing.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13a79af]
goroutine 36 [running]:
robot_server/internal.(*RouteServerConn).onHandleMsgJoinNotify(0xc0001640d0, 0xc00000e700, 0xc0001a6010, 0xc, 0x27f0)
/Users/xuyc/repo/zhaogang.com/go/src/robot_server/internal/route_server_conn.go:92 +0x15f
robot_server/internal.(*RouteServerConn).onHandleData(0xc0001640d0, 0xc00000e700, 0xc0001a6010, 0xc, 0x27f0)
/Users/xuyc/repo/zhaogang.com/go/src/robot_server/internal/route_server_conn.go:65 +0x148
robot_server/base.(*ImClient).NetLoop(0xc0000b4180, 0xc0001ea000, 0x149ab3b)
/Users/xuyc/repo/zhaogang.com/go/src/robot_server/base/im_client.go:168 +0x1f6
robot_server/internal.read(0xc0001640d0, 0xc00016c0f0, 0xb, 0xc0001a200a)
/Users/xuyc/repo/zhaogang.com/go/src/robot_server/internal/route_server_conn.go:41 +0x3f1
created by robot_server/internal.(*RouteServerConn).Start
/Users/xuyc/repo/zhaogang.com/go/src/robot_server/internal/route_server_conn.go:26 +0x5d
Process finished with exit code 2
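One detail in the trace above explains the empty log: the panic happened in goroutine 36, which was created by RouteServerConn.Start, not in main. In Go, a deferred recover only intercepts panics raised on the goroutine that registered it, so the defer in main never sees this crash. The sketch below illustrates recovering inside the goroutine itself; safeCall is a hypothetical helper, not part of the original program.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// safeCall runs fn and converts a panic into an error that carries the
// stack, so a crash inside one goroutine can be logged by that goroutine.
func safeCall(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v\n%s", r, debug.Stack())
		}
	}()
	fn()
	return nil
}

func main() {
	done := make(chan error)
	go func() {
		// The recover lives in the same goroutine as the faulting code.
		done <- safeCall(func() {
			var p *int
			*p = 1 // nil pointer dereference, the same class of error as in the trace
		})
	}()
	fmt.Println(<-done)
}
```

Recovering this way lets the stack be logged and keeps the process alive, but fatal runtime errors such as concurrent map writes cannot be recovered, so the core dump approach described next remains useful.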
I tried several approaches and finally settled on having the system generate a core dump file.
1. Create coredump.sh with the following content:
#!/bin/bash
# Filename: coredump.sh
# Description: enable coredump and format the name of core file on centos system
# enable coredump with unlimited file-size for all users
echo -e "\n# enable coredump with unlimited file-size for all users\n* soft core unlimited" >> /etc/security/limits.conf
# set the path of core file with permission 777
cd /data && mkdir imcorefile && chmod 777 imcorefile
# format the name of core file.
# %% – a literal %
# %p – process ID
# %u – user ID of the process
# %g – group ID of the process
# %s – signal that triggered the core dump
# %t – time of the core dump (seconds since 0:00h, 1 Jan 1970)
# %h – hostname
# %e – executable file name
# for centos7 system(update 2017.4.2 21:44)
echo -e "\nkernel.core_pattern=/data/imcorefile/core-%e-%s-%u-%g-%p-%t" >> /etc/sysctl.conf
echo -e "\nkernel.core_uses_pid = 1" >> /etc/sysctl.conf
sysctl -p /etc/sysctl.conf
2. Enable core dumps permanently
chmod 777 coredump.sh
./coredump.sh
# reopen the terminal
ulimit -a
vim test.c
Enter the following content:
#include <stdio.h>
int main( int argc, char * argv[] ) { char a[1]; scanf( "%s", a ); return 0; }
gcc test.c -o test # compile
./test # run test, then type any string and press Enter, e.g. zhaogang.com
ls /data/imcorefile # if a core file named core-test-* appears in this directory, the setup works
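The same check can also be made from inside a Go service at startup, which helps when the service is launched by a script whose limits differ from the interactive shell. The following is a minimal sketch assuming a Linux host; coreLimit is a hypothetical helper name, and the program only reports the values, it does not change them.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// coreLimit returns the RLIMIT_CORE values of the current process,
// the same numbers that `ulimit -c` reports for a shell.
func coreLimit() (syscall.Rlimit, error) {
	var lim syscall.Rlimit
	err := syscall.Getrlimit(syscall.RLIMIT_CORE, &lim)
	return lim, err
}

func main() {
	lim, err := coreLimit()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	// A soft limit of 0 means the kernel will not write a core file.
	fmt.Printf("core soft limit: %d (0 means disabled)\n", lim.Cur)

	// On Linux the naming pattern set through sysctl is visible under /proc.
	if b, err := os.ReadFile("/proc/sys/kernel/core_pattern"); err == nil {
		fmt.Println("core_pattern:", strings.TrimSpace(string(b)))
	}
}
```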
With the settings above in place, prefix the process with env GOTRACEBACK=crash:
env GOTRACEBACK=crash ./robot_server
If the process crashes, a core file is generated under /data/imcorefile, for example:
[root@10-0-59-229 imcorefile]# ls
core-robot_server-6-0-0-25365-1587471512
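The file name follows the core-%e-%s-%u-%g-%p-%t pattern configured earlier, so it can be decoded back into its fields. The sketch below shows one way to do that; CoreInfo and parseCoreName are hypothetical names introduced for illustration. It parses from the right because an executable name may itself contain hyphens.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// CoreInfo holds the fields encoded by the pattern core-%e-%s-%u-%g-%p-%t.
type CoreInfo struct {
	Exe    string
	Signal int
	UID    int
	GID    int
	PID    int
	When   time.Time
}

// parseCoreName splits a core file name into its fields. The last five
// hyphen-separated parts are numeric; everything between "core" and them
// is the executable name.
func parseCoreName(name string) (CoreInfo, error) {
	parts := strings.Split(name, "-")
	if len(parts) < 7 || parts[0] != "core" {
		return CoreInfo{}, fmt.Errorf("unexpected core file name %q", name)
	}
	n := len(parts)
	nums := make([]int64, 5)
	for i := range nums {
		v, err := strconv.ParseInt(parts[n-5+i], 10, 64)
		if err != nil {
			return CoreInfo{}, fmt.Errorf("field %d of %q: %w", i, name, err)
		}
		nums[i] = v
	}
	return CoreInfo{
		Exe:    strings.Join(parts[1:n-5], "-"),
		Signal: int(nums[0]),
		UID:    int(nums[1]),
		GID:    int(nums[2]),
		PID:    int(nums[3]),
		When:   time.Unix(nums[4], 0).UTC(),
	}, nil
}

func main() {
	info, err := parseCoreName("core-robot_server-6-0-0-25365-1587471512")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", info)
}
```

For the file listed above, the signal field is 6, which is SIGABRT on Linux. That matches the sig=6 seen in the gdb and dlv stacks further down, where the runtime raises the signal itself after the panic.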
PS: for details on GOTRACEBACK, see the article "Go: generating a core file after a crash via GOTRACEBACK (gcore, gdb)".
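As an alternative to the environment variable, the standard library offers debug.SetTraceback in runtime/debug, which accepts the same level names as GOTRACEBACK. The sketch below assumes the call is placed at the very start of main; enableCrashTraceback is a hypothetical wrapper name. Whether a core file is actually written still depends on the ulimit and core_pattern settings above.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// enableCrashTraceback asks the runtime to behave as if GOTRACEBACK=crash
// had been set: on an unrecovered panic it prints all goroutines and then
// aborts the process so the operating system can write a core dump.
func enableCrashTraceback() {
	debug.SetTraceback("crash")
}

func main() {
	enableCrashTraceback()
	fmt.Println("traceback level set to crash")
	// ... start the service here
}
```

Setting the level in code means it does not depend on how the process was launched. According to the package documentation, a call with a level lower than the one given by the environment variable is ignored.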
gdb robot_server ../imcorefile/core-robot_server-6-0-0-25365-1587471512
Output:
(gdb) bt # bt prints the stack trace
#0 runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:150
#1 0x00000000004405bb in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:424
#2 0x0000000000440b3d in runtime.sigfwdgo (sig=6, info=0xc00015fd70, ctx=0xc00015fc40, ~r3=<optimized out>)
at /usr/local/go/src/runtime/signal_unix.go:629
#3 0x000000000043fc60 in runtime.sigtrampgo (sig=<optimized out>, info=0xc00015fd70, ctx=0xc00015fc40)
at /usr/local/go/src/runtime/signal_unix.go:289
#4 0x0000000000459d43 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:357
#5 0x0000000000459e30 in ?? () at /usr/local/go/src/runtime/sys_linux_amd64.s:441
#6 0x0000000000000000 in ?? ()
(gdb) eixt
Undefined command: "eixt". Try "help".
(gdb)
Undefined command: "eixt". Try "help".
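(gdb) quit # quit
Conclusion: the core file was generated, but gdb does not show the actual error, so the delve tool is needed.
1. Install
# delve has since moved to github.com/go-delve/delve; the path below is the one used at the time of writing
go get -u github.com/derekparker/delve/cmd/dlv
go env # check GOPATH, here GOPATH="/home/go"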
[root@10-0-59-229 robot-server.2020-04-14]# cd /home/go/
[root@10-0-59-229 go]# ls
bin src
[root@10-0-59-229 go]# cd bin
[root@10-0-59-229 bin]# ls
dlv
[root@10-0-59-229 bin]# cp dlv /usr/bin/ # so that dlv can be run directly
cp: overwrite "/usr/bin/dlv"? y
[root@10-0-59-229 bin]# dlv # run the dlv command
2. Debug
Format: dlv core <executable> <core> [flags]
[root@10-0-59-231 robot-server.2020-04-22]# dlv core robot_server /data/imcorefile/core-robot_server-6-0-0-32402-1587555657
Type 'help' for list of commands.
(dlv)
(dlv)
(dlv) bt # first use bt to show the stack trace
0 0x0000000000459be1 in runtime.raise
at /usr/local/go/src/runtime/sys_linux_amd64.s:150
1 0x000000000044074b in runtime.dieFromSignal
at /usr/local/go/src/runtime/signal_unix.go:424
2 0x0000000000440ccd in runtime.sigfwdgo
at /usr/local/go/src/runtime/signal_unix.go:629
3 0x000000000043fdf0 in runtime.sigtrampgo
at /usr/local/go/src/runtime/signal_unix.go:289
4 0x0000000000459ed3 in runtime.sigtramp
at /usr/local/go/src/runtime/sys_linux_amd64.s:357
5 0x0000000000459fc0 in runtime.sigreturn
at /usr/local/go/src/runtime/sys_linux_amd64.s:449
6 0x00000000004408ea in runtime.crash
at /usr/local/go/src/runtime/signal_unix.go:518
7 0x000000000042c1a4 in runtime.fatalpanic
at /usr/local/go/src/runtime/panic.go:717
8 0x000000000042bb55 in runtime.gopanic
at /usr/local/go/src/runtime/panic.go:565
9 0x0000000000440661 in runtime.panicmem
at /usr/local/go/src/runtime/panic.go:82
10 0x0000000000440661 in runtime.sigpanic
at /usr/local/go/src/runtime/signal_unix.go:390
11 0x00000000007aba4f in robot_server/internal.(*RouteServerConn).onHandleMsgJoinNotify
at /home/go/src/robot_server/internal/route_server_conn.go:97
12 0x00000000007ab598 in robot_server/internal.(*RouteServerConn).onHandleData
at /home/go/src/robot_server/internal/route_server_conn.go:65
13 0x00000000007aebd2 in robot_server/internal.(*RouteServerConn).onHandleData-fm
at /home/go/src/robot_server/internal/route_server_conn.go:60
14 0x000000000079bff6 in robot_server/base.(*ImClient).NetLoop
at /home/go/src/robot_server/base/im_client.go:168
15 0x00000000007ab3e1 in robot_server/internal.read
at /home/go/src/robot_server/internal/route_server_conn.go:41
16 0x00000000004582d1 in runtime.goexit
at /usr/local/go/src/runtime/asm_amd64.s:1337
As the output above shows, frame 11 is where the error occurred.
11 0x00000000007aba4f in robot_server/internal.(*RouteServerConn).onHandleMsgJoinNotify
at /home/go/src/robot_server/internal/route_server_conn.go:97
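If the source code is present on the machine (you can also copy the binary and the core file to the build machine and inspect them there), the exact faulting line can be displayed: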
[root@10-0-59-231 robot-server.2020-04-22]# dlv core robot_server /data/imcorefile/core-robot_server-6-0-0-32402-1587555657
Type 'help' for list of commands.
(dlv)
(dlv)
(dlv) bt # first use bt to show the stack trace
0 0x0000000000459be1 in runtime.raise
at /usr/local/go/src/runtime/sys_linux_amd64.s:150
1 0x000000000044074b in runtime.dieFromSignal
at /usr/local/go/src/runtime/signal_unix.go:424
2 0x0000000000440ccd in runtime.sigfwdgo
at /usr/local/go/src/runtime/signal_unix.go:629
3 0x000000000043fdf0 in runtime.sigtrampgo
at /usr/local/go/src/runtime/signal_unix.go:289
4 0x0000000000459ed3 in runtime.sigtramp
at /usr/local/go/src/runtime/sys_linux_amd64.s:357
5 0x0000000000459fc0 in runtime.sigreturn
at /usr/local/go/src/runtime/sys_linux_amd64.s:449
6 0x00000000004408ea in runtime.crash
at /usr/local/go/src/runtime/signal_unix.go:518
7 0x000000000042c1a4 in runtime.fatalpanic
at /usr/local/go/src/runtime/panic.go:717
8 0x000000000042bb55 in runtime.gopanic
at /usr/local/go/src/runtime/panic.go:565
9 0x0000000000440661 in runtime.panicmem
at /usr/local/go/src/runtime/panic.go:82
10 0x0000000000440661 in runtime.sigpanic
at /usr/local/go/src/runtime/signal_unix.go:390
11 0x00000000007aba4f in robot_server/internal.(*RouteServerConn).onHandleMsgJoinNotify
at /home/go/src/robot_server/internal/route_server_conn.go:97
12 0x00000000007ab598 in robot_server/internal.(*RouteServerConn).onHandleData
at /home/go/src/robot_server/internal/route_server_conn.go:65
13 0x00000000007aebd2 in robot_server/internal.(*RouteServerConn).onHandleData-fm
at /home/go/src/robot_server/internal/route_server_conn.go:60
14 0x000000000079bff6 in robot_server/base.(*ImClient).NetLoop
at /home/go/src/robot_server/base/im_client.go:168
15 0x00000000007ab3e1 in robot_server/internal.read
at /home/go/src/robot_server/internal/route_server_conn.go:41
16 0x00000000004582d1 in runtime.goexit
at /usr/local/go/src/runtime/asm_amd64.s:1337
(dlv) frame 11 # frame + frame number: show the source at the faulting frame
> runtime.raise() /usr/local/go/src/runtime/sys_linux_amd64.s:150 (PC: 0x459be1)
Warning: debugging optimized function
Frame 11: /home/go/src/robot_server/internal/route_server_conn.go:97 (PC: 7aba4f)
92: sessionId := *msg.SessionId
93: //if msg.SessionId != nil {
94: // sessionId = *msg.SessionId
95: //}
96:
=> 97: logger.Sugar.Infof("onHandleWelcomeMsgList from_id:%d,to_id:%d,session_type:%d,app_id:%d", *msg.UserId,
98: sessionId, *msg.SessionType, *msg.AppId)
99: DefaultFaqHttpQueryPool.PushGetWelcomeReq(header, msg, false)
100: }
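From the output above, the crash was most likely caused by msg.SessionId being a nil pointer.
Our services usually need to run in the background (as daemons), together with restart scripts such as restart.sh. So how should env GOTRACEBACK=crash be set inside the script?
I tried adding it directly in the shell script, but it did not seem to have any effect: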
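A common way to avoid this class of crash is to read optional pointer fields through a nil-safe helper instead of dereferencing them directly. The following is a minimal sketch: JoinNotify is a hypothetical stand-in for the real message type, whose definition is not shown in this article, and u64OrZero is a made-up helper name.

```go
package main

import "fmt"

// JoinNotify is a hypothetical message with optional (pointer) fields,
// standing in for the type handled by onHandleMsgJoinNotify.
type JoinNotify struct {
	UserId    *uint64
	SessionId *uint64
}

// u64OrZero returns the value p points to, or 0 when p is nil.
func u64OrZero(p *uint64) uint64 {
	if p == nil {
		return 0
	}
	return *p
}

func main() {
	uid := uint64(42)
	msg := &JoinNotify{UserId: &uid} // SessionId is deliberately left nil

	// Dereferencing msg.SessionId directly would panic here.
	fmt.Printf("from_id:%d,session_id:%d\n", u64OrZero(msg.UserId), u64OrZero(msg.SessionId))
}
```

If the message type is generated by protobuf, the generated Get methods are nil-safe in the same way and may be a better fit than a hand-written helper.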
function start() {
# has no effect
./daemon env GOTRACEBACK=crash ./robot_server --log_dir=${log_dir} --conf=robotserver.conf &
sleep 2
ps -ef | grep robot_server
}
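The correct approach is to set the environment variable in the daemon process after calling exec.Command:
cmd := exec.Command(fullPath, args...)
// Set this before Start() to enable the Go core dump.
// Start from os.Environ() so that the child keeps the rest of the environment.
cmd.Env = append(os.Environ(), "GOTRACEBACK=crash")
err := cmd.Start()
The complete code follows.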
package main
import (
	"flag"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)
var (
	appPath string
)
func main() {
	flag.Parse()
	// The first argument is the program to launch; the rest are passed through to it
	args := make([]string, 0)
	for i := range flag.Args() {
		if i == 0 {
			appPath = flag.Arg(i)
		} else {
			args = append(args, flag.Arg(i))
		}
	}
	//fmt.Println(flag.Args())
	if appPath == "" {
		fmt.Println("Usage:./daemon [appPath] [args]")
		return
	}
	fullPath, err := filepath.Abs(appPath)
	if err != nil {
		fmt.Printf("daemon error:%s \n", err.Error())
		return
	}
	cmd := exec.Command(fullPath, args...)
	// Enable the Go core dump; os.Environ() keeps the inherited environment
	cmd.Env = append(os.Environ(), "GOTRACEBACK=crash")
	err = cmd.Start()
	time.Sleep(time.Second)
	if err != nil {
		fmt.Printf("daemon error:%s \n", err.Error())
	} else {
		fmt.Println("daemon success")
	}
}
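One point about exec.Cmd is worth spelling out: the child inherits the parent's environment only when cmd.Env is nil. Once Env is set, the child sees exactly the listed entries, so the list should normally be built from os.Environ(). The sketch below shows one way to do that while replacing any existing value for the same key; withEnv is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// withEnv returns a copy of base in which key is set to value.
// Any existing entry for key is dropped first so the result has no duplicates.
func withEnv(base []string, key, value string) []string {
	prefix := key + "="
	out := make([]string, 0, len(base)+1)
	for _, kv := range base {
		if !strings.HasPrefix(kv, prefix) {
			out = append(out, kv)
		}
	}
	return append(out, prefix+value)
}

func main() {
	env := withEnv(os.Environ(), "GOTRACEBACK", "crash")
	// The entry for the key is always the last element of the result.
	fmt.Println(env[len(env)-1])
}
```

In a daemon like the one above, the result would be assigned before starting the child, for example cmd.Env = withEnv(os.Environ(), "GOTRACEBACK", "crash").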
#!/bin/bash
#nohup ./robot_server -log_dir=log -alsologtostderr=true &
log_dir=./log
function create_log_dir() {
    if [[ ! -d "${log_dir}" ]]; then
        mkdir -p "${log_dir}"
    fi
}
function start() {
    ./daemon ./robot_server --log_dir="${log_dir}" --conf=robotserver.conf &
    sleep 2
    ps -ef | grep robot_server
}
function stop() {
    if [[ -e server.pid ]]; then
        pid=$(cat server.pid)
        echo "kill pid=$pid"
        kill "${pid}"
    fi
}
stop
create_log_dir
start
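ulimit -c # check the core dump status: 0 means disabled, unlimited means enabled
vim /etc/profile
Add the following line to disable core dumps:
# No core files by default
ulimit -S -c 0 > /dev/null 2>&1
Check whether it took effect:
# reopen the terminal; to re-enable core dumps, comment out the line above and reopen the terminal
ulimit -c # an output of 0 means core dumps are disabled
The following comes from weixin_34167043's article; I have not tried it: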
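logFile, err := os.OpenFile("./log/fatal.log", os.O_CREATE|os.O_APPEND|os.O_RDWR, 0660)
if err != nil {
	log.Println("service failed to start", "could not open the crash log file", err)
	return
}
// Redirect the process's standard error to the file; when the process crashes, the runtime writes the goroutine stack traces to this file
syscall.Dup2(int(logFile.Fd()), int(os.Stderr.Fd()))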
Delve, a debugging tool for golang
https://www.cnblogs.com/li-peng/p/8522592.html
golang coredump analysis
https://blog.csdn.net/yunlilang/article/details/83014468?depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-9&utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-9
How to record the exception when a golang program crashes from an unknown error
https://blog.csdn.net/weixin_34167043/article/details/94651600?depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-3&utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-3
golang runtime
https://golang.org/src/runtime/crash_unix_test.go?m=text
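A plug for my own open-source IM, written entirely in Golang:
CoffeeChat:
https://github.com/xmcy0011/CoffeeChat
opensource im with server(go) and client(flutter+swift)
It draws on well-known projects such as TeamTalk and 瓜子IM, and includes a server (go) and a client (flutter). One-to-one chat and robot chat (小微, 图灵, 思知) are finished, and group chat is under development. If you are interested in golang or cross-platform development with flutter, a Star and a follow are welcome.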
————————————————
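Copyright notice: this is an original article by CSDN blogger 「许非」, licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.
Original link: https://blog.csdn.net/xmcy001122/article/details/103921991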