Golang: how to output error logs after a process crashes?


  • Background
  • Solution
    • Enable core dump
    • Run
    • Debugging
      • gdb
      • Delve
    • Setting it up under a daemon
      • daemon.go
  • Appendix
    • restart.sh script
    • Disable core dump
    • Redirection approach
  • References
  • About

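Background

------------------------------
Update 2020-04-22: added sample code for enabling core dumps in a daemon process
------------------------------

In an earlier article, https://blog.csdn.net/xmcy001122/article/details/103743249, I implemented graceful shutdown in golang.

Recently the program started crashing in production, so after some searching I added a deferred function to log the error stack. After half a day of running, the process had disappeared, yet the log contained no error at all. After repeated checking I finally tracked the problem down.

The problematic code: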

func main() {
	// ... business logic

	// This deferred function did not run when the process crashed
	defer func() {
		// If the exit was caused by a panic, log the error
		err := recover()
		if err != nil {
			logger.Sugar.Error(err)
		}
	}()

	// Graceful shutdown
	c := make(chan os.Signal, 1) // buffered: signal.Notify does not block when sending
	signal.Notify(c, syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)
	waitExit(c)
}

func waitExit(c chan os.Signal) {
	for i := range c {
		switch i {
		case syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT:
			logger.Sugar.Info("exit...")
			_ = logger.Sugar.Sync() // flush buffered log entries before exiting

			os.Exit(1)
		}
	}
}

When I ran it and simulated a crash, the GoLand IDE printed the error, but the log still contained no error output.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13a79af]

goroutine 36 [running]:
robot_server/internal.(*RouteServerConn).onHandleMsgJoinNotify(0xc0001640d0, 0xc00000e700, 0xc0001a6010, 0xc, 0x27f0)
        /Users/xuyc/repo/zhaogang.com/go/src/robot_server/internal/route_server_conn.go:92 +0x15f
robot_server/internal.(*RouteServerConn).onHandleData(0xc0001640d0, 0xc00000e700, 0xc0001a6010, 0xc, 0x27f0)
        /Users/xuyc/repo/zhaogang.com/go/src/robot_server/internal/route_server_conn.go:65 +0x148
robot_server/base.(*ImClient).NetLoop(0xc0000b4180, 0xc0001ea000, 0x149ab3b)
        /Users/xuyc/repo/zhaogang.com/go/src/robot_server/base/im_client.go:168 +0x1f6
robot_server/internal.read(0xc0001640d0, 0xc00016c0f0, 0xb, 0xc0001a200a)
        /Users/xuyc/repo/zhaogang.com/go/src/robot_server/internal/route_server_conn.go:41 +0x3f1
created by robot_server/internal.(*RouteServerConn).Start
        /Users/xuyc/repo/zhaogang.com/go/src/robot_server/internal/route_server_conn.go:26 +0x5d

Process finished with exit code 2
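
The trace above shows the panic on goroutine 36, while the deferred recover sits in main. A recover only stops a panic raised on its own goroutine, which would explain why the deferred function never logged anything. Below is a minimal sketch of giving each goroutine its own recover. goSafe is a hypothetical helper that is not part of the original project, and the standard log package stands in for logger.Sugar.

```go
package main

import (
	"log"
	"runtime/debug"
)

// goSafe is a hypothetical helper: it runs fn on a new goroutine, recovers any
// panic raised there, logs it with its stack, and reports the recovered value
// (nil when fn returned normally) on the returned channel.
func goSafe(fn func()) <-chan interface{} {
	result := make(chan interface{}, 1)
	go func() {
		defer func() {
			r := recover()
			if r != nil {
				log.Printf("goroutine panic: %v\n%s", r, debug.Stack())
			}
			result <- r
		}()
		fn()
	}()
	return result
}

func main() {
	// Simulate the crash: write through a nil pointer on a worker goroutine.
	recovered := <-goSafe(func() {
		var p *int
		*p = 1
	})
	log.Printf("worker finished, recovered value: %v", recovered)
}
```

Recovering keeps the process alive, which is not always what you want after a nil dereference; the core dump approach described next lets the process die while preserving its state for later inspection.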

Solution

I tried several approaches and settled on having the system generate a core dump file.

Enable core dump

1. Create coredump.sh with the following content:

#!/bin/bash

# Filename: coredump.sh
# Description: enable coredump and format the name of core file on centos system

# enable coredump with unlimited file-size for all users
echo -e "\n# enable coredump with unlimited file-size for all users\n* soft core unlimited" >> /etc/security/limits.conf

# set the path of core file with permission 777
cd /data && mkdir imcorefile && chmod 777 imcorefile

# format the name of core file.
# %% - a literal % character
# %p - process ID
# %u - user ID of the process
# %g - group ID of the process
# %s - signal that triggered the core dump
# %t - time of the dump (seconds since 0:00h, 1 Jan 1970)
# %h - hostname
# %e - executable file name
# for centos7 system(update 2017.4.2 21:44)
echo -e "\nkernel.core_pattern=/data/imcorefile/core-%e-%s-%u-%g-%p-%t" >> /etc/sysctl.conf
echo -e "\nkernel.core_uses_pid = 1" >> /etc/sysctl.conf

sysctl -p /etc/sysctl.conf

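2. Permanently enable core dump

chmod 777 coredump.sh
./coredump.sh
# reopen the terminal
ulimit -a

It worked if the output reports the core file size as unlimited:

[Figure: screenshot of the ulimit -a output]

3. Verify (skipping this step is not recommended)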

vim test.c 

Enter the following:

#include <stdio.h>
int main( int argc, char * argv[] ) { char a[1]; scanf( "%s", a ); return 0; }

gcc test.c -o test  # compile
./test              # run test, type any string and press Enter, e.g. zhaogang.com

ls /data/imcorefile # success if a core file named core-test-* appears in this directory
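
A Go service can also check at startup whether the core limit configured above actually applies to it. The following is a minimal sketch, assuming a Unix-like system where syscall.Getrlimit is available; coreLimit is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"log"
	"syscall"
)

// coreLimit returns the soft and hard RLIMIT_CORE values of the current process.
// A soft limit of 0 means the kernel will not write a core file for it.
func coreLimit() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_CORE, &rl); err != nil {
		return 0, 0, err
	}
	return uint64(rl.Cur), uint64(rl.Max), nil
}

func main() {
	soft, hard, err := coreLimit()
	if err != nil {
		log.Fatalf("getrlimit: %v", err)
	}
	if soft == 0 {
		log.Println("warning: core dumps are disabled (soft RLIMIT_CORE is 0)")
	}
	// An unlimited value is printed as a very large number.
	fmt.Printf("RLIMIT_CORE soft=%d hard=%d\n", soft, hard)
}
```

Logging this once at startup makes it obvious when a process launched by a script or daemon did not inherit the limit set for interactive shells.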

Run

With the setup above in place, prefix the command that starts the process with env GOTRACEBACK=crash:

env GOTRACEBACK=crash ./robot_server

If the process crashes, a core file is generated in /data/imcorefile, for example:

[root@10-0-59-229 imcorefile]# ls
core-robot_server-6-0-0-25365-1587471512

PS: for more on GOTRACEBACK, see the article "Go: generating a core file after a program crash via GOTRACEBACK (gcore, gdb)".
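
The same traceback level can also be requested from inside the program with debug.SetTraceback from the standard runtime/debug package, which takes the same values as the GOTRACEBACK environment variable. This is a minimal sketch rather than what the original service does; enableCrashTraceback is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// enableCrashTraceback asks the runtime to behave as if GOTRACEBACK=crash were
// set: on an unrecovered panic it prints the traceback and then aborts, so the
// operating system can write a core file if the core limit allows it.
func enableCrashTraceback() {
	debug.SetTraceback("crash")
}

func main() {
	enableCrashTraceback()
	fmt.Println("traceback level set to crash")
	// ... start the service here ...
}
```

Calling this early in main removes the dependency on how the process is launched, which helps when the start command is wrapped by scripts or a daemon.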

Debugging

gdb

gdb robot_server ../imcorefile/core-robot_server-6-0-0-25365-1587471512

Output:

(gdb) bt  # bt prints the backtrace
#0  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:150
#1  0x00000000004405bb in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:424
#2  0x0000000000440b3d in runtime.sigfwdgo (sig=6, info=0xc00015fd70, ctx=0xc00015fc40, ~r3=<optimized out>)
    at /usr/local/go/src/runtime/signal_unix.go:629
#3  0x000000000043fc60 in runtime.sigtrampgo (sig=<optimized out>, info=0xc00015fd70, ctx=0xc00015fc40)
    at /usr/local/go/src/runtime/signal_unix.go:289
#4  0x0000000000459d43 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:357
#5  0x0000000000459e30 in ?? () at /usr/local/go/src/runtime/sys_linux_amd64.s:441
#6  0x0000000000000000 in ?? ()
(gdb) quit  # exit

Conclusion: a core file was generated, but this backtrace only shows the runtime's signal-handling frames, not the application code that crashed, so a tool like delve is needed.

Delve

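1. Install

go get -u github.com/derekparker/delve/cmd/dlv
# Delve now lives at github.com/go-delve/delve; with Go 1.16 or later, install it with:
# go install github.com/go-delve/delve/cmd/dlv@latest

go env # check GOPATH; here GOPATH="/home/go"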

[root@10-0-59-229 robot-server.2020-04-14]# cd /home/go/
[root@10-0-59-229 go]# ls
bin  src
[root@10-0-59-229 go]# cd bin
[root@10-0-59-229 bin]# ls
dlv
[root@10-0-59-229 bin]# cp dlv /usr/bin/   # now dlv can be run directly
cp: overwrite "/usr/bin/dlv"? y
[root@10-0-59-229 bin]# dlv # run the dlv command

2. Debug

Usage: dlv core <executable> <core> [flags]

[root@10-0-59-231 robot-server.2020-04-22]# dlv core robot_server /data/imcorefile/core-robot_server-6-0-0-32402-1587555657
Type 'help' for list of commands.
(dlv)
(dlv)
(dlv) bt # start with bt to print the backtrace
 0  0x0000000000459be1 in runtime.raise
    at /usr/local/go/src/runtime/sys_linux_amd64.s:150
 1  0x000000000044074b in runtime.dieFromSignal
    at /usr/local/go/src/runtime/signal_unix.go:424
 2  0x0000000000440ccd in runtime.sigfwdgo
    at /usr/local/go/src/runtime/signal_unix.go:629
 3  0x000000000043fdf0 in runtime.sigtrampgo
    at /usr/local/go/src/runtime/signal_unix.go:289
 4  0x0000000000459ed3 in runtime.sigtramp
    at /usr/local/go/src/runtime/sys_linux_amd64.s:357
 5  0x0000000000459fc0 in runtime.sigreturn
    at /usr/local/go/src/runtime/sys_linux_amd64.s:449
 6  0x00000000004408ea in runtime.crash
    at /usr/local/go/src/runtime/signal_unix.go:518
 7  0x000000000042c1a4 in runtime.fatalpanic
    at /usr/local/go/src/runtime/panic.go:717
 8  0x000000000042bb55 in runtime.gopanic
    at /usr/local/go/src/runtime/panic.go:565
 9  0x0000000000440661 in runtime.panicmem
    at /usr/local/go/src/runtime/panic.go:82
10  0x0000000000440661 in runtime.sigpanic
    at /usr/local/go/src/runtime/signal_unix.go:390
11  0x00000000007aba4f in robot_server/internal.(*RouteServerConn).onHandleMsgJoinNotify
    at /home/go/src/robot_server/internal/route_server_conn.go:97
12  0x00000000007ab598 in robot_server/internal.(*RouteServerConn).onHandleData
    at /home/go/src/robot_server/internal/route_server_conn.go:65
13  0x00000000007aebd2 in robot_server/internal.(*RouteServerConn).onHandleData-fm
    at /home/go/src/robot_server/internal/route_server_conn.go:60
14  0x000000000079bff6 in robot_server/base.(*ImClient).NetLoop
    at /home/go/src/robot_server/base/im_client.go:168
15  0x00000000007ab3e1 in robot_server/internal.read
    at /home/go/src/robot_server/internal/route_server_conn.go:41
16  0x00000000004582d1 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1337

As the output above shows, frame 11 is where the error occurred.

11  0x00000000007aba4f in robot_server/internal.(*RouteServerConn).onHandleMsgJoinNotify
    at /home/go/src/robot_server/internal/route_server_conn.go:97

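If the source code is available on the machine running the process (you can also copy the binary and the core file to the build machine), the exact failing line can be displayed: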

[root@10-0-59-231 robot-server.2020-04-22]# dlv core robot_server /data/imcorefile/core-robot_server-6-0-0-32402-1587555657
Type 'help' for list of commands.
(dlv)
(dlv) frame 11  # frame + frame number shows the source at that frame
> runtime.raise() /usr/local/go/src/runtime/sys_linux_amd64.s:150 (PC: 0x459be1)
Warning: debugging optimized function
Frame 11: /home/go/src/robot_server/internal/route_server_conn.go:97 (PC: 7aba4f)
    92:		sessionId := *msg.SessionId
    93:		//if msg.SessionId != nil {
    94:		//	sessionId = *msg.SessionId
    95:		//}
    96:
=>  97:		logger.Sugar.Infof("onHandleWelcomeMsgList from_id:%d,to_id:%d,session_type:%d,app_id:%d", *msg.UserId,
    98:			sessionId, *msg.SessionType, *msg.AppId)
    99:		DefaultFaqHttpQueryPool.PushGetWelcomeReq(header, msg, false)
   100:	}

From the above, the crash was most likely caused by msg.SessionId being a nil pointer.
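
One way to guard against this kind of crash is to read optional pointer fields through nil-safe getters instead of dereferencing them directly. The sketch below uses joinNotify, a hypothetical stand-in for the real message type, whose definition is not shown in this article.

```go
package main

import "fmt"

// joinNotify is a hypothetical stand-in for the real message type; only the
// two fields needed for the illustration are modelled.
type joinNotify struct {
	UserId    *uint64
	SessionId *uint64
}

// GetUserId returns the user ID, or 0 when the message or the field is nil.
func (m *joinNotify) GetUserId() uint64 {
	if m == nil || m.UserId == nil {
		return 0
	}
	return *m.UserId
}

// GetSessionId returns the session ID, or 0 when the message or the field is nil.
func (m *joinNotify) GetSessionId() uint64 {
	if m == nil || m.SessionId == nil {
		return 0
	}
	return *m.SessionId
}

func main() {
	msg := &joinNotify{} // SessionId was never set by the sender
	fmt.Printf("from_id:%d,to_id:%d\n", msg.GetUserId(), msg.GetSessionId())
}
```

Whether a zero value is an acceptable fallback depends on the business logic; returning an error for a missing field is an equally valid choice.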

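Setting it up under a daemon

Our services usually need to run in the background (as daemons), together with a restart script such as restart.sh. So how do we set env GOTRACEBACK=crash from the script?

I tried adding it directly in the shell script, but it did not seem to work. With the daemon.go shown further down, the first argument (here env) is taken as the application path and resolved relative to the current directory, so the command never goes through env: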

function start() {
    # does not work
    ./daemon env GOTRACEBACK=crash ./robot_server --log_dir=${log_dir} --conf=robotserver.conf &

    sleep 2
    ps -ef | grep robot_server
}

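The correct approach is to set the environment variable on the exec.Command inside the daemon process. Appending to the nil cmd.Env would leave the child with GOTRACEBACK as its only variable, so start from os.Environ():

cmd := exec.Command(fullPath, args...)
// Set this before Start() to enable Go core dumps
cmd.Env = append(os.Environ(), "GOTRACEBACK=crash")
err := cmd.Start()

The full code follows.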

daemon.go

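package main

import (
	"flag"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

var (
	appPath string
)

func main() {
	flag.Parse()

	// The first positional argument is the application path; the rest are passed to it
	args := make([]string, 0)
	for i := range flag.Args() {
		if i == 0 {
			appPath = flag.Arg(i)
		} else {
			args = append(args, flag.Arg(i))
		}
	}

	//fmt.Println(flag.Args())
	if appPath == "" {
		fmt.Println("Usage:./daemon [appPath] [args]")
		return
	}

	fullPath, err := filepath.Abs(appPath)
	if err != nil {
		fmt.Printf("daemon error:%s \n", err.Error())
		return
	}
	cmd := exec.Command(fullPath, args...)
	// Enable Go core dumps. Start from os.Environ() so the child keeps the
	// parent's environment; appending to the nil cmd.Env would leave it with
	// GOTRACEBACK as its only variable.
	cmd.Env = append(os.Environ(), "GOTRACEBACK=crash")
	err = cmd.Start()
	time.Sleep(time.Second)

	if err != nil {
		fmt.Printf("daemon error:%s \n", err.Error())
	} else {
		fmt.Println("daemon success")
	}
}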
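
If the parent environment may already define GOTRACEBACK with another value, the override can be made explicit instead of relying on the position of the entry in the slice. This is a minimal sketch; withEnv is a hypothetical helper and is not part of daemon.go.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// withEnv returns a copy of base in which every existing entry for key is
// dropped and a single key=value entry is appended at the end.
func withEnv(base []string, key, value string) []string {
	prefix := key + "="
	out := make([]string, 0, len(base)+1)
	for _, kv := range base {
		if !strings.HasPrefix(kv, prefix) {
			out = append(out, kv)
		}
	}
	return append(out, prefix+value)
}

func main() {
	env := withEnv(os.Environ(), "GOTRACEBACK", "crash")
	// The last entry is always the one that was just set.
	fmt.Println(env[len(env)-1])
}
```

In the daemon, the result would be assigned to cmd.Env before calling cmd.Start().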

Appendix

restart.sh script

#!/bin/bash
#nohup ./robot_server -log_dir=log -alsologtostderr=true &

log_dir=./log

function create_log_dir() {
    if [[ ! -d "${log_dir}" ]];then
        mkdir -p ${log_dir}
    fi
}

function start() {
    ./daemon ./robot_server --log_dir=${log_dir} --conf=robotserver.conf &

    sleep 2
    ps -ef | grep robot_server
}

function stop() {
    if [[ -e server.pid  ]]; then
        pid=`cat server.pid`
        echo "kill pid=$pid"
        kill ${pid}
    fi
}

stop
create_log_dir
start

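Disable core dump

ulimit -c        # show the core dump status: 0 means disabled, unlimited means enabled
vim /etc/profile

Add the following line:

# No core files by default
ulimit -S -c 0 > /dev/null 2>&1

Check that it took effect:

# Reopen the terminal. To re-enable core dumps, comment out the line above and reopen the terminal.
ulimit -c        # an output of 0 means core dumps are disabled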

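Redirection approach

Based on the article by weixin_34167043 (listed in the references); I have not tried this:

logFile, err := os.OpenFile("./log/fatal.log", os.O_CREATE|os.O_APPEND|os.O_RDWR, 0660)
if err != nil {
    log.Println("service failed to start:", "could not open the crash log file", err)
    return
}
// Redirect the process's standard error to the file; when the process crashes, the runtime writes the goroutine stack traces to it
syscall.Dup2(int(logFile.Fd()), int(os.Stderr.Fd()))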
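
syscall.Dup2 is not available on every platform (linux/arm64 is one example), so when the service is launched by a daemon like the one above, an alternative is to let the parent wire the child's standard error to a file. The following is a minimal sketch of that idea, not the method from the referenced article; startWithCrashLog and its argument layout are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// startWithCrashLog is a hypothetical helper: it starts name with args, sets
// GOTRACEBACK=crash for the child, and appends the child's standard error
// (where the Go runtime prints panic tracebacks) to logPath.
func startWithCrashLog(logPath, name string, args ...string) (*exec.Cmd, error) {
	logFile, err := os.OpenFile(logPath, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0660)
	if err != nil {
		return nil, fmt.Errorf("open crash log: %w", err)
	}
	// The child receives its own copy of the descriptor, so the parent can
	// close its handle once Start has returned.
	defer logFile.Close()

	cmd := exec.Command(name, args...)
	cmd.Env = append(os.Environ(), "GOTRACEBACK=crash")
	cmd.Stderr = logFile
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("start %s: %w", name, err)
	}
	return cmd, nil
}

func main() {
	if len(os.Args) < 3 {
		fmt.Println("Usage: ./daemon [logPath] [appPath] [args]")
		return
	}
	if _, err := startWithCrashLog(os.Args[1], os.Args[2], os.Args[3:]...); err != nil {
		fmt.Printf("daemon error: %v\n", err)
		return
	}
	fmt.Println("daemon success")
}
```

With this in place, the panic message and goroutine traces end up in the log file even when no core file is written.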

References

Delve, a debugging tool for golang
https://www.cnblogs.com/li-peng/p/8522592.html

golang coredump analysis
https://blog.csdn.net/yunlilang/article/details/83014468?depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-9&utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-9

How to record the error when a golang program crashes from an unknown cause
https://blog.csdn.net/weixin_34167043/article/details/94651600?depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-3&utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-3

golang runtime
https://golang.org/src/runtime/crash_unix_test.go?m=text

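About

A plug for my own open-source IM, written entirely in Golang:

CoffeeChat:
https://github.com/xmcy0011/CoffeeChat
opensource im with server(go) and client(flutter+swift)

It draws on well-known projects such as TeamTalk and Guazi IM, and includes a server (go) and a client (flutter). One-to-one chat and chatbot (Xiaowei, Turing, Sizhi) chat are finished, and group chat is under development. If you are interested in golang and cross-platform development with flutter, stars and follows are welcome.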

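----------------
Copyright notice: this is an original article by CSDN blogger "许非", licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/xmcy001122/article/details/103921991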
