The CGO segmentation violation problem

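Today I added a concurrent test case to verify that the newly added Copy On Write (COW) logic is correct under concurrency. After running go test -v, the test case crashed outright, and the pitch-black terminal showed the following error: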

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0x9e8b6c]

Judging from the output, the key piece of information is segmentation violation, more commonly known as a segmentation fault.

So what is a segmentation violation, and why does it occur? After some searching I found an article whose explanation of segmentation violation I consider the most fitting. Here is an excerpt:

A "segmentation violation" signal is sent to a process of which the memory management unit detected an attempt to use a memory address that does not belong to it.

Source: What is a "segmentation violation"?

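Modern hardware includes a memory management unit (MMU) that protects memory accesses so that processes cannot modify each other's memory. When the MMU detects that a process is trying to access memory that does not belong to it (an invalid memory reference), it raises a fault, the kernel delivers a SIGSEGV signal to the process, and the process fails with a segmentation violation.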

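Readers who understand how goroutines are implemented may ask at this point: why would a test case written in Go hit this error? Go is a garbage-collected language whose runtime manages memory allocation and reclamation. Even with concurrent calls, as long as pointer accesses are safe, the worst that should happen is a race condition, not an invalid memory access.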

Yes, normally that is the case. Before really analyzing the problem, though, let me describe its background to give you an intuitive picture.
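The expectation behind that question can be checked with a small sketch. In ordinary Go code an invalid pointer access also starts out as a SIGSEGV, but the runtime turns it into a panic that recover can catch, rather than the fatal error shown above. The type and function names below are made up for illustration.

```go
package main

import "fmt"

// order is a hypothetical type used only for this illustration.
type order struct{ id int }

// readID dereferences p and converts a runtime panic into an error.
func readID(p *order) (id int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return p.id, nil
}

func main() {
	_, err := readID(nil)
	// prints: recovered: runtime error: invalid memory address or nil pointer dereference
	fmt.Println(err)
}
```

A fault raised while the thread is executing C code through cgo is, as far as I understand the runtime, not converted into a recoverable panic like this, which would explain the "unexpected signal during runtime execution" message instead of an ordinary panic.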

Background

Here is the earlier, non-concurrent test case (this one works correctly):

func TestDispatch_V710(t *testing.T) {
    gen := datacenter.Gen{
        RealOrderCount:         60,
        RelayOrderCount:        20,
        ShortAppointOrderCount: 15,
        LongAppointOrderCount:  5,
    }

    // Generate the data
    gen.Do(nil, nil)

    // Initialize the strategy engine
    if err := strategy.Init("../../conf/strategy_engine_conf.yaml"); err != nil {
        t.Fatal(err)
    }

    // Simulate the strategy computation
    dataCenter := gen.GetDataCenter()
    utils.NewSimulationStrategy(dataCenter, nil, strategy.GetStrategyTree()).Do()

    // Let the algorithm engine compute the optimal matching
    dispatch := optimal.Dispatch{}
    dispatch.OptimalDispatch(dataCenter)
}

Here is the test case after adding concurrency:

func TestDispatch_OnApolloChanged_V710(t *testing.T) {
    // Initialize the strategy engine
    if err := strategy.Init("../../conf/strategy_engine_conf.yaml"); err != nil {
        t.Fatal(err)
    }

    manager, err := pkgUtils.NewApolloManager(&pkgUtils.ApolloManagerConfig{
        ConfigServerURL: "http://192.168.205.10:8080",
        AppID:           "strategydispatch",
        Cluster:         "default",
        Namespaces:      []string{strategy.ApolloNamespace},
        BackupFile:      "",
        IP:              "",
        AccessKey:       "",
    })
    if err != nil {
        t.Fatal(err)
    }

    // Register the callback for strategy engine configuration events
    manager.RegisterHandler(strategy.ApolloNamespace, strategy.ApolloNotifyHandler, pkgUtils.ApolloErrHandler)

    go manager.Run()

    wg := sync.WaitGroup{}
    // Test concurrent execution
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Run the computation 3 times, 5s apart, to exercise Apollo config changes
            for j := 0; j < 3; j++ {
                gen := datacenter.Gen{
                    RealOrderCount:         60,
                    RelayOrderCount:        20,
                    ShortAppointOrderCount: 15,
                    LongAppointOrderCount:  5,
                }

                // Generate the data
                gen.Do(nil, nil)

                // Simulate the strategy computation
                dataCenter := gen.GetDataCenter()
                utils.NewSimulationStrategy(dataCenter, nil, strategy.GetStrategyTree()).Do()

                // Let the algorithm engine compute the optimal matching
                dispatch := optimal.Dispatch{}
                dispatch.OptimalDispatch(dataCenter)
                fmt.Println(dataCenter.DispatchResult)
                time.Sleep(time.Second * 5)
            }
        }()
    }

    wg.Wait()
}

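Look closely at the code and you will see that the variables are initialized inside the goroutine, which means they are all local to that goroutine's stack. The only shared value is strategy.GetStrategyTree(), and that one is shared on purpose to test the correctness of COW.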
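For readers unfamiliar with the pattern, here is a minimal Copy On Write sketch. It is not the project's strategy package: the tree type, the package-level variable and the function names are all hypothetical, and it assumes the shared value is published through sync/atomic.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// tree is a hypothetical stand-in for the shared strategy tree.
type tree struct {
	rules map[string]int
}

// current always holds a *tree; readers never see a half-updated tree.
var current atomic.Value

// getTree returns the snapshot that is current at the time of the call.
func getTree() *tree { return current.Load().(*tree) }

// updateRule copies the current tree, changes the copy, then publishes it.
// Writers are assumed to be serialized by the caller.
func updateRule(key string, val int) {
	old := getTree()
	next := &tree{rules: make(map[string]int, len(old.rules)+1)}
	for k, v := range old.rules {
		next.rules[k] = v
	}
	next.rules[key] = val
	current.Store(next)
}

func main() {
	current.Store(&tree{rules: map[string]int{"a": 1}})
	snapshot := getTree() // a reader holding the old tree

	var mu sync.Mutex // serializes the writers
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			updateRule(fmt.Sprintf("k%d", n), n)
		}(i)
	}
	wg.Wait()

	// The old snapshot is untouched; new readers see all updates.
	fmt.Println(len(snapshot.rules), len(getTree().rules)) // 1 5
}
```

The point of the design is that a reader that already holds a snapshot keeps a consistent view while writers publish new versions.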

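This part of the code also involves cgo, which is the only blind spot here, because what happens inside a cgo call is hidden from the caller. So the only place that could plausibly produce the segmentation violation is the cgo part.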

cgo code

The line dispatch.OptimalDispatch(dataCenter) contains the cgo call. The OptimalDispatch function looks like this:

func (d *Dispatch) OptimalDispatch(dataCenter *common.DataCenter) {
    // ...... part of the code omitted

    degrade := km.Entrance(orderCarPair, dataCenter)

    if degrade {
        subStart := time.Now()

        km.Greedy(orderCarPair, dataCenter)
        // ...... part of the code omitted
    }

    // ...... part of the code omitted
}

Here km.Entrance(orderCarPair, dataCenter) is what actually calls the C++ code:

func Entrance(Graphy map[string][]common.OrderWithCarInfo, dataCenter *common.DataCenter) (degrade bool) {
    // ...... part of the code omitted

    // This is where the C++ code is called
    result := C.entrance((*C.double)(unsafe.Pointer(&cArray[0])), C.long(max_v_num))

    // ...... part of the code omitted

}

The C++ interface is declared as follows:

long* entrance(double * input_weight, long input_max_v_num);

Go passes a slice to C++, and C++ returns a long array to Go.

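Locating the problem

At the beginning of this article I did not copy the complete error message, because the computation part uses a goroutine pool and the dump is long. Now let's look at the runtime stack part of the error output: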

 === RUN   TestDispatch_OnApolloChanged_V710
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1d8 pc=0xa1328c]

runtime stack:
runtime.throw(0xb9420c, 0x2a)
        /usr/local/lib/go/src/runtime/panic.go:1114 +0x72
runtime.sigpanic()
        /usr/local/lib/go/src/runtime/signal_unix.go:679 +0x46a

goroutine 90 [syscall]:
runtime.cgocall(0xa12360, 0xc00350ebf8, 0xf7a7d668e8941901)
        /usr/local/lib/go/src/runtime/cgocall.go:133 +0x5b fp=0xc00350ebc8 sp=0xc00350eb90 pc=0x4059eb
fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance(0xc0046c2000, 0x64, 0x0)
        _cgo_gotypes.go:48 +0x4e fp=0xc00350ebf8 sp=0xc00350ebc8 pc=0x89e7be
fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km.Entrance(0xc0000a2540, 0xc0035fe500, 0x1313ae0)
        /mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/pkg/service/algorithm/unit/km/km.go:108 +0xaad fp=0xc00350f4a8 sp=0xc00350ebf8 pc=0x89f2cd
fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/optimal.(*Dispatch).OptimalDispatch(0xc00350ff98, 0xc0035fe500)
        /mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/pkg/service/algorithm/optimal/dispatch.go:32 +0x49b fp=0xc00350ff38 sp=0xc00350f4a8 pc=0x8a07eb
fabu.ai/IntelligentTransport/strategy_dispatch/tests/dispatch.TestDispatch_OnApolloChanged_V710.func1(0xc00358f530)
        /mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/tests/dispatch/dispatch_v710_test.go:67 +0x93 fp=0xc00350ffd8 sp=0xc00350ff38 pc=0xa11f73
runtime.goexit()
        /usr/local/lib/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00350ffe0 sp=0xc00350ffd8 pc=0x468e31
created by fabu.ai/IntelligentTransport/strategy_dispatch/tests/dispatch.TestDispatch_OnApolloChanged_V710
        /mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/tests/dispatch/dispatch_v710_test.go:47 +0x2f2

goroutine 1 [chan receive]:
testing.(*T).Run(0xc0035b7c20, 0xb8c69c, 0x21, 0xba8468, 0x48c901)
        /usr/local/lib/go/src/testing/testing.go:1044 +0x37e
testing.runTests.func1(0xc0035b7b00)
        /usr/local/lib/go/src/testing/testing.go:1285 +0x78
testing.tRunner(0xc0035b7b00, 0xc0035f5e10)
        /usr/local/lib/go/src/testing/testing.go:992 +0xdc
testing.runTests(0xc003581960, 0x12c0ec0, 0x4, 0x4, 0x0)
        /usr/local/lib/go/src/testing/testing.go:1283 +0x2a7
testing.(*M).Run(0xc0035ae200, 0x0)
        /usr/local/lib/go/src/testing/testing.go:1200 +0x15f
main.main()
        _testmain.go:54 +0x135

The runtime stack part of the error output shows frames related to the cgo call. Among them, km._Cfunc_entrance(0xc0046c2000, 0x64, 0x0) is intermediate code generated by cgo during the build:

runtime.cgocall(0xa12360, 0xc00350ebf8, 0xf7a7d668e8941901)
        /usr/local/lib/go/src/runtime/cgocall.go:133 +0x5b fp=0xc00350ebc8 sp=0xc00350eb90 pc=0x4059eb
fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance(0xc0046c2000, 0x64, 0x0)

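So we can be confident that the cgo part of the code triggers the problem.

Without concurrency the CGO call works, which suggests that the CGO code itself is fine when called by one goroutine at a time.

Under concurrency the CGO call fails, which may be related to some mechanism in the Go runtime. We therefore need to dig into the runtime and find out at which step of the cgo call the segmentation violation is triggered.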

coredump

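Anyone familiar with C/C++ knows that on Linux a program that hits a memory-related fault can produce a coredump file. Following that line of thought, can a Go program produce a core file too? It can: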

➜  ~ ulimit -c
0
➜  ~ ulimit -c unlimited
➜  ~ ulimit -c          
unlimited

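The default coredump size limit is 0, which disables core files. I set it to unlimited; you can also choose a reasonable size limit.

Then build and run the program so that it produces a coredump file: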

➜  ~ GOTRACEBACK=crash ./strategy_dispatch_test

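Setting the GOTRACEBACK environment variable to crash makes the Go runtime crash the process on a fatal error in a way that lets the operating system write a coredump file.

Because my code was a test case, I first tried setting GOTRACEBACK=crash and then running go test, but that had no effect, so I had to convert the test code into a compilable main program.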
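The same two settings can also be inspected and applied from inside the program. The sketch below assumes Linux; coreDumpsAllowed is a hypothetical helper name. runtime/debug.SetTraceback accepts the same levels as GOTRACEBACK, and syscall.Getrlimit reads the limit that ulimit -c displays.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"syscall"
)

// coreDumpsAllowed reports whether the soft RLIMIT_CORE value permits
// writing a core file; a limit of 0 disables core dumps.
func coreDumpsAllowed(softLimit uint64) bool {
	return softLimit != 0
}

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_CORE, &rl); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Println("core dumps allowed:", coreDumpsAllowed(uint64(rl.Cur)))

	// Request the same traceback level as GOTRACEBACK=crash.
	debug.SetTraceback("crash")
}
```

Whether this helps when running under go test is not something the article establishes; it is shown only as an in-process equivalent of the shell commands above.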

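Analyzing the coredump file

Examining a coredump does not require reproducing the crash again. With the coredump file in hand, we can load it and analyze further.

I used the dlv tool to load the coredump file: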

dlv core ./strategy_dispatch_test core

Then type stack to print the stack trace:

Type 'help' for list of commands.
(dlv) stack
0  0x0000000000466931 in runtime.raise
   at /usr/local/lib/go/src/runtime/sys_linux_amd64.s:165
1  0x00000000004644a2 in runtime.asmcgocall
   at /usr/local/lib/go/src/runtime/asm_amd64.s:640
2  0x000000000040593f in runtime.cgocall
   at /usr/local/lib/go/src/runtime/cgocall.go:143
3  0x000000000087acae in fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance
   at _cgo_gotypes.go:48
4  0x000000000087b7bd in fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km.Entrance
   at /mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/pkg/service/algorithm/unit/km/km.go:108
5  0x000000000087ccdb in fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/optimal.(*Dispatch).OptimalDispatch
   at /mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/pkg/service/algorithm/optimal/dispatch.go:32
6  0x00000000009e7903 in main.main.func1
   at /mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/tests/dispatch/tmp/tmp.go:68
7  0x0000000000464d91 in runtime.goexit
   at /usr/local/lib/go/src/runtime/asm_amd64.s:1373

The stack trace shows that frame 3

3  0x000000000087acae in fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance
   at _cgo_gotypes.go:48

is a call into the intermediate code that cgo generates for the C++ function:

//go:cgo_unsafe_args
func _Cfunc_entrance(p0 *_Ctype_double, p1 _Ctype_long) (r1 *_Ctype_long) {
    _cgo_runtime_cgocall(_cgo_743da1d4b169_Cfunc_entrance, uintptr(unsafe.Pointer(&p0)))
    if _Cgo_always_false { 
        _Cgo_use(p0) 
        _Cgo_use(p1)
    }
    return
}

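This makes it even more certain that CGO is involved. Following the stack trace further, frame 2 points to cgocall.go:143, where the program enters the runtime:

2  0x000000000040593f in runtime.cgocall
   at /usr/local/lib/go/src/runtime/cgocall.go:143

Here is the corresponding runtime code: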

// Call from Go to C.
//
// This must be nosplit because it's used for syscalls on some
// platforms. Syscalls may have untyped arguments on the stack, so
// it's not safe to grow or scan the stack.
//
//go:nosplit
func cgocall(fn, arg unsafe.Pointer) int32 {
    // ... some error handling omitted

    mp := getg().m
    mp.ncgocall++
    mp.ncgo++

    // Reset traceback.
    mp.cgoCallers[0] = 0

    // Announce we are entering a system call
    // so that the scheduler knows to create another
    // M to run goroutines while we are in the
    // foreign code.
    //
    // The call to asmcgocall is guaranteed not to
    // grow the stack and does not allocate memory,
    // so it is safe to call while "in a system call", outside
    // the $GOMAXPROCS accounting.
    //
    // fn may call back into Go code, in which case we'll exit the
    // "system call", run the Go code (which may grow the stack),
    // and then re-enter the "system call" reusing the PC and SP
    // saved by entersyscall here.
    entersyscall()

    // Tell asynchronous preemption that we're entering external
    // code. We do this after entersyscall because this may block
    // and cause an async preemption to fail, but at this point a
    // sync preemption will succeed (though this is not a matter
    // of correctness).
    osPreemptExtEnter(mp)

    mp.incgo = true
    
    // This is line 143
    errno := asmcgocall(fn, arg)

    // ... part of the code omitted


    return errno
}

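The program stopped in cgocall at errno := asmcgocall(fn, arg). That function is implemented in assembly, and the stack trace also gives the location of the corresponding code:

1  0x00000000004644a2 in runtime.asmcgocall
   at /usr/local/lib/go/src/runtime/asm_amd64.s:640

In asm_amd64.s, line 640 falls in the following assembly code: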

// func asmcgocall(fn, arg unsafe.Pointer) int32
// Call fn(arg) on the scheduler stack,
// aligned appropriately for the gcc ABI.
// See cgocall.go for more details.
TEXT ·asmcgocall(SB),NOSPLIT,$0-20
    MOVQ    fn+0(FP), AX
    MOVQ    arg+8(FP), BX

    MOVQ    SP, DX

    // Figure out if we need to switch to m->g0 stack.
    // We get called to create new OS threads too, and those
    // come in on the m->g0 stack already.
    get_tls(CX)
    MOVQ    g(CX), R8
    CMPQ    R8, $0
    JEQ nosave
    MOVQ    g_m(R8), R8
    MOVQ    m_g0(R8), SI
    MOVQ    g(CX), DI
    CMPQ    SI, DI
    JEQ nosave
    MOVQ    m_gsignal(R8), SI
    CMPQ    SI, DI
    JEQ nosave

    // Switch to system stack.
    MOVQ    m_g0(R8), SI
    CALL    gosave<>(SB) // the program crashed here (line 640)
    MOVQ    SI, g(CX)
    MOVQ    (g_sched+gobuf_sp)(SI), SP

    // Now on a scheduling stack (a pthread-created stack).
    // Make sure we have enough room for 4 stack-backed fast-call
    // registers as per windows amd64 calling convention.
    SUBQ    $64, SP
    ANDQ    $~15, SP    // alignment for gcc ABI
    MOVQ    DI, 48(SP)  // save g
    MOVQ    (g_stack+stack_hi)(DI), DI
    SUBQ    DX, DI
    MOVQ    DI, 40(SP)  // save depth in stack (can't just save SP, as stack might be copied during a callback)
    MOVQ    BX, DI      // DI = first argument in AMD64 ABI
    MOVQ    BX, CX      // CX = first argument in Win64
    CALL    AX

    // Restore registers, g, stack pointer.
    get_tls(CX)
    MOVQ    48(SP), DI
    MOVQ    (g_stack+stack_hi)(DI), SI
    SUBQ    40(SP), SI
    MOVQ    DI, g(CX)
    MOVQ    SI, SP

    MOVL    AX, ret+16(FP)
    RET

nosave:
    // Running on a system stack, perhaps even without a g.
    // Having no g can happen during thread creation or thread teardown
    // (see needm/dropm on Solaris, for example).
    // This code is like the above sequence but without saving/restoring g
    // and without worrying about the stack moving out from under us
    // (because we're on a system stack, not a goroutine stack).
    // The above code could be used directly if already on a system stack,
    // but then the only path through this code would be a rare case on Solaris.
    // Using this code for all "already on system stack" calls exercises it more,
    // which should help keep it correct.
    SUBQ    $64, SP
    ANDQ    $~15, SP
    MOVQ    $0, 48(SP)      // where above code stores g, in case someone looks during debugging
    MOVQ    DX, 40(SP)  // save original stack pointer
    MOVQ    BX, DI      // DI = first argument in AMD64 ABI
    MOVQ    BX, CX      // CX = first argument in Win64
    CALL    AX
    MOVQ    40(SP), SI  // restore original stack pointer
    MOVQ    SI, SP
    MOVL    AX, ret+16(FP)
    RET

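Line 640 corresponds to CALL gosave<>(SB). Before analyzing that instruction, let's first look at what the asmcgocall assembly does as a whole (this requires some knowledge of assembly in general and of Go's Plan 9 style assembly in particular).

Analysis of the asmcgocall assembly

The asmcgocall function as a whole performs the cgo call. So what does it do before line 640 (gosave)?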

TEXT ·asmcgocall(SB),NOSPLIT,$0-20
    MOVQ    fn+0(FP), AX
    MOVQ    arg+8(FP), BX

    MOVQ    SP, DX

    get_tls(CX)                 // CX = TLS base, used to find g
    MOVQ    g(CX), R8           // R8 = g
    CMPQ    R8, $0              // if g == nil, goto nosave
    JEQ nosave
    MOVQ    g_m(R8), R8         // R8 = g.m
    MOVQ    m_g0(R8), SI        // SI = g.m.g0
    MOVQ    g(CX), DI           // DI = g
    CMPQ    SI, DI              // if g == g.m.g0, goto nosave
    JEQ nosave
    MOVQ    m_gsignal(R8), SI   // SI = g.m.gsignal
    CMPQ    SI, DI              // if g == g.m.gsignal, goto nosave
    JEQ nosave

The assembly above contains three CMPQ/JEQ pairs, and all of them jump to nosave. So what does nosave do when a comparison succeeds and the JEQ is taken?

nosave:
    // Running on a system stack, perhaps even without a g.
    // Having no g can happen during thread creation or thread teardown
    // (see needm/dropm on Solaris, for example).
    // This code is like the above sequence but without saving/restoring g
    // and without worrying about the stack moving out from under us
    // (because we're on a system stack, not a goroutine stack).
    // The above code could be used directly if already on a system stack,
    // but then the only path through this code would be a rare case on Solaris.
    // Using this code for all "already on system stack" calls exercises it more,
    // which should help keep it correct.
    SUBQ    $64, SP
    ANDQ    $~15, SP
    MOVQ    $0, 48(SP)      // where above code stores g, in case someone looks during debugging
    MOVQ    DX, 40(SP)  // save original stack pointer
    MOVQ    BX, DI      // DI = first argument in AMD64 ABI
    MOVQ    BX, CX      // CX = first argument in Win64
    CALL    AX
    MOVQ    40(SP), SI  // restore original stack pointer
    MOVQ    SI, SP
    MOVL    AX, ret+16(FP)
    RET

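The nosave part is somewhat involved, but in short it means the cgo call can run directly on the system stack that the thread is already using, with no switch away from a goroutine stack.

With that, the earlier code becomes clear:

  • CMPQ R8, $0: there is no current g, so there is no goroutine stack either, and the call can run directly on the system stack.
  • The first CMPQ SI, DI (g == g.m.g0): g0 runs on the system stack, so if the current g is g0 the code is already on the system stack and can call fn directly.
  • The second CMPQ SI, DI (g == g.m.gsignal): gsignal is the goroutine the runtime uses to handle signals, and it also runs on its own system-allocated stack, so this case likewise calls fn directly.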
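The three checks can be restated as a small Go model. This is only a sketch: the g and m types below are heavily simplified, hypothetical versions of the runtime structures, and needsStackSwitch is an invented name that returns false exactly when asmcgocall would jump to nosave.

```go
package main

import "fmt"

// g and m are simplified, hypothetical models of the runtime's goroutine
// and OS-thread structures; only the fields needed here exist.
type g struct {
	name string
	m    *m
}

type m struct {
	g0      *g // scheduler goroutine, runs on the system stack
	gsignal *g // signal-handling goroutine, also on a system stack
}

// needsStackSwitch mirrors the three CMPQ/JEQ checks in asmcgocall.
func needsStackSwitch(cur *g) bool {
	if cur == nil { // CMPQ R8, $0: no current g
		return false
	}
	if cur == cur.m.g0 { // first CMPQ SI, DI: already on g0
		return false
	}
	if cur == cur.m.gsignal { // second CMPQ SI, DI: already on gsignal
		return false
	}
	return true // ordinary goroutine: save state, then switch to g0
}

func main() {
	thread := &m{}
	thread.g0 = &g{name: "g0", m: thread}
	thread.gsignal = &g{name: "gsignal", m: thread}
	user := &g{name: "user", m: thread}

	for _, cur := range []*g{nil, thread.g0, thread.gsignal, user} {
		name := "<nil>"
		if cur != nil {
			name = cur.name
		}
		fmt.Printf("%-8s switch=%v\n", name, needsStackSwitch(cur))
	}
}
```

Only the ordinary user goroutine takes the path through gosave and the stack switch, which is the path the test case in this article was on.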

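So what happens when none of these conditions holds and the call cannot simply run where it is? The second half of asmcgocall gives the answer: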

TEXT ·asmcgocall(SB),NOSPLIT,$0-20
    // first half omitted

    // Switch to system stack.
    MOVQ    m_g0(R8), SI  // SI = g.m.g0
    CALL    gosave<>(SB) // the program crashed here
    MOVQ    SI, g(CX) // current g = g.m.g0
    MOVQ    (g_sched+gobuf_sp)(SI), SP // SP = g0.sched.sp, i.e. switch to the g0 stack

    // Now on a scheduling stack (a pthread-created stack).
    // Make sure we have enough room for 4 stack-backed fast-call
    // registers as per windows amd64 calling convention.
    SUBQ    $64, SP
    ANDQ    $~15, SP    // alignment for gcc ABI
    MOVQ    DI, 48(SP)  // save g
    MOVQ    (g_stack+stack_hi)(DI), DI
    SUBQ    DX, DI
    MOVQ    DI, 40(SP)  // save depth in stack (can't just save SP, as stack might be copied during a callback)
    MOVQ    BX, DI      // DI = first argument in AMD64 ABI
    MOVQ    BX, CX      // CX = first argument in Win64
    CALL    AX

    // Restore registers, g, stack pointer.
    get_tls(CX)
    MOVQ    48(SP), DI
    MOVQ    (g_stack+stack_hi)(DI), SI
    SUBQ    40(SP), SI
    MOVQ    DI, g(CX)
    MOVQ    SI, SP

    MOVL    AX, ret+16(FP)

When none of the conditions holds:

  1. A stack switch takes place. First gosave saves the state of the current goroutine. Here is what gosave does:

    // func gosave(buf *gobuf)
    // save state in Gobuf; setjmp
    TEXT runtime·gosave(SB), NOSPLIT, $0-8
     MOVQ    buf+0(FP), AX       // AX = buf, the gobuf to fill in
     LEAQ    buf+0(FP), BX       // address of the argument, i.e. the caller's SP
     MOVQ    BX, gobuf_sp(AX)    // save the caller's SP: the stack top when it resumes
     MOVQ    0(SP), BX
     MOVQ    BX, gobuf_pc(AX)    // save the caller's PC: the instruction address when it resumes
     MOVQ    $0, gobuf_ret(AX)
     MOVQ    BP, gobuf_bp(AX)
     // Assert ctxt is zero. See func save.
     MOVQ    gobuf_ctxt(AX), BX
     TESTQ   BX, BX
     JZ  2(PC)
     CALL    runtime·badctxt(SB)
     get_tls(CX)                 // CX = TLS base
     MOVQ    g(CX), BX           // BX = address of the current g
     MOVQ    BX, gobuf_g(AX)     // save the address of g
     RET
    

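    gosave stores the scheduling state in the sched field of the current goroutine (g.sched, not g0.sched), setting g.sched.sp and g.sched.pc so that the goroutine can be resumed after the call. The listing above is runtime·gosave(buf); the file-local gosave<> that asmcgocall calls performs the same kind of save directly on the current g.

  2. Switch from the goroutine stack to the system stack (g0).

  3. Execute the cgo call (after gosave).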

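Hypotheses about the cause

Goroutine switching

The analysis of asmcgocall leads to one conclusion: a cgo call switches execution away from the goroutine stack.

The official Go documentation also states: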

calling a C function does not block other goroutines

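Readers familiar with the Go runtime may know that the goroutine implementation relies on TLS (thread-local storage). However goroutines switch on a single thread, they stay within that thread's TLS; once several threads are involved, though, this kind of problem becomes quite possible.

Suppose there are two goroutines, G1 and G2:

  1. G1 is scheduled onto Thread1 and creates the variable cArray on its goroutine stack to pass as an argument to the C call.
  2. G2 is scheduled onto Thread2. If cArray were a global variable, then without CGO the program would merely have a race condition; with a CGO call, however, Thread2 would end up accessing Thread1's stack space, which would produce a segmentation violation.

Since our cArray is created locally inside each goroutine, this scenario can be ruled out.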

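Out-of-bounds TLS access

There is another possibility: G1 and G2 are scheduled onto Thread1 and Thread2. G1 first sets up the address needed to run the CGO call, and G2 then uses the same address for its own CGO call, but that address belongs to Thread1 while G2 is running on Thread2.

In other words, the crash would happen after gosave has performed the stack switch, while executing the CGO call.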

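Verification by debugging

To verify the hypothesis I kept using dlv. Typing grs lists all goroutines, and it shows that Goroutine 71 and Goroutine 72 were indeed executing km._Cfunc_entrance on different threads: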

(dlv) grs
* Goroutine 71 - User: _cgo_gotypes.go:48 fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance (0x87af3e) (thread 11217)
  Goroutine 72 - User: _cgo_gotypes.go:48 fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance (0x87af3e) (thread 11214)

[324 goroutines]

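In that case, if the CPU has only one core, so that only one thread runs at a time, would the problem disappear?

I first tried to limit the CPU cores available to the Go runtime with the following code, but it had no effect; the process still had multiple cores available.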

println(runtime.NumCPU())
runtime.GOMAXPROCS(1)
println(runtime.NumCPU())
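A slightly expanded sketch of the same experiment shows why the printed value cannot change: runtime.NumCPU reports the logical CPUs usable by the process, while runtime.GOMAXPROCS only caps how many threads execute Go code at once. As the cgocall comment quoted earlier notes, a thread inside a cgo call sits outside the GOMAXPROCS accounting, so two cgo calls can still run in parallel. The helper name limitProcs is made up for this example.

```go
package main

import (
	"fmt"
	"runtime"
)

// limitProcs sets GOMAXPROCS to n and returns NumCPU before and after,
// plus the GOMAXPROCS value now in effect.
func limitProcs(n int) (cpuBefore, cpuAfter, procs int) {
	cpuBefore = runtime.NumCPU()
	runtime.GOMAXPROCS(n)
	cpuAfter = runtime.NumCPU()
	procs = runtime.GOMAXPROCS(0) // 0 only queries the current value
	return
}

func main() {
	before, after, procs := limitProcs(1)
	fmt.Println(before == after, procs) // true 1
}
```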

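So I ran the program in a Docker container (a VM works the same way) limited to a single CPU core, and sure enough it ran normally.

This bears out the earlier hypothesis. The exact cause may not be an out-of-bounds address access in CGO (it could be the return value or something else, but there is no need to dig further into the assembly and the runtime). What can already be established is this: when multiple goroutines are scheduled onto multiple threads and execute the CGO call, one thread may end up accessing another thread's TLS, which produces the segmentation violation.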

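Solution

Limiting the CPU cores is not a real fix. The key to solving the problem is that goroutines on different threads must not execute the CGO call concurrently, and a natural way to achieve that is sync.Mutex.

After adding a Lock around the call in the goroutine code, the problem no longer occurred, even without limiting the CPU.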

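That could be the end of the story, but we should still ask ourselves: why does sync.Mutex solve the problem?

A mutex makes threads execute the protected section serially, and Go is no exception. Depending on the mode it is in, Lock in Go's Mutex uses different mechanisms to achieve mutual exclusion. Interested readers can start from these parts: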

  • spin-lock and runtime.procyield, which touches on the pipeline optimization provided by the Intel PAUSE instruction
  • sync_runtime_SemacquireMutex
