System Calls Are Everywhere in Go

What is a system call

In computing, a system call (commonly abbreviated to syscall) is the programmatic way in which a computer program requests a service from the kernel of the operating system on which it is executed. This may include hardware-related services (for example, accessing a hard disk drive), creation and execution of new processes, and communication with integral kernel services such as process scheduling. System calls provide an essential interface between a process and the operating system.

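The above is the Wikipedia definition: a system call is how a program requests a service from the operating system kernel, which typically includes hardware-related services (for example, accessing the hard disk), creating new processes, and so on. System calls provide the interface between a process and the operating system.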

Why syscalls matter

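The kernel provides a standard set of interfaces through which user-space programs interact with kernel space. These interfaces give user-mode programs restricted access to hardware devices, for example to request system resources, read from and write to devices, or create new processes. User space issues the request and kernel space carries it out, so these interfaces are the bridge that both sides understand. The key word here is "restricted": to keep the kernel stable, user-space programs cannot be allowed to modify the system at will. Only interfaces that the kernel exposes may be called, and only by programs with sufficient privileges.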

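Between user space and kernel space sits an intermediate layer called Syscall (system call), the bridge between user mode and kernel mode. This improves kernel security and also helps portability, since only the same set of interfaces has to be implemented. On Linux, for example, user space issues a syscall instruction, which traps into the kernel (traditionally via a software interrupt), so that the program enters kernel mode and performs the requested operation. Every system call has a corresponding system call number.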

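Security and stability: the kernel lives in a protected address space. User-space programs cannot execute kernel code directly or access kernel data; they must go through system calls.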
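
To make the idea of a system call number concrete, here is a minimal sketch that invokes getpid(2) by number through syscall.Syscall and prints it next to os.Getpid. It assumes Linux; the helper name rawGetpid is made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// rawGetpid (hypothetical helper) invokes getpid(2) directly by its
// system call number instead of going through a named wrapper.
func rawGetpid() (int, error) {
	pid, _, errno := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
	if errno != 0 {
		return 0, errno
	}
	return int(pid), nil
}

func main() {
	pid, err := rawGetpid()
	if err != nil {
		fmt.Println("getpid failed:", err)
		return
	}
	// Both values identify the same process.
	fmt.Println("raw syscall pid:", pid, "os.Getpid:", os.Getpid())
}
```

The three zero arguments are unused padding: getpid takes no parameters, but Syscall always passes three.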

How Go implements system calls

The flow of a system call is as follows.

[Figure: system call flow (系统调用.png)]

Entry points

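The source is based on go1.15 and lives in src/syscall/asm_linux_amd64.s. Everything there is implemented in assembly; the comments give the following function signatures: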

func Syscall(trap int64, a1, a2, a3 uintptr) (r1, r2, err uintptr)
func Syscall6(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2, err uintptr)
func RawSyscall(trap, a1, a2, a3 uintptr) (r1, r2, err uintptr)
func RawSyscall6(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2, err uintptr)

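The only difference between Syscall and Syscall6 is the number of arguments. The difference between Syscall and RawSyscall is that the former calls into the runtime when entering and leaving the system call, via CALL runtime·entersyscall(SB) and CALL runtime·exitsyscall(SB), so that the runtime can do its bookkeeping. RawSyscall exists only to save those two runtime calls for system calls that are guaranteed not to block. If RawSyscall is used for a system call that does block, entersyscall is never called, so the current G stays in the running state. The only remedy is for the retake function of sysmon (the system monitor) to notice that the G has been running longer than the threshold (10-20ms) and preempt it by sending a signal.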
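
As a rough illustration of when each entry point fits, the sketch below uses RawSyscall for getppid(2), which never blocks, and Syscall for write(2), which may block. It assumes Linux; getppidRaw and writeFd are names invented for this example, not part of the syscall package.

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// getppidRaw uses RawSyscall: getppid(2) never blocks, so skipping the
// entersyscall/exitsyscall notifications is safe.
func getppidRaw() int {
	r1, _, _ := syscall.RawSyscall(syscall.SYS_GETPPID, 0, 0, 0)
	return int(r1)
}

// writeFd uses Syscall: write(2) can block (full pipe, slow device),
// so the runtime should be told that this G is entering a syscall.
func writeFd(fd int, b []byte) (int, error) {
	if len(b) == 0 {
		return 0, nil
	}
	n, _, errno := syscall.Syscall(syscall.SYS_WRITE,
		uintptr(fd), uintptr(unsafe.Pointer(&b[0])), uintptr(len(b)))
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}

func main() {
	fmt.Println("parent pid:", getppidRaw())
	n, err := writeFd(1, []byte("hello via Syscall\n"))
	fmt.Println("wrote", n, "bytes, err =", err)
}
```

The pointer conversion is written inside the call expression on purpose: that is the form the unsafe package documents as valid for syscall.Syscall arguments.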

Managing system calls

Go defines the following kinds of system calls.
1. Blocking system calls, annotated with a comment such as //sys. The generated code calls Syscall or Syscall6.

// Source: src/syscall/syscall_linux.go

//sys   unlinkat(dirfd int, path string, flags int) (err error)

// Source: src/syscall/zsyscall_linux_amd64.go

func unlinkat(dirfd int, path string, flags int) (err error) {
    var _p0 *byte
    _p0, err = BytePtrFromString(path)
    if err != nil {
        return
    }
    _, _, e1 := Syscall(SYS_UNLINKAT, uintptr(dirfd), uintptr(unsafe.Pointer(_p0)), uintptr(flags))
    if e1 != 0 {
        err = errnoErr(e1)
    }
    return
}

2. Non-blocking system calls, annotated with a comment such as //sysnb. The generated code calls RawSyscall or RawSyscall6.

// Source: src/syscall/syscall_linux.go

//sysnb EpollCreate1(flag int) (fd int, err error)

// Source: src/syscall/zsyscall_linux_amd64.go
func EpollCreate1(flag int) (fd int, err error) {
    r0, _, e1 := RawSyscall(SYS_EPOLL_CREATE1, uintptr(flag), 0, 0)
    fd = int(r0)
    if e1 != 0 {
        err = errnoErr(e1)
    }
    return
}

3. Wrapped system calls: some system call names are hard to remember, so they are wrapped under a friendlier name.

// Source: src/syscall/syscall_linux.go

func Chmod(path string, mode uint32) (err error) {
    return Fchmodat(_AT_FDCWD, path, mode, 0)
}

4. The runtime package also wraps some system calls in assembly for its own use. Whether or not they block, these never call runtime.entersyscall() or runtime.exitsyscall().

// Source: src/runtime/sys_linux_amd64.s

TEXT runtime·write1(SB),NOSPLIT,$0-28
    MOVQ    fd+0(FP), DI
    MOVQ    p+8(FP), SI
    MOVL    n+16(FP), DX
    MOVL    $SYS_write, AX
    SYSCALL
    MOVL    AX, ret+24(FP)
    RET

TEXT runtime·read(SB),NOSPLIT,$0-28
    MOVL    fd+0(FP), DI
    MOVQ    p+8(FP), SI
    MOVL    n+16(FP), DX
    MOVL    $SYS_read, AX
    SYSCALL
    MOVL    AX, ret+24(FP)
    RET
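
The generated //sys wrappers listed above all follow one pattern: convert the arguments to uintptr, call Syscall, and turn a non-zero errno into an error. As a rough sketch, the same pattern can be written by hand. This assumes Linux; unlinkAt, atFDCWD and removeTempFile are names made up for this example, and -100 is the Linux value of AT_FDCWD.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// atFDCWD mirrors Linux's AT_FDCWD (-100): resolve relative paths
// against the current working directory. Assumed value, Linux only.
const atFDCWD = -100

// unlinkAt is a hand-written imitation of the generated unlinkat wrapper.
func unlinkAt(dirfd int, path string, flags int) error {
	// NUL-terminated copy of path; fails if path contains a NUL byte.
	p, err := syscall.BytePtrFromString(path)
	if err != nil {
		return err
	}
	_, _, errno := syscall.Syscall(syscall.SYS_UNLINKAT,
		uintptr(dirfd), uintptr(unsafe.Pointer(p)), uintptr(flags))
	if errno != 0 {
		return errno
	}
	return nil
}

// removeTempFile creates a temporary file, deletes it through unlinkAt,
// and reports whether the file is really gone afterwards.
func removeTempFile() (bool, error) {
	f, err := os.CreateTemp("", "unlinkat-demo-*")
	if err != nil {
		return false, err
	}
	name := f.Name()
	f.Close()
	if err := unlinkAt(atFDCWD, name, 0); err != nil {
		return false, err
	}
	_, statErr := os.Stat(name)
	return os.IsNotExist(statErr), nil
}

func main() {
	gone, err := removeTempFile()
	fmt.Println("removed:", gone, "err:", err)
}
```

Unlike the generated code, this sketch returns the raw Errno instead of going through the package-private errnoErr helper.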

How system calls interact with the scheduling model

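It is quite simple: runtime.entersyscall() is called before the SYSCALL instruction is issued, and runtime.exitsyscall() is called after the system call returns, so that the runtime is notified and can schedule accordingly.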

entersyscall

// Standard syscall entry used by the go syscall library and normal cgo calls.
// This is exported via linkname to assembly in the syscall package.
//
//go:nosplit
//go:linkname entersyscall
func entersyscall() {
    reentersyscall(getcallerpc(), getcallersp())
}

reentersyscall

//go:nosplit
func reentersyscall(pc, sp uintptr) {
    _g_ := getg()

    // Disable preemption
    _g_.m.locks++

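    // entersyscall must not call any function that could grow or split the stack.
    // Setting stackguard0 to stackPreempt makes every stack check trip into runtime.newstack(), which throws because throwsplit is set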
    _g_.stackguard0 = stackPreempt
    _g_.throwsplit = true

    // Save the execution context so it can be resumed after the syscall
    save(pc, sp)
    _g_.syscallsp = sp
    _g_.syscallpc = pc
    // Change the G's status: _Grunning -> _Gsyscall
    casgstatus(_g_, _Grunning, _Gsyscall)
    // Sanity-check the saved stack pointer: it must lie within the G's stack bounds [stack.lo, stack.hi], otherwise throw
    if _g_.syscallsp < _g_.stack.lo || _g_.stack.hi < _g_.syscallsp {
        systemstack(func() {
            print("entersyscall inconsistent ", hex(_g_.syscallsp), " [", hex(_g_.stack.lo), ",", hex(_g_.stack.hi), "]\n")
            throw("entersyscall")
        })
    }

    // Tracing-related; can be ignored
    if trace.enabled {
        systemstack(traceGoSysCall)
        // systemstack itself clobbers g.sched.{pc,sp} and we might
        // need them later when the G is genuinely blocked in a
        // syscall
        save(pc, sp)
    }

    if atomic.Load(&sched.sysmonwait) != 0 {
        systemstack(entersyscall_sysmon)
        save(pc, sp)
    }

    if _g_.m.p.ptr().runSafePointFn != 0 {
        // runSafePointFn may stack split if run on this stack
        systemstack(runSafePointFn)
        save(pc, sp)
    }

    _g_.m.syscalltick = _g_.m.p.ptr().syscalltick
    _g_.sysblocktraced = true
    // Unbind P and M by setting pp.m = 0 and _g_.m.p = 0
    pp := _g_.m.p.ptr()
    pp.m = 0
    // Stash the current P in m.oldp; exitsyscall uses it for the fast path
    _g_.m.oldp.set(pp)
    _g_.m.p = 0
    // Atomically set the P's status to _Psyscall
    atomic.Store(&pp.status, _Psyscall)
    if sched.gcwaiting != 0 {
        systemstack(entersyscall_gcwait)
        save(pc, sp)
    }

    _g_.m.locks--
}

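That is roughly what happens before entering a system call: save the execution context so it can be resumed after the syscall, set the statuses of G and P to _Gsyscall and _Psyscall, and unbind P and M. Note the GMP state here: when Go issues a syscall, the M running that G blocks and is descheduled by the OS, the P does nothing, and sysmon may take up to 10-20ms to notice the blocking. I covered this in an earlier post, "Go语言调度模型G、M、P的数量多少合适？" (roughly: how many Gs, Ms and Ps are appropriate in Go's scheduling model), which shows how sluggish the Go scheduler can be.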
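
The effect of this hand-off can be observed from user code. The sketch below is an informal experiment, not a proof: with GOMAXPROCS set to 1, one goroutine blocks in read(2) on an empty pipe while the main goroutine sleeps and then resumes, which is only possible if the single P was taken away from the blocked M. It assumes Linux; the function name and the 50ms delay are arbitrary choices for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"time"
)

// mainRunsWhileReaderBlocked reports whether the calling goroutine could
// run again while another goroutine was still stuck inside read(2).
func mainRunsWhileReaderBlocked() (bool, error) {
	prev := runtime.GOMAXPROCS(1) // a single P makes the hand-off visible
	defer runtime.GOMAXPROCS(prev)

	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		return false, err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	done := make(chan error, 1)
	go func() {
		buf := make([]byte, 1)
		_, err := syscall.Read(p[0], buf) // blocks until a byte arrives
		done <- err
	}()

	// Give the reader time to enter the syscall and sysmon time to retake the P.
	time.Sleep(50 * time.Millisecond)

	stillBlocked := false
	select {
	case <-done: // unexpected: the read returned although nothing was written
	default:
		stillBlocked = true
	}

	// Unblock the reader and wait for it to finish.
	if _, err := syscall.Write(p[1], []byte{1}); err != nil {
		return false, err
	}
	if stillBlocked {
		if err := <-done; err != nil {
			return false, err
		}
	}
	return stillBlocked, nil
}

func main() {
	ok, err := mainRunsWhileReaderBlocked()
	fmt.Println("main ran while reader was blocked:", ok, "err:", err)
}
```

syscall.Read goes through Syscall, so the reader passes through entersyscall, and its P sits in _Psyscall until sysmon hands it off.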

exitsyscall

//go:nosplit
//go:nowritebarrierrec
//go:linkname exitsyscall
func exitsyscall() {
    _g_ := getg()

    // Disable preemption
    _g_.m.locks++ // see comment in entersyscall
    // Check that the syscall stack frame is still valid
    if getcallersp() > _g_.syscallsp {
        throw("exitsyscall: syscall frame is no longer valid")
    }

    _g_.waitsince = 0
    // Take oldp, which was set on syscall entry, and clear it
    oldp := _g_.m.oldp.ptr()
    _g_.m.oldp = 0
    // Try the fast exit path
    if exitsyscallfast(oldp) {
        if trace.enabled {
            if oldp != _g_.m.p.ptr() || _g_.m.syscalltick != _g_.m.p.ptr().syscalltick {
                systemstack(traceGoStart)
            }
        }
        // There's a cpu for us, so we can run.
        _g_.m.p.ptr().syscalltick++
        // We need to cas the status and scan before resuming... (atomically change the G's status: _Gsyscall -> _Grunning)
        casgstatus(_g_, _Gsyscall, _Grunning)

        // Garbage collector isn't running (since we are),
        // so okay to clear syscallsp.
        _g_.syscallsp = 0
        _g_.m.locks--
        // Restore the G's stack settings; stackguard0 and throwsplit were changed in entersyscall
        if _g_.preempt {
            // restore the preemption request in case we've cleared it in newstack
            _g_.stackguard0 = stackPreempt
        } else {
            // otherwise restore the real _StackGuard, we've spoiled it in entersyscall/entersyscallblock
            _g_.stackguard0 = _g_.stack.lo + _StackGuard
        }
        _g_.throwsplit = false

        if sched.disable.user && !schedEnabled(_g_) {
            // Scheduling of this goroutine is disabled.
            Gosched()
        }

        return
    }

    _g_.sysexitticks = 0
    if trace.enabled {
        // Wait till traceGoSysBlock event is emitted.
        // This ensures consistency of the trace (the goroutine is started after it is blocked).
        for oldp != nil && oldp.syscalltick == _g_.m.syscalltick {
            osyield()
        }
        // We can't trace syscall exit right now because we don't have a P.
        // Tracing code can invoke write barriers that cannot run without a P.
        // So instead we remember the syscall exit time and emit the event
        // in execute when we have a P.
        _g_.sysexitticks = cputicks()
    }

    _g_.m.locks--

    // Call the scheduler. Switch to the g0 stack and run exitsyscall0, which ends up in the scheduling loop
    mcall(exitsyscall0)

    // Scheduler returned, so we're allowed to run now.
    // Delete the syscallsp information that we left for
    // the garbage collector during the system call.
    // Must wait until now because until gosched returns
    // we don't know for sure that the garbage collector
    // is not running.
    _g_.syscallsp = 0
    _g_.m.p.ptr().syscalltick++
    _g_.throwsplit = false
}

//go:nosplit
func exitsyscallfast(oldp *p) bool {
    _g_ := getg()

    // Freezetheworld sets stopwait but does not retake P's.
    if sched.stopwait == freezeStopWait {
        return false
    }

    // Try to re-acquire the last P, i.e. the one that was in use before entering the syscall.
    if oldp != nil && oldp.status == _Psyscall && atomic.Cas(&oldp.status, _Psyscall, _Pidle) {
        // There's a cpu for us, so we can run. The old P is still there (sysmon has not retaken it), so we can run right away.
        // wirep binds M and P and sets the P's status to _Prunning
        wirep(oldp)
        // Bookkeeping; can be ignored
        exitsyscallfast_reacquired()
        return true
    }

    // Try to get any other idle P. The old P could not be reacquired, so look for another idle one
    if sched.pidle != 0 {
        var ok bool
        systemstack(func() {
            // exitsyscallfast_pidle() checks the idle P list; if a P is available it calls acquirep() -> wirep() to bind M and P, and returns true
            ok = exitsyscallfast_pidle()
            if ok && trace.enabled {
                if oldp != nil {
                    // Wait till traceGoSysBlock event is emitted.
                    // This ensures consistency of the trace (the goroutine is started after it is blocked).
                    for oldp.syscalltick == _g_.m.syscalltick {
                        osyield()
                    }
                }
                traceGoSysExit(0)
            }
        })
        if ok {
            return true
        }
    }
    return false
}

// wirep is the first step of acquirep, which actually associates the
// current M to _p_. This is broken out so we can disallow write
// barriers for this part, since we don't yet have a P.
//
//go:nowritebarrierrec
//go:nosplit
func wirep(_p_ *p) {
    _g_ := getg()

    if _g_.m.p != 0 {
        throw("wirep: already in go")
    }
    // Check that the P has no M attached and that it is in the expected state
    if _p_.m != 0 || _p_.status != _Pidle {
        id := int64(0)
        if _p_.m != 0 {
            id = _p_.m.ptr().id
        }
        print("wirep: p->m=", _p_.m, "(", id, ") p->status=", _p_.status, "\n")
        throw("wirep: invalid p state")
    }
    // Bind the P to the M; P and M reference each other
    _g_.m.p.set(_p_)
    _p_.m.set(_g_.m)
    // Update the P's status
    _p_.status = _Prunning
}

//go:nowritebarrierrec
func exitsyscall0(gp *g) {
    _g_ := getg()

    // Change the G's status: _Gsyscall -> _Grunnable
    casgstatus(gp, _Gsyscall, _Grunnable)
    dropg()
    lock(&sched.lock)
    var _p_ *p
    if schedEnabled(_g_) {
        _p_ = pidleget()
    }
    if _p_ == nil {
        // No P available: put the G on the global run queue and wait to be scheduled
        globrunqput(gp)
    } else if atomic.Load(&sched.sysmonwait) != 0 {
        atomic.Store(&sched.sysmonwait, 0)
        notewakeup(&sched.sysmonnote)
    }
    unlock(&sched.lock)
    if _p_ != nil {
        // Got a P: use it to run gp right away, then continue in the scheduling loop
        acquirep(_p_)
        execute(gp, false) // Never returns.
    }
    if _g_.m.lockedg != 0 {
        // Wait until another thread schedules gp and so m again.
        // Special handling for a G that called LockOSThread
        stoplockedm()
        execute(gp, false) // Never returns.
    }
    stopm()      // Park the M on the idle list until there is new work
    schedule() // Never returns.
}

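Exiting a system call is straightforward: it tries every way to find a P to run the code that follows the syscall. If no P is available at all, it sets the G's status to _Grunnable and puts it on the global run queue to wait for scheduling, then calls stopm() to park the M on the idle list until there is new work.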
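
The order in which a P is looked for can be condensed into a toy model. The sketch below is not runtime code: toyP, toySched and exitSyscall are hypothetical names, and locking, tracing, locked threads and everything else are left out. It keeps only the three outcomes described above.

```go
package main

import "fmt"

type pStatus int

const (
	pIdle pStatus = iota
	pRunning
	pSyscall
)

// toyP and toySched are simplified stand-ins for the runtime's p and sched.
type toyP struct {
	id     int
	status pStatus
}

type toySched struct {
	idle       []*toyP  // idle P list
	globalRunq []string // names of Gs waiting on the global run queue
}

// exitSyscall mimics the decision order of exitsyscall and returns
// which path the goroutine g took.
func (s *toySched) exitSyscall(g string, oldp *toyP) string {
	// Fast path 1: the old P is still in the syscall state, i.e. not retaken.
	if oldp != nil && oldp.status == pSyscall {
		oldp.status = pRunning
		return "reacquired old P"
	}
	// Fast path 2: some other P is idle.
	if n := len(s.idle); n > 0 {
		p := s.idle[n-1]
		s.idle = s.idle[:n-1]
		p.status = pRunning
		return "took idle P"
	}
	// Slow path: no P at all, so the G waits on the global run queue.
	s.globalRunq = append(s.globalRunq, g)
	return "queued globally"
}

func main() {
	s := &toySched{idle: []*toyP{{id: 1, status: pIdle}}}
	fmt.Println(s.exitSyscall("g1", &toyP{id: 0, status: pSyscall}))
	retaken := &toyP{id: 2, status: pRunning} // sysmon gave this P to another M
	fmt.Println(s.exitSyscall("g2", retaken))
	fmt.Println(s.exitSyscall("g3", retaken))
}
```

In the real runtime the first check is an atomic compare-and-swap on the P's status, because sysmon may be retaking the same P concurrently.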

entersyscallblock

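This is the same as entersyscall(), except that the syscall is already known to block. Instead of waiting for sysmon to retake the P, it calls entersyscallblock_handoff -> handoffp(releasep()) and hands the P over right away.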

// The same as entersyscall(), but with a hint that the syscall is blocking.
//go:nosplit
func entersyscallblock() {
    ...

    systemstack(entersyscallblock_handoff)

    ...
}

func entersyscallblock_handoff() {
    if trace.enabled {
        traceGoSysCall()
        traceGoSysBlock(getg().m.p.ptr())
    }
    handoffp(releasep())
}

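To summarize: the system calls provided by the syscall package stay in touch with the runtime through entersyscall and exitsyscall, which lets the scheduling model do its job. The syscalls that the runtime package implements for itself keep a privilege: while they run, the P is not taken away, which ensures that the syscalls Go uses at its own "low level" can be handled immediately after they return.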

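So even for the same epoll_wait, the one the runtime uses cannot be interrupted by others, whereas the syscall.EpollWait that you call obviously has no such privilege.
These are personal study notes, kept so I can review them later. If anything is wrong, corrections in the comments are welcome!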
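
For reference, this is roughly what the user-level side looks like: a minimal sketch that registers the read end of a pipe with epoll, writes one byte, and waits for the readiness event through syscall.EpollWait. It assumes Linux; waitReadable is a name made up for this example, and the 1000ms timeout is arbitrary.

```go
package main

import (
	"fmt"
	"syscall"
)

// waitReadable returns the fd that epoll reported as readable and the
// pipe's read fd, which should be the same.
func waitReadable() (int, int, error) {
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		return 0, 0, err
	}
	defer syscall.Close(epfd)

	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		return 0, 0, err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	// Watch the read end of the pipe for incoming data.
	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(p[0])}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, p[0], &ev); err != nil {
		return 0, 0, err
	}
	if _, err := syscall.Write(p[1], []byte{'x'}); err != nil {
		return 0, 0, err
	}

	events := make([]syscall.EpollEvent, 1)
	for {
		n, err := syscall.EpollWait(epfd, events, 1000)
		if err == syscall.EINTR {
			continue // interrupted by a signal: just retry
		}
		if err != nil {
			return 0, 0, err
		}
		if n == 0 {
			return 0, p[0], fmt.Errorf("epoll wait timed out")
		}
		return int(events[0].Fd), p[0], nil
	}
}

func main() {
	ready, want, err := waitReadable()
	fmt.Println("ready fd:", ready, "pipe read fd:", want, "err:", err)
}
```
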

References

Wikipedia: System call
Linux system call (syscall) internals
System calls