Go Scheduler Source Code Analysis

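1. Basic concepts: processes, threads, and coroutines

A process can contain multiple threads. Each thread usually gets a fixed-size memory block (2MB in this description) as its stack, which holds the local variables of the functions that are currently called or suspended. The operating system scheduler switches between threads: if the next thread belongs to the current process, only a thread switch is needed, which completes relatively quickly; if it belongs to another process, a process switch is needed as well, which takes longer.

Threads are divided into kernel-level threads and user-level threads. A user-level thread must be bound to a kernel-level thread. The kernel is not aware of user-level threads; it only knows that it is running one thread, and that thread is a kernel-level thread.

User-level threads also go by the name coroutine (co-routine). To keep the two apart, "coroutine" below means a user-level thread and "thread" means a kernel-level thread.

Coroutines differ from threads. Threads are scheduled preemptively by the operating system, while coroutines are scheduled cooperatively in user space: the next coroutine runs only after the current one has yielded the CPU.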

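There are three ways to bind coroutines to threads:

N:1, N coroutines bound to 1 thread. The advantage is that switching between coroutines happens entirely in user space without trapping into the kernel, so it is very lightweight and fast. The drawbacks are serious, though. All coroutines of the process are bound to a single thread, so first, the program cannot use multiple cores, and second, as soon as one coroutine blocks, the thread blocks and no other coroutine in the process can run, which leaves no real concurrency.

1:1, 1 coroutine bound to 1 thread. This is the easiest to implement. Scheduling is done entirely by the operating system, so the N:1 drawbacks do not apply, but creating, destroying, and switching coroutines all go through the kernel, which is comparatively expensive.

M:N, M coroutines bound to N threads. This combines the N:1 and 1:1 models and overcomes the drawbacks of both, but it is the most complex to implement.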

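2. Introduction to Golang

2.1 The goroutine concept

A thread switch has to save and restore a large context and consumes a lot of CPU time. For that reason Go's unit of concurrency is not a traditional thread but a lighter-weight coroutine, the goroutine. This greatly raises the achievable degree of concurrency, which is why Go is sometimes called "the most concurrent language".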

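2.2 Comparison with other concurrency models

Interpreted languages such as Python often rely on a multi-process concurrency model. A process has the largest context, so switching is very expensive. In addition, processes have to communicate through sockets or similar channels, or through explicitly configured shared memory, which makes programming considerably harder.

Languages such as C++ usually use a multi-threaded concurrency model. A thread's context is much smaller than a process's, and threads share memory by default, so programming is easier. Starting, destroying, and switching threads still costs a lot of CPU time, which led to thread pools: a fixed number of threads is kept around to avoid the cost of frequently creating and closing threads. This basic technique has its own problems. For example, a thread that stays blocked on IO keeps occupying its slot, so queued tasks cannot get a thread to run on.

Go's concurrency model is more elaborate. Go replaces threads with a lighter data structure that has its own stack and switches faster. The actual execution still happens on threads: the scheduler places goroutines on threads, and creates and releases threads when appropriate. When a running goroutine blocks (typically while waiting for IO), it is detached from the thread it occupied, and another goroutine that is ready to run is placed on that thread. Through this fairly complex scheduling the system reaches a very high degree of concurrency without consuming large amounts of CPU.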
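The point that a blocked goroutine does not hold on to a thread can be sketched with a small program. This is only an illustration under assumptions: the helper name spawnBlocked and the count of 10000 are made up for the example, and the program merely shows that many goroutines can sit blocked on a channel at once.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// spawnBlocked (hypothetical helper) starts n goroutines that all block on
// the same channel. It returns the goroutine count observed while they are
// blocked, plus a function that releases them and waits for them to finish.
func spawnBlocked(n int) (parked int, release func()) {
	gate := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-gate // blocks this goroutine until gate is closed
		}()
	}
	started.Wait() // every goroutine has started and none has exited yet
	parked = runtime.NumGoroutine()
	release = func() {
		close(gate)
		done.Wait()
	}
	return parked, release
}

func main() {
	parked, release := spawnBlocked(10000)
	fmt.Println("goroutines alive while blocked:", parked)
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	release()
}
```

The goroutine count is far larger than GOMAXPROCS, which is consistent with goroutines being multiplexed onto a small set of threads.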

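2.3 Characteristics of goroutines

Non-blocking. Goroutines were introduced to make highly concurrent programs easy to write. When a goroutine performs a blocking operation such as a system call, the other goroutines queued on the current thread are handed over to another thread to continue running, so the program as a whole does not block.

Scheduler. Golang has garbage collection (gc), and gc requires goroutines to be stopped while it runs. Because Go implements its own scheduler, this is easy to arrange. Building a concurrent program from many goroutines combines the advantages of asynchronous IO with the convenience of writing multi-threaded or multi-process code.

Self-managed stacks. Introducing goroutines also introduces a great deal of complexity. A goroutine has to carry the code to execute as well as the stack used to execute it, together with the PC and SP pointers (the original notes PC = current execution position + 8, which holds on architectures such as ARM). The stack pointers must keep the program's state intact in every mode of execution.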

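Since every goroutine has its own stack, a stack has to be created whenever a goroutine is created, and that stack keeps growing while the goroutine runs. Stacks normally grow contiguously. Because all threads of a process share one virtual address space, each thread needs a stack at a different starting address, which means the size of every stack has to be estimated before it is allocated. With a very large number of threads, stack overflow becomes likely.

The Split Stacks technique addresses this: only a small block of memory is allocated when the stack is created, and when a function call runs out of stack space, a new block is allocated elsewhere. The new block does not have to be contiguous with the old one. The call's arguments are copied to the new block and the following calls execute there. Golang's stack management is similar, but for better efficiency it uses contiguous stacks: a fixed-size stack is allocated first, and when it runs out, a larger stack is allocated and the entire old stack is copied into it. This avoids the frequent allocation and release of memory that Split Stacks can cause.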
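A minimal sketch of what stack growth looks like from user code: a goroutine starts on a small stack, yet deep recursion works because the runtime grows the stack on demand. The function name depth, the 64-byte padding, and the recursion depth are assumptions chosen for the illustration.

```go
package main

import "fmt"

// depth (hypothetical helper) recurses n levels. Each frame holds a small
// buffer so that the goroutine's stack must grow far beyond its initial size.
func depth(n int) int {
	var pad [64]byte
	pad[0] = byte(n)
	if n == 0 {
		return int(pad[0]) // 0 at the bottom of the recursion
	}
	return depth(n-1) + 1
}

func main() {
	done := make(chan int)
	// The new goroutine starts on a small stack; the recursion needs
	// several megabytes, so the runtime has to grow the stack as it goes.
	go func() { done <- depth(100000) }()
	fmt.Println(<-done) // prints 100000
}
```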

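Goroutine execution can be preempted. If a goroutine occupies the CPU for a long time without being rescheduled, the runtime preempts it and hands the CPU time to other goroutines. This can be achieved through debug/goroutine blocking.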

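2.4 Structs

M: a worker thread in Go, the unit that actually executes code;

P: a context for scheduling goroutines. Goroutines depend on a P to be scheduled, and P is the real unit of parallelism;

G: a goroutine, a piece of Go code in the form of a function, the smallest unit of concurrency.

A P must be bound to an M to run, and an M must hold a P to run Go code. Normally there are at most GOMAXPROCS Ps (usually equal to the number of CPUs), while there may be many Ms. Only Ps that are bound to an M actually run, which is why P is the real unit of parallelism.

Each P has its own queue of runnable Gs from which it takes a G to run, and there is also a global runnable G queue. A G runs on an M by way of a P. The global queue is not used on its own because distributed queues shrink the critical section: if many threads requested runnable Gs at the same time and only a global resource existed, the global lock would keep many of them waiting.

If a running G blocks, typically while waiting for IO, it and its M wait there, and the context P is handed to another available M, so the blocking does not reduce the program's parallelism.

G struct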

    type g struct {
        // Stack parameters.
        // stack describes the actual stack memory: [stack.lo, stack.hi).
        // stackguard0 is the stack pointer compared in the Go stack growth prologue.
        // It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
        // stackguard1 is the stack pointer compared in the C stack growth prologue.
        // It is stack.lo+StackGuard on g0 and gsignal stacks.
        // It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
        stack       stack   // offset known to runtime/cgo; the actual stack memory, with lower and upper bounds
        stackguard0 uintptr // offset known to liblink
        stackguard1 uintptr // offset known to liblink

        _panic         *_panic        // innermost panic - offset known to liblink
        _defer         *_defer        // innermost defer
        m              *m             // current m; offset known to arm liblink
        sched          gobuf          // saves g's context when goroutines are switched
        syscallsp      uintptr        // if status==Gsyscall, syscallsp = sched.sp to use during gc
        syscallpc      uintptr        // if status==Gsyscall, syscallpc = sched.pc to use during gc
        stktopsp       uintptr        // expected sp at top of stack, to check in traceback
        param          unsafe.Pointer // passed parameter on wakeup; another goroutine can set it while this one sleeps
        atomicstatus   uint32
        stackLock      uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
        goid           int64  // goroutine ID
        waitsince      int64  // approx time when the g become blocked
        waitreason     string // if status==Gwaiting
        schedlink      guintptr
        preempt        bool     // preemption signal, duplicates stackguard0 = stackpreempt
        paniconfault   bool     // panic (instead of crash) on unexpected fault address
        preemptscan    bool     // preempted g does scan for gc
        gcscandone     bool     // g has scanned stack; protected by _Gscan bit in status
        gcscanvalid    bool     // false at start of gc cycle, true if G has not run since last scan; TODO: remove?
        throwsplit     bool     // must not split stack
        raceignore     int8     // ignore race detection events
        sysblocktraced bool     // StartTrace has emitted EvGoInSyscall about this goroutine
        sysexitticks   int64    // cputicks when syscall has returned (for tracing)
        traceseq       uint64   // trace event sequencer
        tracelastp     puintptr // last P emitted an event for this goroutine
        lockedm        muintptr // the G is locked and may only run on this M
        sig            uint32
        writebuf       []byte
        sigcode0       uintptr
        sigcode1       uintptr
        sigpc          uintptr
        gopc           uintptr // pc of go statement that created this goroutine
        startpc        uintptr // pc of goroutine function
        racectx        uintptr
        waiting        *sudog         // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
        cgoCtxt        []uintptr      // cgo traceback context
        labels         unsafe.Pointer // profiler labels
        timer          *timer         // cached timer for time.Sleep
        selectDone     uint32         // are we participating in a select and did someone win the race?

        // Per-G GC state

        // gcAssistBytes is this G's GC assist credit in terms of
        // bytes allocated. If this is positive, then the G has credit
        // to allocate gcAssistBytes bytes without assisting. If this
        // is negative, then the G must correct this by performing
        // scan work. We track this in bytes to make it fast to update
        // and check for debt in the malloc hot path. The assist ratio
        // determines how this corresponds to scan work debt.
        gcAssistBytes int64
    }

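The most important field of g is sched, which holds the goroutine's context. Unlike a thread switch, where the OS takes care of this data, a goroutine switch saves it in a gobuf object, which keeps the switch lightweight. The layout of gobuf is as follows.

Gobuf struct

    type gobuf struct {
        sp   uintptr
        pc   uintptr
        g    guintptr
        ctxt unsafe.Pointer
        ret  sys.Uintreg
        lr   uintptr
        bp   uintptr // for GOEXPERIMENT=framepointer
    }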

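M struct

    type m struct {
        g0          *g          // goroutine with the scheduling stack
        gsignal     *g          // signal-handling goroutine
        tls         [6]uintptr  // thread-local storage
        mstartfn    func()
        curg        *g          // currently running goroutine
        caughtsig   guintptr
        p           puintptr    // attached p for executing go code
        nextp       puintptr
        id          int32
        mallocing   int32       // status
        spinning    bool        // m is out of work and looking for work
        blocked     bool        // m is blocked
        inwb        bool        // m is executing a write barrier
        printlock   int8
        incgo       bool        // m is executing a cgo call
        fastrand    uint32
        ncgocall    uint64      // total number of cgo calls
        ncgo        int32       // number of cgo calls currently in progress
        park        note
        alllink     *m          // links into allm
        schedlink   muintptr
        mcache      *mcache     // memory cache of the current m
        lockedg     *g          // g locked to run on this m and not moved to another m
        createstack [32]uintptr // stack that created this thread
    }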

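Two of the G fields in struct M deserve attention:

One is curg, the G that this M is currently bound to.

The other is g0, the goroutine that carries the scheduling stack. It is a special goroutine. An ordinary goroutine has a growable stack allocated on the heap, whereas g0's stack is the stack of the thread that belongs to the M. All scheduling code first switches to g0's stack and then runs there. In other words, even the thread's own stack is represented by a g inside the runtime rather than being used directly as an OS stack.

P struct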

    type p struct {
        lock mutex

        id          int32
        status      uint32 // one of pidle/prunning/...
        link        puintptr
        schedtick   uint32   // incremented on every scheduling round
        syscalltick uint32   // incremented on every system call
        sysmontick  sysmontick
        m           muintptr // back-link to the associated m
        mcache      *mcache
        racectx     uintptr

        goidcache    uint64 // cache of goroutine IDs
        goidcacheend uint64

        // queue of runnable goroutines
        runqhead uint32
        runqtail uint32
        runq     [256]guintptr
        runnext  guintptr // next g to run

        sudogcache []*sudog
        sudogbuf   [128]*sudog

        palloc persistentAlloc // per-P to avoid mutex

        pad [sys.CacheLineSize]byte
    }

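The states of a P are Pidle, Prunning, Psyscall, Pgcstop, and Pdead. Its local queue runq (indexed by runqhead and runqtail) holds runnable goroutines. A P prefers to take the next g from this local queue, which improves efficiency.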
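The local queue can be pictured as a fixed-size ring buffer addressed by a head and a tail counter. The sketch below is a single-threaded toy under assumptions: toyP, put, and get are hypothetical names, goroutine IDs stand in for guintptr values, and the atomic operations the real runtime needs for concurrent stealing are left out.

```go
package main

import "fmt"

const runqCap = 256 // same capacity as the runq array in struct p

// toyP is a hypothetical, single-threaded stand-in for the queue fields of p.
type toyP struct {
	runqhead uint32
	runqtail uint32
	runq     [runqCap]int // goroutine IDs instead of guintptr
}

// put appends g at the tail. It reports false when the queue is full; this
// toy simply rejects the G in that case.
func (p *toyP) put(g int) bool {
	if p.runqtail-p.runqhead == runqCap {
		return false
	}
	p.runq[p.runqtail%runqCap] = g
	p.runqtail++
	return true
}

// get removes the oldest G from the head, or reports false when empty.
func (p *toyP) get() (int, bool) {
	if p.runqhead == p.runqtail {
		return 0, false
	}
	g := p.runq[p.runqhead%runqCap]
	p.runqhead++
	return g, true
}

func main() {
	p := &toyP{}
	for id := 1; id <= 3; id++ {
		p.put(id)
	}
	g, _ := p.get()
	fmt.Println("first runnable G:", g) // prints first runnable G: 1
}
```

Because head and tail only ever increase and are reduced modulo the capacity on access, the queue length is always tail minus head, even after the counters wrap around.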

Schedt struct

    type schedt struct {
        goidgen  uint64
        lastpoll uint64

        lock mutex

        midle        muintptr // idle m's
        nmidle       int32    // number of idle m's
        nmidlelocked int32    // number of locked m's
        mcount       int32    // total number of m's created
        maxmcount    int32    // maximum number of m's allowed

        ngsys uint32 // number of system goroutines; updated atomically

        pidle      puintptr // idle p's
        npidle     uint32
        nmspinning uint32

        // global queue of runnable g's
        runqhead guintptr
        runqtail guintptr
        runqsize int32

        // global cache of dead G's
        gflock       mutex
        gfreeStack   *g
        gfreeNoStack *g
        ngfree       int32

        // central cache of sudog structs
        sudoglock  mutex
        sudogcache *sudog
    }

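Most of the information the scheduler needs already lives in the M, G, and P structs, so schedt is little more than a shell. It holds the idle M list, the idle P list, and a global queue of runnable Gs. The lock field in schedt is essential: when an M or P performs a non-local operation, it generally has to lock the scheduler first.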

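2.5 Key functions

The goroutine scheduler lives in /src/runtime/proc.go. Some of the key functions are analyzed below.

2.5.1 The schedule function

schedule runs whenever the runtime needs to make a scheduling decision. It finds a runnable G for the current P and executes it. Leaving aside the trace reader and GC worker checks visible in the code, the lookup order is:

1) Once every 61 scheduling rounds (schedtick%61 == 0), if the global runnable queue is not empty, take one G from it first. This ensures fairness; otherwise two goroutines could completely occupy the local run queue by constantly respawning each other;

2) Otherwise call runqget to take a runnable G from the P's own runnable G queue;

3) If 2) fails, call findrunnable, which blocks until a runnable G is available.

Once a G has been found, schedule calls execute on it and never returns.

The code is as follows: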

    // One round of scheduler: find a runnable goroutine and execute it.
    // Never returns.
    func schedule() {
        _g_ := getg()

        if _g_.m.locks != 0 {
            throw("schedule: holding locks")
        }

        if _g_.m.lockedg != 0 {
            stoplockedm()
            execute(_g_.m.lockedg.ptr(), false) // Never returns.
        }

        // We should not schedule away from a g that is executing a cgo call,
        // since the cgo call is using the m's g0 stack.
        if _g_.m.incgo {
            throw("schedule: in cgo")
        }

    top:
        if sched.gcwaiting != 0 {
            gcstopm()
            goto top
        }
        if _g_.m.p.ptr().runSafePointFn != 0 {
            runSafePointFn()
        }

        var gp *g
        var inheritTime bool
        if trace.enabled || trace.shutdown {
            gp = traceReader()
            if gp != nil {
                casgstatus(gp, _Gwaiting, _Grunnable)
                traceGoUnpark(gp, 0)
            }
        }
        if gp == nil && gcBlackenEnabled != 0 {
            gp = gcController.findRunnableGCWorker(_g_.m.p.ptr())
        }
        if gp == nil {
            // Check the global runnable queue once in a while to ensure fairness.
            // Otherwise two goroutines can completely occupy the local runqueue
            // by constantly respawning each other.
            if _g_.m.p.ptr().schedtick%61 == 0 && sched.runqsize > 0 {
                lock(&sched.lock)
                gp = globrunqget(_g_.m.p.ptr(), 1)
                unlock(&sched.lock)
            }
        }
        if gp == nil {
            gp, inheritTime = runqget(_g_.m.p.ptr())
            if gp != nil && _g_.m.spinning {
                throw("schedule: spinning with local work")
            }
        }
        if gp == nil {
            gp, inheritTime = findrunnable() // blocks until work is available
        }

        // This thread is going to run a goroutine and is not spinning anymore,
        // so if it was marked as spinning we need to reset it now and potentially
        // start a new spinning M.
        if _g_.m.spinning {
            resetspinning()
        }

        if gp.lockedm != 0 {
            // Hands off own p to the locked m,
            // then blocks waiting for a new p.
            startlockedm(gp)
            goto top
        }

        execute(gp, inheritTime)
    }
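The effect of the schedtick%61 check can be sketched with a toy model. Everything here is hypothetical (toySched, next, string names for Gs); it only mirrors the lookup order in the listing, with the final global-queue fallback standing in for findrunnable.

```go
package main

import "fmt"

// toySched is a hypothetical model of one P plus the global queue.
type toySched struct {
	schedtick uint32
	local     []string
	global    []string
}

// pop removes and returns the first element of q.
func pop(q *[]string) string {
	g := (*q)[0]
	*q = (*q)[1:]
	return g
}

// next follows the lookup order of schedule: on every 61st tick the global
// queue is consulted first; otherwise the local queue wins and the global
// queue is only a fallback. The tick advances only when a G is found.
func (s *toySched) next() (string, bool) {
	var g string
	switch {
	case s.schedtick%61 == 0 && len(s.global) > 0:
		g = pop(&s.global)
	case len(s.local) > 0:
		g = pop(&s.local)
	case len(s.global) > 0: // stands in for findrunnable
		g = pop(&s.global)
	default:
		return "", false
	}
	s.schedtick++
	return g, true
}

func main() {
	s := &toySched{schedtick: 1, global: []string{"global-G"}}
	for i := 0; i < 100; i++ {
		s.local = append(s.local, fmt.Sprintf("local-%d", i))
	}
	for {
		tick := s.schedtick
		g, ok := s.next()
		if !ok {
			break
		}
		if g == "global-G" {
			fmt.Println("global G picked at tick", tick)
		}
	}
}
```

Without the periodic check, the G in the global queue would wait until the local queue had drained completely.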

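2.5.2 The findrunnable function

findrunnable finds a runnable G for a P. It searches in the following order:

1) Call runqget to take a runnable G from the P's own runnable G queue;

2) If 1) fails, call globrunqget to take a runnable G from the global runnable G queue;

3) If 2) fails, call netpoll (non-blocking) to pick up a G that is ready after asynchronous IO;

4) If 3) fails, try to steal half of the Gs from another P;

5) If 4) fails, call globrunqget again to take a runnable G from the global runnable G queue;

6) If 5) fails, call netpoll (blocking) to pick up a G that is ready after asynchronous IO;

7) If 6) still yields no G, call stopm to stop this M.

The code is as follows: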

    // Finds a runnable goroutine to execute.
    // Tries to steal from other P's, get g from global queue, poll network.
    func findrunnable() (gp *g, inheritTime bool) {
        _g_ := getg()

        // The conditions here and in handoffp must agree: if
        // findrunnable would return a G to run, handoffp must start
        // an M.

    top:
        _p_ := _g_.m.p.ptr()
        if sched.gcwaiting != 0 {
            gcstopm()
            goto top
        }
        if _p_.runSafePointFn != 0 {
            runSafePointFn()
        }
        if fingwait && fingwake {
            if gp := wakefing(); gp != nil {
                ready(gp, 0, true)
            }
        }
        if *cgo_yield != nil {
            asmcgocall(*cgo_yield, nil)
        }

        // local runq
        if gp, inheritTime := runqget(_p_); gp != nil {
            return gp, inheritTime
        }

        // global runq
        if sched.runqsize != 0 {
            lock(&sched.lock)
            gp := globrunqget(_p_, 0)
            unlock(&sched.lock)
            if gp != nil {
                return gp, false
            }
        }

        // Poll network.
        // This netpoll is only an optimization before we resort to stealing.
        // We can safely skip it if there are no waiters or a thread is blocked
        // in netpoll already. If there is any kind of logical race with that
        // blocked thread (e.g. it has already returned from netpoll, but does
        // not set lastpoll yet), this thread will do blocking netpoll below
        // anyway.
        if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Load64(&sched.lastpoll) != 0 {
            if gp := netpoll(false); gp != nil { // non-blocking
                // netpoll returns list of goroutines linked by schedlink.
                injectglist(gp.schedlink.ptr())
                casgstatus(gp, _Gwaiting, _Grunnable)
                if trace.enabled {
                    traceGoUnpark(gp, 0)
                }
                return gp, false
            }
        }

        // Steal work from other P's.
        procs := uint32(gomaxprocs)
        if atomic.Load(&sched.npidle) == procs-1 {
            // Either GOMAXPROCS=1 or everybody, except for us, is idle already.
            // New work can appear from returning syscall/cgocall, network or timers.
            // Neither of that submits to local run queues, so no point in stealing.
            goto stop
        }
        // If number of spinning M's >= number of busy P's, block.
        // This is necessary to prevent excessive CPU consumption
        // when GOMAXPROCS>>1 but the program parallelism is low.
        if !_g_.m.spinning && 2*atomic.Load(&sched.nmspinning) >= procs-atomic.Load(&sched.npidle) {
            goto stop
        }
        if !_g_.m.spinning {
            _g_.m.spinning = true
            atomic.Xadd(&sched.nmspinning, 1)
        }
        for i := 0; i < 4; i++ {
            for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() {
                if sched.gcwaiting != 0 {
                    goto top
                }
                stealRunNextG := i > 2 // first look for ready queues with more than 1 g
                if gp := runqsteal(_p_, allp[enum.position()], stealRunNextG); gp != nil {
                    return gp, false
                }
            }
        }

    stop:

        // We have nothing to do. If we're in the GC mark phase, can
        // safely scan and blacken objects, and have work to do, run
        // idle-time marking rather than give up the P.
        if gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) {
            _p_.gcMarkWorkerMode = gcMarkWorkerIdleMode
            gp := _p_.gcBgMarkWorker.ptr()
            casgstatus(gp, _Gwaiting, _Grunnable)
            if trace.enabled {
                traceGoUnpark(gp, 0)
            }
            return gp, false
        }

        // Before we drop our P, make a snapshot of the allp slice,
        // which can change underfoot once we no longer block
        // safe-points. We don't need to snapshot the contents because
        // everything up to cap(allp) is immutable.
        allpSnapshot := allp

        // return P and block
        lock(&sched.lock)
        if sched.gcwaiting != 0 || _p_.runSafePointFn != 0 {
            unlock(&sched.lock)
            goto top
        }
        if sched.runqsize != 0 {
            gp := globrunqget(_p_, 0)
            unlock(&sched.lock)
            return gp, false
        }
        if releasep() != _p_ {
            throw("findrunnable: wrong p")
        }
        pidleput(_p_)
        unlock(&sched.lock)

        // Delicate dance: thread transitions from spinning to non-spinning state,
        // potentially concurrently with submission of new goroutines. We must
        // drop nmspinning first and then check all per-P queues again (with
        // #StoreLoad memory barrier in between). If we do it the other way around,
        // another thread can submit a goroutine after we've checked all run queues
        // but before we drop nmspinning; as the result nobody will unpark a thread
        // to run the goroutine.
        // If we discover new work below, we need to restore m.spinning as a signal
        // for resetspinning to unpark a new worker thread (because there can be more
        // than one starving goroutine). However, if after discovering new work
        // we also observe no idle Ps, it is OK to just park the current thread:
        // the system is fully loaded so no spinning threads are required.
        // Also see "Worker thread parking/unparking" comment at the top of the file.
        wasSpinning := _g_.m.spinning
        if _g_.m.spinning {
            _g_.m.spinning = false
            if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 {
                throw("findrunnable: negative nmspinning")
            }
        }

        // check all runqueues once again
        for _, _p_ := range allpSnapshot {
            if !runqempty(_p_) {
                lock(&sched.lock)
                _p_ = pidleget()
                unlock(&sched.lock)
                if _p_ != nil {
                    acquirep(_p_)
                    if wasSpinning {
                        _g_.m.spinning = true
                        atomic.Xadd(&sched.nmspinning, 1)
                    }
                    goto top
                }
                break
            }
        }

        // Check for idle-priority GC work again.
        if gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) {
            lock(&sched.lock)
            _p_ = pidleget()
            if _p_ != nil && _p_.gcBgMarkWorker == 0 {
                pidleput(_p_)
                _p_ = nil
            }
            unlock(&sched.lock)
            if _p_ != nil {
                acquirep(_p_)
                if wasSpinning {
                    _g_.m.spinning = true
                    atomic.Xadd(&sched.nmspinning, 1)
                }
                // Go back to idle GC check.
                goto stop
            }
        }

        // poll network
        if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Xchg64(&sched.lastpoll, 0) != 0 {
            if _g_.m.p != 0 {
                throw("findrunnable: netpoll with p")
            }
            if _g_.m.spinning {
                throw("findrunnable: netpoll with spinning")
            }
            gp := netpoll(true) // block until new work is available
            atomic.Store64(&sched.lastpoll, uint64(nanotime()))
            if gp != nil {
                lock(&sched.lock)
                _p_ = pidleget()
                unlock(&sched.lock)
                if _p_ != nil {
                    acquirep(_p_)
                    injectglist(gp.schedlink.ptr())
                    casgstatus(gp, _Gwaiting, _Grunnable)
                    if trace.enabled {
                        traceGoUnpark(gp, 0)
                    }
                    return gp, false
                }
                injectglist(gp)
            }
        }
        stopm()
        goto top
    }
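Step 4), stealing half of another P's queue, can be sketched on plain slices. This is a simplified illustration under assumptions: stealHalf is a hypothetical name, the queues are slices of goroutine IDs, the half is rounded up, and the last stolen G is returned to run immediately while the rest are queued on the thief. The real runqsteal works on the lock-free ring buffer of the P.

```go
package main

import "fmt"

// stealHalf (hypothetical helper) moves half of victim's queue, rounded up,
// to the thief. The last stolen G is returned so the thief can run it right
// away; the others are appended to the thief's queue.
func stealHalf(thief, victim *[]int) (int, bool) {
	n := len(*victim)
	n = n - n/2 // half of the queue, rounded up
	if n == 0 {
		return 0, false
	}
	stolen := (*victim)[:n]
	run := stolen[n-1]
	*thief = append(*thief, stolen[:n-1]...)
	*victim = (*victim)[n:]
	return run, true
}

func main() {
	victim := []int{1, 2, 3, 4, 5}
	var thief []int
	g, ok := stealHalf(&thief, &victim)
	fmt.Println(g, ok, thief, victim) // prints 3 true [1 2] [4 5]
}
```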

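2.5.3 The newproc function

newproc creates a runnable G and puts it on the current P's runnable G queue. It is the call that the compiler actually emits for a statement like "go func() { … }". The core logic is in newproc1, which proceeds as follows:

1) Get the P of the current G, then take a G from the free G list;

2) If 1) returned a G, configure it with the new parameters; otherwise create a new G;

3) Add the G to the P's runnable G queue.

The code is as follows: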

    // In Go 1.10.8 the default stack size is 2KB.
    _StackMin = 2048

    // Creates a g object and puts it on the g queue,
    // where it waits to be executed.
    // Create a new g running fn with narg bytes of arguments starting
    // at argp. callerpc is the address of the go statement that created
    // this. The new g is put on the queue of g's waiting to run.
    func newproc1(fn *funcval, argp *uint8, narg int32, callerpc uintptr) {
        _g_ := getg()

        if fn == nil {
            _g_.m.throwing = -1 // do not dump full stacks
            throw("go of nil func value")
        }
        _g_.m.locks++ // disable preemption because it can be holding p in a local var
        siz := narg
        siz = (siz + 7) &^ 7

        // We could allocate a larger initial stack if necessary.
        // Not worth it: this is almost always an error.
        // 4*sizeof(uintreg): extra space added below
        // sizeof(uintreg): caller's LR (arm) or return address (x86, in gostartcall).
        if siz >= _StackMin-4*sys.RegSize-sys.RegSize {
            throw("newproc: function arguments too large for new goroutine")
        }

        _p_ := _g_.m.p.ptr()
        newg := gfget(_p_)
        if newg == nil {
            newg = malg(_StackMin)
            casgstatus(newg, _Gidle, _Gdead)
            allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
        }
        if newg.stack.hi == 0 {
            throw("newproc1: newg missing stack")
        }

        if readgstatus(newg) != _Gdead {
            throw("newproc1: new g is not Gdead")
        }

        totalSize := 4*sys.RegSize + uintptr(siz) + sys.MinFrameSize // extra space in case of reads slightly beyond frame
        totalSize += -totalSize & (sys.SpAlign - 1)                  // align to spAlign
        sp := newg.stack.hi - totalSize
        spArg := sp
        if usesLR {
            // caller's LR
            *(*uintptr)(unsafe.Pointer(sp)) = 0
            prepGoExitFrame(sp)
            spArg += sys.MinFrameSize
        }
        if narg > 0 {
            memmove(unsafe.Pointer(spArg), unsafe.Pointer(argp), uintptr(narg))
            // This is a stack-to-stack copy. If write barriers
            // are enabled and the source stack is grey (the
            // destination is always black), then perform a
            // barrier copy. We do this *after* the memmove
            // because the destination stack may have garbage on
            // it.
            if writeBarrier.needed && !_g_.m.curg.gcscandone {
                f := findfunc(fn.fn)
                stkmap := (*stackmap)(funcdata(f, _FUNCDATA_ArgsPointerMaps))
                // We're in the prologue, so it's always stack map index 0.
                bv := stackmapdata(stkmap, 0)
                bulkBarrierBitmap(spArg, spArg, uintptr(narg), 0, bv.bytedata)
            }
        }

        memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched))
        newg.sched.sp = sp
        newg.stktopsp = sp
        newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
        newg.sched.g = guintptr(unsafe.Pointer(newg))
        gostartcallfn(&newg.sched, fn)
        newg.gopc = callerpc
        newg.startpc = fn.fn
        if _g_.m.curg != nil {
            newg.labels = _g_.m.curg.labels
        }
        if isSystemGoroutine(newg) {
            atomic.Xadd(&sched.ngsys, +1)
        }
        newg.gcscanvalid = false
        casgstatus(newg, _Gdead, _Grunnable)

        if _p_.goidcache == _p_.goidcacheend {
            // Sched.goidgen is the last allocated id,
            // this batch must be [sched.goidgen+1, sched.goidgen+GoidCacheBatch].
            // At startup sched.goidgen=0, so main goroutine receives goid=1.
            _p_.goidcache = atomic.Xadd64(&sched.goidgen, _GoidCacheBatch)
            _p_.goidcache -= _GoidCacheBatch - 1
            _p_.goidcacheend = _p_.goidcache + _GoidCacheBatch
        }
        newg.goid = int64(_p_.goidcache)
        _p_.goidcache++
        if raceenabled {
            newg.racectx = racegostart(callerpc)
        }
        if trace.enabled {
            traceGoCreate(newg, newg.startpc)
        }
        runqput(_p_, newg, true)

        if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 && mainStarted {
            wakep()
        }
        _g_.m.locks--
        if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack
            _g_.stackguard0 = stackPreempt
        }
    }
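The reuse of dead Gs through the free list (gfget in newproc1, and gfput when a G exits) can be sketched as a toy model. All names here (toyG, toyP, newproc, schedule, goexit) are hypothetical simplifications; statuses are plain strings and the 2048-byte slice only stands in for a stack of size _StackMin.

```go
package main

import "fmt"

// toyG and toyP are hypothetical stand-ins for the runtime's g and p.
type toyG struct {
	id     int
	status string
	stack  []byte
}

type toyP struct {
	gfree  []*toyG // dead Gs kept for reuse
	runq   []*toyG // runnable Gs
	nextID int
}

// newproc takes a G from the free list if one exists, otherwise allocates a
// new one with a 2KB stack, then marks it runnable and queues it.
func (p *toyP) newproc() *toyG {
	var g *toyG
	if n := len(p.gfree); n > 0 {
		g = p.gfree[n-1]
		p.gfree = p.gfree[:n-1]
	} else {
		g = &toyG{stack: make([]byte, 2048)}
	}
	p.nextID++
	g.id = p.nextID
	g.status = "runnable"
	p.runq = append(p.runq, g)
	return g
}

// schedule pops the next runnable G, or returns nil if there is none.
func (p *toyP) schedule() *toyG {
	if len(p.runq) == 0 {
		return nil
	}
	g := p.runq[0]
	p.runq = p.runq[1:]
	g.status = "running"
	return g
}

// goexit marks a finished G dead and puts it on the free list.
func (p *toyP) goexit(g *toyG) {
	g.status = "dead"
	p.gfree = append(p.gfree, g)
}

func main() {
	p := &toyP{}
	first := p.newproc()
	p.goexit(p.schedule())
	second := p.newproc()
	fmt.Println("same G object reused:", first == second) // prints same G object reused: true
}
```

Reusing the object means the stack allocated for the first goroutine is also available to the second one, which is the saving the free list is there to provide.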

2.5.4 The goexit0 function

goexit0 is called when a G exits. It resets the G's fields, puts the G on the free G list for later reuse, and then calls schedule to run the next G.

    // goexit continuation on g0.
    func goexit0(gp *g) {
        _g_ := getg()

        // change g's status from _Grunning to _Gdead
        casgstatus(gp, _Grunning, _Gdead)
        if isSystemGoroutine(gp) {
            atomic.Xadd(&sched.ngsys, -1)
        }
        // reset the g's fields, mostly to nil / 0
        gp.m = nil
        locked := gp.lockedm != 0
        gp.lockedm = 0
        _g_.m.lockedg = 0
        gp.paniconfault = false
        gp._defer = nil // should be true already but just in case.
        gp._panic = nil // non-nil for Goexit during panic. points at stack-allocated data.
        gp.writebuf = nil
        gp.waitreason = ""
        gp.param = nil
        gp.labels = nil
        gp.timer = nil

        if gcBlackenEnabled != 0 && gp.gcAssistBytes > 0 {
            // Flush assist credit to the global pool. This gives
            // better information to pacing if the application is
            // rapidly creating an exiting goroutines.
            scanCredit := int64(gcController.assistWorkPerByte * float64(gp.gcAssistBytes))
            atomic.Xaddint64(&gcController.bgScanCredit, scanCredit)
            gp.gcAssistBytes = 0
        }

        // Note that gp's stack scan is now "valid" because it has no
        // stack.
        gp.gcscanvalid = true
        dropg()

        if _g_.m.lockedInt != 0 {
            print("invalid m->lockedInt = ", _g_.m.lockedInt, "\n")
            throw("internal lockOSThread error")
        }
        _g_.m.lockedExt = 0
        // push this g onto the free G list
        gfput(_g_.m.p.ptr(), gp)
        if locked {
            // The goroutine may have locked this thread because
            // it put it in an unusual kernel state. Kill it
            // rather than returning it to the thread pool.

            // Return to mstart, which will release the P and exit
            // the thread.
            if GOOS != "plan9" { // See golang.org/issue/22227.
                gogo(&_g_.m.g0.sched)
            }
        }
        schedule()
    }
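From user code, the exit path can be entered explicitly with runtime.Goexit, which terminates the calling goroutine after running its deferred calls. The sketch below is an illustration only; runAndExit and the recorded strings are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runAndExit (hypothetical helper) starts a goroutine that terminates itself
// with runtime.Goexit and records which steps actually ran.
func runAndExit() []string {
	var events []string
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() { events = append(events, "deferred") }()
		events = append(events, "before")
		runtime.Goexit() // runs the deferred calls, then ends this goroutine
		events = append(events, "after") // never reached
	}()
	wg.Wait()
	return events
}

func main() {
	fmt.Println(runAndExit()) // prints [before deferred]
}
```

The program keeps running after the goroutine has exited because only that goroutine ends; the thread underneath goes on to schedule other work.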

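2.5.5 The handoffp function

handoffp hands a P off from an M that is in a system call or otherwise blocked. If the P still has runnable Gs queued, it calls startm to start an M for the P, and that M does not spin.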

    // Hands off P from syscall or locked M.
    // Always runs without a P, so write barriers are not allowed.
    //go:nowritebarrierrec
    func handoffp(_p_ *p) {
        // handoffp must start an M in any situation where
        // findrunnable would return a G to run on _p_.

        // if this P's local queue or the global queue is not empty, startm without spinning
        if !runqempty(_p_) || sched.runqsize != 0 {
            startm(_p_, false)
            return
        }
        // if there is GC work to do, same as above
        if gcBlackenEnabled != 0 && gcMarkWorkAvailable(_p_) {
            startm(_p_, false)
            return
        }
        // no work left: check that nothing is spinning or idle,
        // otherwise our help is not required
        if atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) == 0 && atomic.Cas(&sched.nmspinning, 0, 1) { // TODO: fast atomic
            startm(_p_, true)
            return
        }
        // lock the scheduler and take this P out of service
        lock(&sched.lock)
        if sched.gcwaiting != 0 {
            _p_.status = _Pgcstop
            sched.stopwait--
            if sched.stopwait == 0 {
                notewakeup(&sched.stopnote)
            }
            unlock(&sched.lock)
            return
        }
        if _p_.runSafePointFn != 0 && atomic.Cas(&_p_.runSafePointFn, 1, 0) {
            sched.safePointFn(_p_)
            sched.safePointWait--
            if sched.safePointWait == 0 {
                notewakeup(&sched.safePointNote)
            }
        }
        if sched.runqsize != 0 {
            unlock(&sched.lock)
            startm(_p_, false)
            return
        }
        // If this is the last running P and nobody is polling network,
        // need to wakeup another M to poll network.
        if sched.npidle == uint32(gomaxprocs-1) && atomic.Load64(&sched.lastpoll) != 0 {
            unlock(&sched.lock)
            startm(_p_, false)
            return
        }
        pidleput(_p_)
        unlock(&sched.lock)
    }

2.5.6 The startm function

startm schedules an M to run the given P, creating a new M if necessary.

    // Schedules some M to run the p (creates an M if necessary).
    // If p==nil, tries to get an idle P, if no idle P's does nothing.
    // May run with m.p==nil, so write barriers are not allowed.
    // If spinning is set, the caller has incremented nmspinning and startm will
    // either decrement nmspinning or set m.spinning in the newly started M.
    //go:nowritebarrierrec
    func startm(_p_ *p, spinning bool) {
        // take the scheduler lock
        lock(&sched.lock)
        if _p_ == nil {
            _p_ = pidleget()
            if _p_ == nil {
                unlock(&sched.lock)
                if spinning {
                    // The caller incremented nmspinning, but there are no idle Ps,
                    // so it's okay to just undo the increment and give up.
                    if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 {
                        throw("startm: negative nmspinning")
                    }
                }
                return
            }
        }
        mp := mget()
        unlock(&sched.lock)
        if mp == nil {
            var fn func()
            if spinning {
                // The caller incremented nmspinning, so set m.spinning in the new M.
                fn = mspinning
            }
            newm(fn, _p_)
            return
        }
        if mp.spinning {
            throw("startm: m is spinning")
        }
        if mp.nextp != 0 {
            throw("startm: m has p")
        }
        if spinning && !runqempty(_p_) {
            throw("startm: p has runnable gs")
        }
        // The caller incremented nmspinning, so set m.spinning in the new M.
        mp.spinning = spinning
        mp.nextp.set(_p_)
        notewakeup(&mp.park)
    }

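2.5.7 The sysmon function

sysmon is started when the Go runtime starts. It monitors the state of all goroutines, decides whether a GC has to be forced, performs netpoll, and so on. sysmon calls retake to carry out preemptive scheduling.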

    // Always runs without a P, so write barriers are not allowed.
    //
    //go:nowritebarrierrec
    func sysmon() {
        lock(&sched.lock)
        sched.nmsys++
        checkdead()
        unlock(&sched.lock)

        // If a heap span goes unused for 5 minutes after a garbage collection,
        // we hand it back to the operating system.
        scavengelimit := int64(5 * 60 * 1e9)

        if debug.scavenge > 0 {
            // Scavenge-a-lot for testing.
            forcegcperiod = 10 * 1e6
            scavengelimit = 20 * 1e6
        }

        lastscavenge := nanotime()
        nscavenge := 0

        lasttrace := int64(0)
        idle := 0 // how many cycles in succession we had not wokeup somebody
        delay := uint32(0)
        for {
            if idle == 0 { // start with 20us sleep...
                delay = 20
            } else if idle > 50 { // start doubling the sleep after 1ms...
                delay *= 2
            }
            if delay > 10*1000 { // up to 10ms
                delay = 10 * 1000
            }
            usleep(delay)
            if debug.schedtrace <= 0 && (sched.gcwaiting != 0 || atomic.Load(&sched.npidle) == uint32(gomaxprocs)) {
                lock(&sched.lock)
                if atomic.Load(&sched.gcwaiting) != 0 || atomic.Load(&sched.npidle) == uint32(gomaxprocs) {
                    atomic.Store(&sched.sysmonwait, 1)
                    unlock(&sched.lock)
                    // Make wake-up period small enough
                    // for the sampling to be correct.
                    maxsleep := forcegcperiod / 2
                    if scavengelimit < forcegcperiod {
                        maxsleep = scavengelimit / 2
                    }
                    shouldRelax := true
                    if osRelaxMinNS > 0 {
                        next := timeSleepUntil()
                        now := nanotime()
                        if next-now < osRelaxMinNS {
                            shouldRelax = false
                        }
                    }
                    if shouldRelax {
                        osRelax(true)
                    }
                    notetsleep(&sched.sysmonnote, maxsleep)
                    if shouldRelax {
                        osRelax(false)
                    }
                    lock(&sched.lock)
                    atomic.Store(&sched.sysmonwait, 0)
                    noteclear(&sched.sysmonnote)
                    idle = 0
                    delay = 20
                }
                unlock(&sched.lock)
            }
            // trigger libc interceptors if needed
            if *cgo_yield != nil {
                asmcgocall(*cgo_yield, nil)
            }
            // poll network if not polled for more than 10ms
            lastpoll := int64(atomic.Load64(&sched.lastpoll))
            now := nanotime()
            if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
                atomic.Cas64(&sched.lastpoll, uint64(lastpoll), uint64(now))
                gp := netpoll(false) // non-blocking - returns list of goroutines
                if gp != nil {
                    // Need to decrement number of idle locked M's
                    // (pretending that one more is running) before injectglist.
                    // Otherwise it can lead to the following situation:
                    // injectglist grabs all P's but before it starts M's to run the P's,
                    // another M returns from syscall, finishes running its G,
                    // observes that there is no work to do and no other running M's
                    // and reports deadlock.
                    incidlelocked(-1)
                    injectglist(gp)
                    incidlelocked(1)
                }
            }
            // retake P's blocked in syscalls
            // and preempt long running G's
            if retake(now) != 0 {
                idle = 0
            } else {
                idle++
            }
            // check if we need to force a GC
            if t := (gcTrigger{kind: gcTriggerTime, now: now}); t.test() && atomic.Load(&forcegc.idle) != 0 {
                lock(&forcegc.lock)
                forcegc.idle = 0
                forcegc.g.schedlink = 0
                injectglist(forcegc.g)
                unlock(&forcegc.lock)
            }
            // scavenge heap once in a while
            if lastscavenge+scavengelimit/2 < now {
                mheap_.scavenge(int32(nscavenge), uint64(now), uint64(scavengelimit))
                lastscavenge = now
                nscavenge++
            }
            if debug.schedtrace > 0 && lasttrace+int64(debug.schedtrace)*1000000 <= now {
                lasttrace = now
                schedtrace(debug.scheddetail > 0)
            }
        }
    }
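The sleep schedule at the top of the sysmon loop can be isolated into a small function to see how the delay develops. The name nextDelay is hypothetical; the constants (20us start, doubling once idle exceeds 50, 10ms cap) are taken from the listing.

```go
package main

import "fmt"

// nextDelay (hypothetical helper) returns the sleep time in microseconds for
// the next sysmon iteration, given the number of consecutive idle cycles and
// the previous delay.
func nextDelay(idle int, delay uint32) uint32 {
	if idle == 0 {
		delay = 20 // start with a 20us sleep
	} else if idle > 50 {
		delay *= 2 // after about 1ms of idle cycles, start doubling
	}
	if delay > 10*1000 {
		delay = 10 * 1000 // never sleep longer than 10ms
	}
	return delay
}

func main() {
	delay := uint32(0)
	for idle := 0; idle < 60; idle++ {
		delay = nextDelay(idle, delay)
		if idle == 0 || idle >= 50 {
			fmt.Printf("idle=%d delay=%dus\n", idle, delay)
		}
	}
}
```

As long as retake finds something to do, idle is reset to 0 and sysmon keeps polling every 20us; on an idle system the interval backs off to the 10ms cap within a few iterations after the threshold.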

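2.5.8 The retake function

retake iterates over all Ps. If a P is in a system call (_Psyscall) and has stayed there for one sysmon cycle (20us to 10ms), retake takes the P away and calls handoffp to break the association between the M and the P. If a P is running (_Prunning), a sysmon cycle has passed, and its G has been running for longer than forcePreemptNS (10ms), retake preempts the P and sets g.preempt = true and g.stackguard0 = stackPreempt.

Why does setting stackguard0 achieve preemption?

Because this value is used to check whether the current stack has enough space. The prologue of a Go function compares against it to decide whether the stack needs to grow.

When newstack sees that g.stackguard0 equals stackPreempt, it knows that the call was triggered by a preemption request, and it then checks once more whether the G should be preempted.

The preemption mechanism ensures that no single G runs for so long that other Gs cannot run.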

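    func retake(now int64) uint32 {
        n := 0
        // Prevent allp slice changes. This lock will be completely
        // uncontended unless we're already stopping the world.
        lock(&allpLock)
        // We can't use a range loop over allp because we may
        // temporarily drop the allpLock. Hence, we need to re-fetch
        // allp each time around the loop.
        // NOTE: the lines from the loop condition down to the "> 0 &&" test were
        // lost in the source text and are restored here on the assumption that
        // they match the Go 1.10 runtime.
        for i := 0; i < len(allp); i++ {
            _p_ := allp[i]
            if _p_ == nil {
                // This can happen if procresize has grown
                // allp but not yet created new Ps.
                continue
            }
            pd := &_p_.sysmontick
            s := _p_.status
            if s == _Psyscall {
                // Retake P from syscall if it's there for more than 1 sysmon tick (at least 20us).
                t := int64(_p_.syscalltick)
                if int64(pd.syscalltick) != t {
                    pd.syscalltick = uint32(t)
                    pd.syscallwhen = now
                    continue
                }
                // On the one hand we don't want to retake Ps if there is no other work to do,
                // but on the other hand we want to retake them eventually
                // because they can prevent the sysmon thread from deep sleep.
                if runqempty(_p_) && atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) > 0 && pd.syscallwhen+10*1000*1000 > now {
                    continue
                }
                // Drop allpLock so we can take sched.lock.
                unlock(&allpLock)
                // Need to decrement number of idle locked M's
                // (pretending that one more is running) before the CAS.
                // Otherwise the M from which we retake can exit the syscall,
                // increment nmidle and report deadlock.
                incidlelocked(-1)
                if atomic.Cas(&_p_.status, s, _Pidle) {
                    if trace.enabled {
                        traceGoSysBlock(_p_)
                        traceProcStop(_p_)
                    }
                    n++
                    _p_.syscalltick++
                    handoffp(_p_)
                }
                incidlelocked(1)
                lock(&allpLock)
            } else if s == _Prunning {
                // Preempt G if it's running for too long.
                t := int64(_p_.schedtick)
                if int64(pd.schedtick) != t {
                    pd.schedtick = uint32(t)
                    pd.schedwhen = now
                    continue
                }
                if pd.schedwhen+forcePreemptNS > now {
                    continue
                }
                preemptone(_p_)
            }
        }
        unlock(&allpLock)
        return uint32(n)
    }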
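The _Prunning branch decides on preemption by comparing tick counters across sysmon visits. A minimal sketch of that decision, with hypothetical types (toyP, toyTick) and a hypothetical function shouldPreempt that returns the decision instead of calling preemptone:

```go
package main

import "fmt"

const forcePreemptNS = 10 * 1000 * 1000 // 10ms, as in the runtime

// toyTick mirrors the scheduling fields of sysmontick: what sysmon saw last.
type toyTick struct {
	schedtick uint32
	schedwhen int64
}

// toyP is a hypothetical P with only the fields this decision needs.
type toyP struct {
	schedtick uint32 // incremented every time the P schedules a G
	tick      toyTick
}

// shouldPreempt follows the _Prunning branch of retake. If the P has
// scheduled since the last visit, only the observation is recorded. If the
// same G has been running for at least forcePreemptNS, it reports true.
func shouldPreempt(p *toyP, now int64) bool {
	if p.tick.schedtick != p.schedtick {
		p.tick.schedtick = p.schedtick
		p.tick.schedwhen = now
		return false
	}
	if p.tick.schedwhen+forcePreemptNS > now {
		return false
	}
	return true
}

func main() {
	p := &toyP{schedtick: 1}
	const ms = 1000 * 1000
	for _, now := range []int64{0, 5 * ms, 10 * ms} {
		fmt.Printf("t=%dms preempt=%v\n", now/ms, shouldPreempt(p, now))
	}
}
```

A P whose schedtick keeps changing is never preempted this way, because every change restarts the 10ms window.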

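3. Scheduler summary

3.1 Two major ideas of the scheduler

Thread reuse: coroutines run on top of a set of threads, so threads are reused rather than frequently created and destroyed. The scheduler reuses threads in two further ways: 1) work stealing, where a thread with no runnable G tries to steal Gs from the P bound to another thread instead of being destroyed; 2) handoff, where a thread that blocks because its G makes a system call releases its P and passes the P to another idle thread.

Exploiting parallelism: GOMAXPROCS sets the number of Ps. When GOMAXPROCS is greater than 1, at most GOMAXPROCS threads are running Go code at a time, and these threads may be spread over several CPU cores, so concurrency turns into parallelism. GOMAXPROCS also limits the degree of parallelism: with GOMAXPROCS = number of cores / 2, at most half of the CPU cores are used in parallel.

3.2 Two minor strategies of the scheduler

Preemption: with coroutines, the next coroutine runs only after the current one has voluntarily yielded the CPU. In Go, a goroutine that has been running for more than 10ms is marked for preemption so that other goroutines do not starve. In the runtime version analyzed here, the request takes effect at the stack check in the prologue of the next function call. This is one way in which goroutines differ from coroutines.

Global G queue: the new scheduler still has a global G queue, but its role has been reduced. When an M performing work stealing cannot steal a G from any other P, it can take a G from the global G queue.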
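The number of Ps can be inspected and changed at run time with runtime.GOMAXPROCS, which sets the value and returns the previous one (an argument below 1 only queries). The helper names procs and withProcs below are hypothetical and serve only this illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// procs returns the current GOMAXPROCS value without changing it.
func procs() int {
	return runtime.GOMAXPROCS(0)
}

// withProcs (hypothetical helper) runs fn with GOMAXPROCS set to n and
// restores the previous setting afterwards. It returns the previous value.
func withProcs(n int, fn func()) (prev int) {
	prev = runtime.GOMAXPROCS(n)
	defer runtime.GOMAXPROCS(prev)
	fn()
	return prev
}

func main() {
	fmt.Println("CPUs:", runtime.NumCPU(), "GOMAXPROCS:", procs())
	withProcs(1, func() {
		// With a single P, goroutines still interleave but never run
		// Go code on two threads at the same time.
		fmt.Println("inside withProcs, GOMAXPROCS:", procs())
	})
	fmt.Println("restored GOMAXPROCS:", procs())
}
```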

