Kubernetes 1.12.0 Kube-controller-manager: Reading the node-lifecycle-controller Source Code

Preface

The Kube-controller-manager component ultimately starts many controllers. The node-related ones are node-ipam-controller and node-lifecycle-controller (note: versions before Kubernetes 1.10 only had a single node-controller; starting with 1.10 it was split into node-ipam-controller and node-lifecycle-controller). For a walkthrough of node-ipam-controller, see my earlier post on the node-ipam-controller source code; this article walks through the source code of node-lifecycle-controller.

The startNodeLifecycleController function

startNodeLifecycleController is the entry point through which Kube-controller-manager starts node-lifecycle-controller. As you can see, the function has only two steps:

  • Call lifecyclecontroller.NewNodeLifecycleController to create the node-lifecycle-controller instance
  • Call lifecycleController.Run to run node-lifecycle-controller
k8s.io/kubernetes/cmd/kube-controller-manager/app/core.go:122

func startNodeLifecycleController(ctx ControllerContext) (http.Handler, bool, error) {
   lifecycleController, err := lifecyclecontroller.NewNodeLifecycleController(
      ctx.InformerFactory.Core().V1().Pods(),
      ctx.InformerFactory.Core().V1().Nodes(),
      ctx.InformerFactory.Extensions().V1beta1().DaemonSets(),
      ctx.Cloud,
      ctx.ClientBuilder.ClientOrDie("node-controller"),
      ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration,
      ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration,
      ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration,
      ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration,
      ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate,
      ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate,
      ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold,
      ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold,
      ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager,
      utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),
      utilfeature.DefaultFeatureGate.Enabled(features.TaintNodesByCondition),
   )
   if err != nil {
      return nil, true, err
   }
   go lifecycleController.Run(ctx.Stop)
   return nil, true, nil
}

Definition of Node-Lifecycle-Controller

Let's first look at the definition of the node-lifecycle-controller struct, focusing on the fields that map to Kube-controller-manager configuration:

  • nodeMonitorPeriod: set with --node-monitor-period, default 5s; the period at which the controller syncs NodeStatus
  • nodeStartupGracePeriod: set with --node-startup-grace-period, default 60s; how long the controller allows a newly started node to be unresponsive before marking it unhealthy
  • nodeMonitorGracePeriod: set with --node-monitor-grace-period, default 40s; how long the controller allows a running node to be unresponsive before marking it unhealthy. It must be N times the kubelet's nodeStatusUpdateFrequency, where N is the number of retries the kubelet is allowed for posting its node status
  • podEvictionTimeout: set with --pod-eviction-timeout, default 5m0s; the grace period before deleting pods on failed nodes
  • evictionLimiterQPS: set with --node-eviction-rate, default 0.1; the number of nodes per second on which pods are deleted when nodes fail while the zone is healthy. The default of 0.1 means at most one failed node is drained every 10 seconds (see the rate-limiter illustration after this list)
  • secondaryEvictionLimiterQPS: set with --secondary-node-eviction-rate, default 0.01; the number of nodes per second on which pods are deleted when nodes fail while the zone is unhealthy, i.e. at most one node drained every 100 seconds. If the cluster size is not larger than --large-cluster-size-threshold, this value is implicitly overridden to 0
  • largeClusterThreshold: set with --large-cluster-size-threshold, default 50; the number of nodes above which NodeLifecycleController treats the cluster as large, used only for the eviction logic. For clusters of this size or smaller, --secondary-node-eviction-rate is implicitly overridden to 0
  • unhealthyZoneThreshold: set with --unhealthy-zone-threshold, default 0.55; the fraction of Not Ready nodes at which a zone is treated as unhealthy
  • runTaintManager: set with --enable-taint-manager, default true; Beta feature. If true, NoExecute taints are enabled and all pods on a node that do not tolerate its NoExecute taints are evicted
  • useTaintBasedEvictions: set via the TaintBasedEvictions feature gate, default false; Alpha feature. If true, pods are evicted by tainting nodes instead of being evicted directly
  • taintNodeByCondition: set via the TaintNodesByCondition feature gate, default true; Beta feature. If true, NodeLifecycleController taints nodes based on their conditions (i.e. 'NetworkUnavailable', 'MemoryPressure', 'OutOfDisk' and 'DiskPressure')

Note: the default values of the feature gates can be found in k8s.io/kubernetes/pkg/features/kube_features.go.
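
To get a feel for what these fractional QPS values mean in practice, here is a small illustration of my own (not controller code): it drives a client-go token-bucket rate limiter, the same kind of limiter the zone eviction queues use via q.limiter.TryAccept() in the Try listing later, at the default --node-eviction-rate of 0.1. One token becomes available roughly every 10 seconds, which is why the default healthy-zone rate works out to about one drained node per 10s.

package main

import (
   "fmt"
   "time"

   "k8s.io/client-go/util/flowcontrol"
)

func main() {
   // Illustration only: a token bucket at 0.1 QPS (the default --node-eviction-rate)
   // hands out one token roughly every 10 seconds, so at most one failed node is
   // processed from the zone eviction queue per 10s. Burst of 1 keeps the demo simple.
   limiter := flowcontrol.NewTokenBucketRateLimiter(0.1, 1)
   start := time.Now()
   for i := 0; i < 3; i++ {
      limiter.Accept() // blocks until the next token is available
      fmt.Printf("node %d admitted after %v\n", i, time.Since(start).Round(time.Second))
   }
}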

type Controller struct {
   taintManager *scheduler.NoExecuteTaintManager

   podInformerSynced cache.InformerSynced
   cloud             cloudprovider.Interface
   kubeClient        clientset.Interface

   // This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
   // to avoid the problem with time skew across the cluster.
   now func() metav1.Time

   enterPartialDisruptionFunc func(nodeNum int) float32
   enterFullDisruptionFunc    func(nodeNum int) float32
   computeZoneStateFunc       func(nodeConditions []*v1.NodeCondition) (int, ZoneState)

   knownNodeSet map[string]*v1.Node
   // per Node map storing last observed Status together with a local time when it was observed.
   nodeStatusMap map[string]nodeStatusData

   // Lock to access evictor workers
   evictorLock sync.Mutex

   // workers that evicts pods from unresponsive nodes.
   zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue

   // workers that are responsible for tainting nodes.
   zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue

   zoneStates map[string]ZoneState

   daemonSetStore          extensionslisters.DaemonSetLister
   daemonSetInformerSynced cache.InformerSynced

   nodeLister                  corelisters.NodeLister
   nodeInformerSynced          cache.InformerSynced
   nodeExistsInCloudProvider   func(types.NodeName) (bool, error)
   nodeShutdownInCloudProvider func(context.Context, *v1.Node) (bool, error)

   recorder record.EventRecorder

   // Value controlling Controller monitoring period, i.e. how often does Controller
   // check node status posted from kubelet. This value should be lower than nodeMonitorGracePeriod.
   // TODO: Change node status monitor to watch based.
   nodeMonitorPeriod time.Duration

   // Value used if sync_nodes_status=False, only for node startup. When node
   // is just created, e.g. cluster bootstrap or node creation, we give a longer grace period.
   nodeStartupGracePeriod time.Duration

   // Value used if sync_nodes_status=False. Controller will not proactively
   // sync node status in this case, but will monitor node status updated from kubelet. If
   // it doesn't receive update for this amount of time, it will start posting "NodeReady==
   // ConditionUnknown". The amount of time before which Controller start evicting pods
   // is controlled via flag 'pod-eviction-timeout'.
   // Note: be cautious when changing the constant, it must work with nodeStatusUpdateFrequency
   // in kubelet. There are several constraints:
   // 1. nodeMonitorGracePeriod must be N times more than nodeStatusUpdateFrequency, where
   //    N means number of retries allowed for kubelet to post node status. It is pointless
   //    to make nodeMonitorGracePeriod be less than nodeStatusUpdateFrequency, since there
   //    will only be fresh values from Kubelet at an interval of nodeStatusUpdateFrequency.
   //    The constant must be less than podEvictionTimeout.
   // 2. nodeMonitorGracePeriod can't be too large for user experience - larger value takes
   //    longer for user to see up-to-date node status.
   nodeMonitorGracePeriod time.Duration

   podEvictionTimeout          time.Duration
   evictionLimiterQPS          float32
   secondaryEvictionLimiterQPS float32
   largeClusterThreshold       int32
   unhealthyZoneThreshold      float32

   // if set to true Controller will start TaintManager that will evict Pods from
   // tainted nodes, if they're not tolerated.
   runTaintManager bool

   // if set to true Controller will taint Nodes with 'TaintNodeNotReady' and 'TaintNodeUnreachable'
   // taints instead of evicting Pods itself.
   useTaintBasedEvictions bool

   // if set to true, NodeController will taint Nodes based on its condition for 'NetworkUnavailable',
   // 'MemoryPressure', 'OutOfDisk' and 'DiskPressure'.
   taintNodeByCondition bool

   nodeUpdateQueue workqueue.Interface
}

The NodeLifecycleController.NewNodeLifecycleController function

NewNodeLifecycleController creates the NodeLifecycleController instance. From its parameters you can see that NodeLifecycleController lists and watches three kinds of objects in the cluster: Pod, Node and DaemonSet. The main logic of the function is:

  • Register the controller's metrics with Prometheus
  • Create the NodeLifecycleController instance
  • Register enterPartialDisruptionFunc as the ReducedQPSFunc method; ReducedQPSFunc returns secondaryEvictionLimiterQPS when its argument is greater than largeClusterThreshold and 0 when it is less than or equal to largeClusterThreshold (both helpers are sketched after the listing below)
  • Register enterFullDisruptionFunc as the HealthyQPSFunc method, which returns evictionLimiterQPS
  • Register computeZoneStateFunc as the controller's ComputeZoneState method, which returns the number of NotReady nodes and the ZoneState; ComputeZoneState is read in detail later
  • Register the AddFunc, UpdateFunc and DeleteFunc of the podInformer EventHandler. If the TaintManager is enabled, these handlers compare the Tolerations and NodeName of oldPod and newPod; if either differs, the pod change is added to NoExecuteTaintManager's podUpdateQueue and handed to the taint controller. For AddFunc oldPod is nil, and for DeleteFunc newPod is nil
  • Register podInformerSynced, used to check whether pods have synced
  • If runTaintManager is true (--enable-taint-manager set to true, or left at its default of true), register taintManager as a NoExecuteTaintManager and register the AddFunc, UpdateFunc and DeleteFunc of the nodeInformer EventHandler. These handlers compare the taints of oldNode and newNode; if they differ, the node change is added to NoExecuteTaintManager's nodeUpdateQueue and handed to the taint controller. For AddFunc oldNode is nil, and for DeleteFunc newNode is nil
  • If taintNodeByCondition is true (the TaintNodesByCondition feature gate set to true, or left at its default of true), register the AddFunc and UpdateFunc of the nodeInformer EventHandler, which add the node to NodeLifecycleController's nodeUpdateQueue
  • Register the AddFunc and UpdateFunc of the nodeInformer EventHandler that convert the deprecated taints "node.alpha.kubernetes.io/notReady" and "node.alpha.kubernetes.io/unreachable" into "node.kubernetes.io/not-ready" and "node.kubernetes.io/unreachable"
  • Register nodeLister to list & watch nodes, and nodeInformerSynced to check whether nodes have synced
  • Register daemonSetStore to list & watch DaemonSets, and daemonSetInformerSynced to check whether DaemonSets have synced
k8s.io/kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go:229

func NewNodeLifecycleController(podInformer coreinformers.PodInformer,
   nodeInformer coreinformers.NodeInformer,
   daemonSetInformer extensionsinformers.DaemonSetInformer,
   cloud cloudprovider.Interface,
   kubeClient clientset.Interface,
   nodeMonitorPeriod time.Duration,
   nodeStartupGracePeriod time.Duration,
   nodeMonitorGracePeriod time.Duration,
   podEvictionTimeout time.Duration,
   evictionLimiterQPS float32,
   secondaryEvictionLimiterQPS float32,
   largeClusterThreshold int32,
   unhealthyZoneThreshold float32,
   runTaintManager bool,
   useTaintBasedEvictions bool,
   taintNodeByCondition bool) (*Controller, error) {

   if kubeClient == nil {
      glog.Fatalf("kubeClient is nil when starting Controller")
   }

   eventBroadcaster := record.NewBroadcaster()
   recorder := eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "node-controller"})

   // Register the controller's metrics with Prometheus
   if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {
      metrics.RegisterMetricAndTrackRateLimiterUsage("node_lifecycle_controller", kubeClient.CoreV1().RESTClient().GetRateLimiter())
   }

   // Create the NodeLifecycleController instance
   nc := &Controller{
      ...
   }
   if useTaintBasedEvictions {
      glog.Infof("Controller is using taint based evictions.")
   }

   // Register enterPartialDisruptionFunc as ReducedQPSFunc: it returns secondaryEvictionLimiterQPS when its
   // argument is greater than largeClusterThreshold, and 0 otherwise
   nc.enterPartialDisruptionFunc = nc.ReducedQPSFunc 

   // Register enterFullDisruptionFunc as HealthyQPSFunc, which returns evictionLimiterQPS
   nc.enterFullDisruptionFunc = nc.HealthyQPSFunc

   // Register computeZoneStateFunc as the controller's ComputeZoneState method, which returns the NotReady node count and the ZoneState
   nc.computeZoneStateFunc = nc.ComputeZoneState

   // Register the AddFunc, UpdateFunc and DeleteFunc of the podInformer EventHandler
   podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      AddFunc: func(obj interface{}) {
         pod := obj.(*v1.Pod)
         if nc.taintManager != nil {
            nc.taintManager.PodUpdated(nil, pod)
         }
      },
      UpdateFunc: func(prev, obj interface{}) {
         prevPod := prev.(*v1.Pod)
         newPod := obj.(*v1.Pod)
         if nc.taintManager != nil {
            nc.taintManager.PodUpdated(prevPod, newPod)
         }
      },
      DeleteFunc: func(obj interface{}) {
         pod, isPod := obj.(*v1.Pod)
         // We can get DeletedFinalStateUnknown instead of *v1.Pod here and we need to handle that correctly.
         if !isPod {
            deletedState, ok := obj.(cache.DeletedFinalStateUnknown)
            if !ok {
               glog.Errorf("Received unexpected object: %v", obj)
               return
            }
            pod, ok = deletedState.Obj.(*v1.Pod)
            if !ok {
               glog.Errorf("DeletedFinalStateUnknown contained non-Pod object: %v", deletedState.Obj)
               return
            }
         }
         if nc.taintManager != nil {
            nc.taintManager.PodUpdated(pod, nil)
         }
      },
   })

   // Register podInformerSynced to check whether pods have synced
   nc.podInformerSynced = podInformer.Informer().HasSynced

   // If runTaintManager is true (--enable-taint-manager set to true, or left at its default of true),
   // register taintManager as a NoExecuteTaintManager and register the AddFunc, UpdateFunc and DeleteFunc
   // of the nodeInformer EventHandler
   if nc.runTaintManager {
      nc.taintManager = scheduler.NewNoExecuteTaintManager(kubeClient)
      nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
         AddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error {
            nc.taintManager.NodeUpdated(nil, node)
            return nil
         }),
         UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(oldNode, newNode *v1.Node) error {
            nc.taintManager.NodeUpdated(oldNode, newNode)
            return nil
         }),
         DeleteFunc: nodeutil.CreateDeleteNodeHandler(func(node *v1.Node) error {
            nc.taintManager.NodeUpdated(node, nil)
            return nil
         }),
      })
   }

   // If taintNodeByCondition is true (the TaintNodesByCondition feature gate set to true, or left at its
   // default of true), register the AddFunc and UpdateFunc of the nodeInformer EventHandler
   if nc.taintNodeByCondition {
      glog.Infof("Controller will taint node by condition.")
      nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
         AddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error {
            nc.nodeUpdateQueue.Add(node.Name)
            return nil
         }),
         UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(_, newNode *v1.Node) error {
            nc.nodeUpdateQueue.Add(newNode.Name)
            return nil
         }),
      })
   }

   // Register the AddFunc and UpdateFunc of the nodeInformer EventHandler
   // NOTE(resouer): nodeInformer to substitute deprecated taint key (notReady -> not-ready).
   // Remove this logic when we don't need this backwards compatibility
   nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      AddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error {
         return nc.doFixDeprecatedTaintKeyPass(node)
      }),
      UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(_, newNode *v1.Node) error {
         return nc.doFixDeprecatedTaintKeyPass(newNode)
      }),
   })

   // Register nodeLister to list & watch nodes, and nodeInformerSynced to check whether nodes have synced
   nc.nodeLister = nodeInformer.Lister()
   nc.nodeInformerSynced = nodeInformer.Informer().HasSynced

   // Register daemonSetStore to list & watch DaemonSets, and daemonSetInformerSynced to check whether DaemonSets have synced
   nc.daemonSetStore = daemonSetInformer.Lister()
   nc.daemonSetInformerSynced = daemonSetInformer.Informer().HasSynced

   return nc, nil
}
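
The two rate helpers registered above do not appear in this listing. Based on the behavior described earlier, they look roughly like the sketch below (check node_lifecycle_controller.go for the exact source):

// Sketch of the helpers registered as enterFullDisruptionFunc / enterPartialDisruptionFunc.
// HealthyQPSFunc keeps the normal eviction rate; ReducedQPSFunc drops to the secondary rate
// for large clusters and to 0 for zones at or below largeClusterThreshold.
func (nc *Controller) HealthyQPSFunc(nodeNum int) float32 {
   return nc.evictionLimiterQPS
}

func (nc *Controller) ReducedQPSFunc(nodeNum int) float32 {
   if int32(nodeNum) > nc.largeClusterThreshold {
      return nc.secondaryEvictionLimiterQPS
   }
   return 0
}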

The NodeLifecycleController.ComputeZoneState method

NodeLifecycleController.ComputeZoneState computes the number of notReadyNodes and the ZoneState. Its logic is as follows (a small worked example follows the listing below):

  • If readyNodes is 0 and notReadyNodes is greater than 0, the ZoneState is FullDisruption
  • If notReadyNodes is greater than 2 and the NotReady fraction of all nodes is greater than or equal to nc.unhealthyZoneThreshold (set with --unhealthy-zone-threshold, default 0.55), the ZoneState is PartialDisruption
  • In every other case the ZoneState is Normal
k8s.io/kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go:1230

func (nc *Controller) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, ZoneState) {
   readyNodes := 0
   notReadyNodes := 0
   for i := range nodeReadyConditions {
      if nodeReadyConditions[i] != nil && nodeReadyConditions[i].Status == v1.ConditionTrue {
         readyNodes++
      } else {
         notReadyNodes++
      }
   }
   switch {
   case readyNodes == 0 && notReadyNodes > 0:
      return notReadyNodes, stateFullDisruption
   case notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:
      return notReadyNodes, statePartialDisruption
   default:
      return notReadyNodes, stateNormal
   }
}
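
As a quick worked example of these thresholds (my own illustration that mirrors the switch above rather than calling the controller): with the default unhealthyZoneThreshold of 0.55, a zone with 6 NotReady nodes out of 10 is PartialDisruption, a zone with no Ready nodes left is FullDisruption, and 2 NotReady nodes out of 10 is still Normal.

package main

import "fmt"

// zoneState mirrors the decision logic of ComputeZoneState for illustration only;
// the real method lives on *Controller and reads nc.unhealthyZoneThreshold.
func zoneState(ready, notReady int, unhealthyZoneThreshold float32) string {
   switch {
   case ready == 0 && notReady > 0:
      return "FullDisruption"
   case notReady > 2 && float32(notReady)/float32(notReady+ready) >= unhealthyZoneThreshold:
      return "PartialDisruption"
   default:
      return "Normal"
   }
}

func main() {
   fmt.Println(zoneState(4, 6, 0.55)) // PartialDisruption: 6/10 = 0.6 >= 0.55 and notReady > 2
   fmt.Println(zoneState(0, 3, 0.55)) // FullDisruption: no Ready node left in the zone
   fmt.Println(zoneState(8, 2, 0.55)) // Normal: notReady <= 2
}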

The NodeLifecycleController.Run method

Once the NodeLifecycleController instance has been created, startNodeLifecycleController calls NodeLifecycleController.Run, which mainly starts several goroutine workers:

  • First call controller.WaitForCacheSync and wait until the HasSynced functions of the node, pod and DaemonSet informers all return true, i.e. until the Node, Pod and DaemonSet objects have finished syncing
  • If runTaintManager is true (--enable-taint-manager set to true, or left at its default of true), run nc.taintManager.Run in a goroutine until the stop signal is received
  • If taintNodeByCondition is true (the TaintNodesByCondition feature gate set to true, or left at its default of true), start 8 goroutine workers, each running nc.doNoScheduleTaintingPassWorker on a 1s period
  • If useTaintBasedEvictions is true (the TaintBasedEvictions feature gate is enabled), start one goroutine worker running nc.doNoExecuteTaintingPass on a 100ms period; otherwise start one goroutine worker running nc.doEvictionPass on a 100ms period
  • Start one goroutine worker running nc.monitorNodeStatus() on a period of nc.nodeMonitorPeriod (set with --node-monitor-period, default 5s) to monitor NodeStatus
k8s.io/kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go:383

func (nc *Controller) Run(stopCh <-chan struct{}) {
   defer utilruntime.HandleCrash()

   glog.Infof("Starting node controller")
   defer glog.Infof("Shutting down node controller")

   if !controller.WaitForCacheSync("taint", stopCh, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) {
      return
   }

   if nc.runTaintManager {
      go nc.taintManager.Run(stopCh)
   }

   if nc.taintNodeByCondition {
      // Close node update queue to cleanup go routine.
      defer nc.nodeUpdateQueue.ShutDown()

      // Start workers to update NoSchedule taint for nodes.
      for i := 0; i < scheduler.UpdateWorkerSize; i++ {
         // Thanks to "workqueue", each worker just need to get item from queue, because
         // the item is flagged when got from queue: if new event come, the new item will
         // be re-queued until "Done", so no more than one worker handle the same item and
         // no event missed.
         go wait.Until(nc.doNoScheduleTaintingPassWorker, time.Second, stopCh)
      }
   }

   if nc.useTaintBasedEvictions {
      // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
      // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
      go wait.Until(nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod, stopCh)
   } else {
      // Managing eviction of nodes:
      // When we delete pods off a node, if the node was not empty at the time we then
      // queue an eviction watcher. If we hit an error, retry deletion.
      go wait.Until(nc.doEvictionPass, scheduler.NodeEvictionPeriod, stopCh)
   }

   // Incorporate the results of node status pushed from kubelet to master.
   go wait.Until(func() {
      if err := nc.monitorNodeStatus(); err != nil {
         glog.Errorf("Error monitoring node status: %v", err)
      }
   }, nc.nodeMonitorPeriod, stopCh)

   <-stopCh
}

TaintManager

Because --enable-taint-manager defaults to true, NodeLifecycleController enables the TaintManager by default, so let's look at what the TaintManager does first. From the code above we know that NewNodeLifecycleController registers NodeLifecycleController.taintManager as a NoExecuteTaintManager, and that NodeLifecycleController.Run calls NoExecuteTaintManager.Run in a goroutine.

The NewNoExecuteTaintManager function

NewNoExecuteTaintManager simply creates a NoExecuteTaintManager instance.

  • NoExecuteTaintManager has two queues, nodeUpdateQueue and podUpdateQueue. As seen in NewNodeLifecycleController, once the TaintManager is enabled, the EventHandlers registered on the node and pod informers call NoExecuteTaintManager's NodeUpdated and PodUpdated methods to add nodeUpdateItems and podUpdateItems to these two queues
  • It creates a TimedWorkerQueue for taintEvictionQueue and registers deletePodHandler as TimedWorkerQueue.workFunc
k8s.io/kubernetes/pkg/controller/nodelifecycle/scheduler/taint_manager.go:185

func NewNoExecuteTaintManager(c clientset.Interface) *NoExecuteTaintManager {
   eventBroadcaster := record.NewBroadcaster()
   recorder := eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "taint-controller"})
   eventBroadcaster.StartLogging(glog.Infof)
   if c != nil {
      glog.V(0).Infof("Sending events to api server.")
      eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: c.CoreV1().Events("")})
   } else {
      glog.Fatalf("kubeClient is nil when starting NodeController")
   }

   tm := &NoExecuteTaintManager{
      client:       c,
      recorder:     recorder,
      taintedNodes: make(map[string][]v1.Taint),

      // NodeLifecycleController watches node updates and adds the node to this queue
      nodeUpdateQueue: workqueue.New(),

      // NodeLifecycleController watches pod updates and adds the pod to this queue
      podUpdateQueue:  workqueue.New(),
   }

   // Create a TimedWorkerQueue for taintEvictionQueue and register deletePodHandler as its workFunc
   tm.taintEvictionQueue = CreateWorkerQueue(deletePodHandler(c, tm.emitPodDeletionEvent))

   return tm
}

The deletePodHandler function

deletePodHandler calls the apiserver to delete the given pod, retrying up to 5 times with a 10ms interval, and exits the loop as soon as the deletion succeeds.

k8s.io/kubernetes/pkg/controller/nodelifecycle/scheduler/taint_manager.go:113

func deletePodHandler(c clientset.Interface, emitEventFunc func(types.NamespacedName)) func(args *WorkArgs) error {
   return func(args *WorkArgs) error {
      ns := args.NamespacedName.Namespace
      name := args.NamespacedName.Name
      glog.V(0).Infof("NoExecuteTaintManager is deleting Pod: %v", args.NamespacedName.String())
      if emitEventFunc != nil {
         emitEventFunc(args.NamespacedName)
      }
      var err error

      // Call the apiserver to delete the pod, retrying up to 5 times with a 10ms interval; exit the loop as soon as the deletion succeeds
      for i := 0; i < retries; i++ {
         err = c.CoreV1().Pods(ns).Delete(name, &metav1.DeleteOptions{})
         if err == nil {
            break
         }
         time.Sleep(10 * time.Millisecond)
      }
      return err
   }
}

The NoExecuteTaintManager.Run method

NoExecuteTaintManager.Run implements the following logic (the hash helper used to shard updates onto workers is sketched after the listing below):

  • Create nodeUpdateChannels and podUpdateChannels, each a slice of 8 channels
  • Start a goroutine that keeps taking nodeUpdateItems from tc.nodeUpdateQueue and pushes each into one of the nodeUpdateChannels, until the stop signal is received
  • Start a goroutine that keeps taking podUpdateItems from tc.podUpdateQueue and pushes each into one of the podUpdateChannels, until the stop signal is received
  • Start 8 goroutines running tc.worker (i.e. NoExecuteTaintManager.worker)
func (tc *NoExecuteTaintManager) Run(stopCh <-chan struct{}) {
   glog.V(0).Infof("Starting NoExecuteTaintManager")

   // Create nodeUpdateChannels and podUpdateChannels, each a slice of 8 channels
   for i := 0; i < UpdateWorkerSize; i++ {
      tc.nodeUpdateChannels = append(tc.nodeUpdateChannels, make(chan *nodeUpdateItem, NodeUpdateChannelSize))
      tc.podUpdateChannels = append(tc.podUpdateChannels, make(chan *podUpdateItem, podUpdateChannelSize))
   }

   // Functions that are responsible for taking work items out of the workqueues and putting them
   // into channels.
   // Start a goroutine that keeps taking nodeUpdateItems from nodeUpdateQueue and pushes each into one of the nodeUpdateChannels, until the stop signal is received
   go func(stopCh <-chan struct{}) {
      for {
         item, shutdown := tc.nodeUpdateQueue.Get()
         if shutdown {
            break
         }
         nodeUpdate := item.(*nodeUpdateItem)
         hash := hash(nodeUpdate.name(), UpdateWorkerSize)
         select {
         case <-stopCh:
            tc.nodeUpdateQueue.Done(item)
            return
         case tc.nodeUpdateChannels[hash] <- nodeUpdate:
         }
         tc.nodeUpdateQueue.Done(item)
      }
   }(stopCh)

   // Start a goroutine that keeps taking podUpdateItems from podUpdateQueue and pushes each into one of the podUpdateChannels, until the stop signal is received
   go func(stopCh <-chan struct{}) {
      for {
         item, shutdown := tc.podUpdateQueue.Get()
         if shutdown {
            break
         }
         podUpdate := item.(*podUpdateItem)
         hash := hash(podUpdate.nodeName(), UpdateWorkerSize)
         select {
         case <-stopCh:
            tc.podUpdateQueue.Done(item)
            return
         case tc.podUpdateChannels[hash] <- podUpdate:
         }
         tc.podUpdateQueue.Done(item)
      }
   }(stopCh)

   wg := sync.WaitGroup{}
   wg.Add(UpdateWorkerSize)

   // Start 8 goroutines running tc.worker, i.e. NoExecuteTaintManager.worker
   for i := 0; i < UpdateWorkerSize; i++ {
      go tc.worker(i, wg.Done, stopCh)
   }
   wg.Wait()
}
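
The hash helper that picks which channel a node's updates go to is not shown above. It is essentially a hash of the node name modulo the worker count, so all updates for the same node always land on the same worker; a sketch (the real helper in taint_manager.go may differ in detail):

package scheduler

import (
   "hash/fnv"
   "io"
)

// Sketch: shard items onto one of UpdateWorkerSize channels by hashing the node name,
// so all updates for a given node are always handled by the same worker goroutine.
func hash(val string, max int) int {
   hasher := fnv.New32a()
   io.WriteString(hasher, val)
   return int(hasher.Sum32() % uint32(max))
}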

The NoExecuteTaintManager.worker method

NoExecuteTaintManager.worker watches nodeUpdateChannels and podUpdateChannels and processes whatever arrives:

  • Take a nodeUpdateItem from nodeUpdateChannels and call tc.handleNodeUpdate
  • Take a podUpdateItem from podUpdateChannels; before processing it, drain all pending nodeUpdateItems from nodeUpdateChannels, then call tc.handlePodUpdate on the podUpdateItem

Let's look at what handleNodeUpdate and handlePodUpdate each do.

k8s.io/kubernetes/pkg/controller/nodelifecycle/scheduler/taint_manager.go:264

func (tc *NoExecuteTaintManager) worker(worker int, done func(), stopCh <-chan struct{}) {
   defer done()

   // When processing events we want to prioritize Node updates over Pod updates,
   // as NodeUpdates that interest NoExecuteTaintManager should be handled as soon as possible -
   // we don't want user (or system) to wait until PodUpdate queue is drained before it can
   // start evicting Pods from tainted Nodes.
   for {
      select {
      case <-stopCh:
         return
      // Take a nodeUpdateItem from nodeUpdateChannels and call tc.handleNodeUpdate
      case nodeUpdate := <-tc.nodeUpdateChannels[worker]:
         tc.handleNodeUpdate(nodeUpdate)
      // Take a podUpdateItem from podUpdateChannels; before handling it, drain and handle all pending
      // nodeUpdateItems from nodeUpdateChannels, then call tc.handlePodUpdate on the podUpdateItem
      case podUpdate := <-tc.podUpdateChannels[worker]:
         // If we found a Pod update we need to empty Node queue first.
      priority:
         for {
            select {
            case nodeUpdate := <-tc.nodeUpdateChannels[worker]:
               tc.handleNodeUpdate(nodeUpdate)
            default:
               break priority
            }
         }
         // After Node queue is emptied we process podUpdate.
         tc.handlePodUpdate(podUpdate)
      }
   }
}

The NoExecuteTaintManager.handleNodeUpdate method

  • If nodeUpdate.newNode is nil, the node is being deleted, so remove the node's taints from tc.taintedNodes and return
  • If this is a Create or Update:
    • Take newNode and newTaints from nodeUpdate and update tc.taintedNodes (if newTaints is empty, the node is removed from tc.taintedNodes)
    • Get all pods on the node
    • If the node has no taints, cancel the NoExecute-taint evictions of all pods on the node and return
    • Run tc.processPodOnNode for every pod on the node
k8s.io/kubernetes/pkg/controller/nodelifecycle/scheduler/taint_manager.go:418

func (tc *NoExecuteTaintManager) handleNodeUpdate(nodeUpdate *nodeUpdateItem) {
   // Delete
   // If nodeUpdate.newNode is nil the node is being deleted, so remove its taints from tc.taintedNodes
   if nodeUpdate.newNode == nil {
      node := nodeUpdate.oldNode
      glog.V(4).Infof("Noticed node deletion: %#v", node.Name)
      tc.taintedNodesLock.Lock()
      defer tc.taintedNodesLock.Unlock()
      delete(tc.taintedNodes, node.Name)
      return
   }

   // Create or Update
   // Create or Update: take newNode and newTaints from nodeUpdate and update tc.taintedNodes
   glog.V(4).Infof("Noticed node update: %#v", nodeUpdate)
   node := nodeUpdate.newNode
   taints := nodeUpdate.newTaints
   func() {
      tc.taintedNodesLock.Lock()
      defer tc.taintedNodesLock.Unlock()
      glog.V(4).Infof("Updating known taints on node %v: %v", node.Name, taints)
      if len(taints) == 0 {
         delete(tc.taintedNodes, node.Name)
      } else {
         tc.taintedNodes[node.Name] = taints
      }
   }()

   // Get all pods on the node
   pods, err := getPodsAssignedToNode(tc.client, node.Name)
   if err != nil {
      glog.Errorf(err.Error())
      return
   }
   if len(pods) == 0 {
      return
   }
   // Short circuit, to make this controller a bit faster.
   // If the node has no taints, cancel the NoExecute-taint evictions of all pods on the node
   if len(taints) == 0 {
      glog.V(4).Infof("All taints were removed from the Node %v. Cancelling all evictions...", node.Name)
      for i := range pods {
         tc.cancelWorkWithEvent(types.NamespacedName{Namespace: pods[i].Namespace, Name: pods[i].Name})
      }
      return
   }

   // Run tc.processPodOnNode for every pod on the node
   now := time.Now()
   for i := range pods {
      pod := &pods[i]
      podNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
      tc.processPodOnNode(podNamespacedName, node.Name, pod.Spec.Tolerations, taints, now)
   }
}

The NoExecuteTaintManager.handlePodUpdate method

  • If podUpdate.newPod is nil, the pod is being deleted, so cancel the pod's NoExecute-taint eviction and return
  • If this is a Create or Update, get all taints of the node that podUpdate.newPod runs on and run tc.processPodOnNode for podUpdate.newPod
k8s.io/kubernetes/pkg/controller/nodelifecycle/scheduler/taint_manager.go:387

func (tc *NoExecuteTaintManager) handlePodUpdate(podUpdate *podUpdateItem) {
   // Delete
   // If podUpdate.newPod is nil the pod is being deleted, so cancel its NoExecute-taint eviction
   if podUpdate.newPod == nil {
      pod := podUpdate.oldPod
      podNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
      glog.V(4).Infof("Noticed pod deletion: %#v", podNamespacedName)
      tc.cancelWorkWithEvent(podNamespacedName)
      return
   }
   // Create or Update
   // Create or Update: get all taints of the node that podUpdate.newPod runs on and run tc.processPodOnNode for it
   pod := podUpdate.newPod
   podNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
   glog.V(4).Infof("Noticed pod update: %#v", podNamespacedName)
   nodeName := pod.Spec.NodeName
   if nodeName == "" {
      return
   }
   taints, ok := func() ([]v1.Taint, bool) {
      tc.taintedNodesLock.Lock()
      defer tc.taintedNodesLock.Unlock()
      taints, ok := tc.taintedNodes[nodeName]
      return taints, ok
   }()
   // It's possible that Node was deleted, or Taints were removed before, which triggered
   // eviction cancelling if it was needed.
   if !ok {
      return
   }
   tc.processPodOnNode(podNamespacedName, nodeName, podUpdate.newTolerations, taints, time.Now())
}

The NoExecuteTaintManager.processPodOnNode method

Both handleNodeUpdate and handlePodUpdate end up calling processPodOnNode to evict pods, so let's look at its logic:

  • If the node has no taints, cancel the pod's eviction
  • Check whether the pod tolerates all of the node's taints. If not, cancel any scheduled eviction and call tc.taintEvictionQueue.AddWork to add the pod to tc.taintEvictionQueue so that the queue's deletePodHandler deletes the pod right away, then return
  • Take the TolerationSeconds of all matching tolerations (values below 0 are treated as 0) and use the minimum as minTolerationTime. If minTolerationTime is negative (i.e. none of the tolerations set TolerationSeconds), the pod tolerates the taints forever, so return (getMinTolerationTime is sketched after the listing below)
  • Compute triggerTime, the time at which the pod eviction should fire
  • Get the pod's scheduled eviction worker from tc.taintEvictionQueue. If scheduledEviction is not nil: if scheduledEviction.CreatedAt + minTolerationTime has not reached triggerTime, return; otherwise remove the pod's scheduled work (note: triggerTime is computed as time.Now() + minTolerationTime, so it is a little odd that the code does not simply compare scheduledEviction.CreatedAt against time.Now())
  • Call tc.taintEvictionQueue.AddWork to add the pod to tc.taintEvictionQueue, which schedules the registered deletePodHandler to delete the pod

That covers all of the TaintManager's logic. Next, let's continue with the other work started from NodeLifecycleController.Run.

k8s.io/kubernetes/pkg/controller/nodelifecycle/scheduler/taint_manager.go:349

func (tc *NoExecuteTaintManager) processPodOnNode(
   podNamespacedName types.NamespacedName,
   nodeName string,
   tolerations []v1.Toleration,
   taints []v1.Taint,
   now time.Time,
) {

   // If the node has no taints, cancel the pod's eviction
   if len(taints) == 0 {
      tc.cancelWorkWithEvent(podNamespacedName)
   }

   // Check whether the pod tolerates all of the node's taints; if not, cancel any scheduled eviction and add the pod to tc.taintEvictionQueue via AddWork so that deletePodHandler deletes it right away
   allTolerated, usedTolerations := v1helper.GetMatchingTolerations(taints, tolerations)
   if !allTolerated {
      glog.V(2).Infof("Not all taints are tolerated after update for Pod %v on %v", podNamespacedName.String(), nodeName)
      // We're canceling scheduled work (if any), as we're going to delete the Pod right away.
      tc.cancelWorkWithEvent(podNamespacedName)
      tc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), time.Now(), time.Now())
      return
   }

   // Take the TolerationSeconds of the matching tolerations (values below 0 are treated as 0) and use the minimum as minTolerationTime; a negative minTolerationTime means no TolerationSeconds was set, i.e. tolerate forever, so return
   minTolerationTime := getMinTolerationTime(usedTolerations)
   // getMinTolerationTime returns negative value to denote infinite toleration.
   if minTolerationTime < 0 {
      glog.V(4).Infof("New tolerations for %v tolerate forever. Scheduled deletion won't be cancelled if already scheduled.", podNamespacedName.String())
      return
   }

   // Compute the time at which the pod eviction should fire
   startTime := now
   triggerTime := startTime.Add(minTolerationTime)

   // Get the pod's scheduled eviction worker from tc.taintEvictionQueue; if scheduledEviction is not nil and scheduledEviction.CreatedAt + minTolerationTime has not reached triggerTime, return; otherwise cancel the scheduled work
   scheduledEviction := tc.taintEvictionQueue.GetWorkerUnsafe(podNamespacedName.String())
   if scheduledEviction != nil {
      startTime = scheduledEviction.CreatedAt
      if startTime.Add(minTolerationTime).Before(triggerTime) {
         return
      }
      tc.cancelWorkWithEvent(podNamespacedName)
   }

   // Add the pod to tc.taintEvictionQueue via AddWork, which schedules the registered deletePodHandler to delete it
   tc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), startTime, triggerTime)
}
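
The getMinTolerationTime helper referenced above is not shown in the listing. Based on the behavior just described, it looks roughly like the sketch below (the real helper in taint_manager.go may differ in detail):

package scheduler

import (
   "time"

   v1 "k8s.io/api/core/v1"
)

// Sketch of getMinTolerationTime: return the smallest TolerationSeconds among the matching
// tolerations. Non-positive values mean "evict immediately" (0), and a negative result means
// no TolerationSeconds was set at all, which the caller interprets as "tolerate forever".
func getMinTolerationTime(tolerations []v1.Toleration) time.Duration {
   minSeconds := int64(-1)
   for i := range tolerations {
      if tolerations[i].TolerationSeconds == nil {
         continue
      }
      s := *tolerations[i].TolerationSeconds
      if s <= 0 {
         return 0
      }
      if minSeconds == -1 || s < minSeconds {
         minSeconds = s
      }
   }
   return time.Duration(minSeconds) * time.Second
}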

The NodeLifecycleController.doNoScheduleTaintingPassWorker method

If the TaintNodesByCondition feature gate is set to true (or left at its default of true), NodeLifecycleController starts 8 goroutine workers, each running nc.doNoScheduleTaintingPassWorker on a 1s period. doNoScheduleTaintingPassWorker runs a loop that only ends when the stop signal is received; each iteration does the following:

  • Take an object from nc.nodeUpdateQueue
  • Run doNoScheduleTaintingPass with the node name taken from that object. doNoScheduleTaintingPass works as follows:
    • Get the node object by name from the nodeLister
    • Check the node's Conditions: for every condition whose type and status match ('Ready' being False or Unknown; 'MemoryPressure', 'OutOfDisk', 'DiskPressure', 'NetworkUnavailable', 'PIDPressure' being True), append the taint key corresponding to that condition status, with effect "NoSchedule", to taints (the condition-to-taint map is sketched after the listing below)
    • If node.Spec.Unschedulable is set, append the taint Key:"node.kubernetes.io/unschedulable", Effect:"NoSchedule" to taints
    • From the node's existing taints, filter into nodeTaints those whose Effect is "NoSchedule" and whose Key is one of "node.kubernetes.io/unschedulable", "node.kubernetes.io/not-ready", "node.kubernetes.io/unreachable", "node.kubernetes.io/network-unavailable", "node.kubernetes.io/memory-pressure", "node.kubernetes.io/out-of-disk", "node.kubernetes.io/disk-pressure", "node.kubernetes.io/pid-pressure"
    • Diff taints against nodeTaints to get taintsToAdd and taintsToDel, then call nodeutil.SwapNodeControllerTaint with them so that the node's taints match taints
k8s.io/kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go:470

func (nc *Controller) doNoScheduleTaintingPassWorker() {
   for {
      obj, shutdown := nc.nodeUpdateQueue.Get()
      // "nodeUpdateQueue" will be shutdown when "stopCh" closed;
      // we do not need to re-check "stopCh" again.
      if shutdown {
         return
      }
      nodeName := obj.(string)

      if err := nc.doNoScheduleTaintingPass(nodeName); err != nil {
         // TODO (k82cn): Add nodeName back to the queue.
         glog.Errorf("Failed to taint NoSchedule on node <%s>, requeue it: %v", nodeName, err)
      }
      nc.nodeUpdateQueue.Done(nodeName)
   }
}

func (nc *Controller) doNoScheduleTaintingPass(nodeName string) error {
   // Get the node object by name from the nodeLister
   node, err := nc.nodeLister.Get(nodeName)
   if err != nil {
      // If node not found, just ignore it.
      if apierrors.IsNotFound(err) {
         return nil
      }
      return err
   }

   // Map node's condition to Taints.
   // Check the node's conditions: for 'Ready' False/Unknown or 'MemoryPressure', 'OutOfDisk', 'DiskPressure', 'NetworkUnavailable', 'PIDPressure' True, append the corresponding taint key with effect NoSchedule to taints
   var taints []v1.Taint
   for _, condition := range node.Status.Conditions {
      if taintMap, found := nodeConditionToTaintKeyStatusMap[condition.Type]; found {
         if taintKey, found := taintMap[condition.Status]; found {
            taints = append(taints, v1.Taint{
               Key:    taintKey,
               Effect: v1.TaintEffectNoSchedule,
            })
         }
      }
   }

   // If node.Spec.Unschedulable is set, append the taint Key:"node.kubernetes.io/unschedulable", Effect:"NoSchedule" to taints
   if node.Spec.Unschedulable {
      // If unschedulable, append related taint.
      taints = append(taints, v1.Taint{
         Key:    algorithm.TaintNodeUnschedulable,
         Effect: v1.TaintEffectNoSchedule,
      })
   }

   // Get exist taints of node.
   // From the node's existing taints, keep in nodeTaints those whose Effect is NoSchedule and whose Key is one of the node-condition / unschedulable taint keys
   nodeTaints := taintutils.TaintSetFilter(node.Spec.Taints, func(t *v1.Taint) bool {
      // only NoSchedule taints are candidates to be compared with "taints" later
      if t.Effect != v1.TaintEffectNoSchedule {
         return false
      }
      // Find unschedulable taint of node.
      if t.Key == algorithm.TaintNodeUnschedulable {
         return true
      }
      // Find node condition taints of node.
      _, found := taintKeyToNodeConditionMap[t.Key]
      return found
   })

   // Diff taints against nodeTaints to get taintsToAdd and taintsToDel, then call nodeutil.SwapNodeControllerTaint to make the node's taints match taints
   taintsToAdd, taintsToDel := taintutils.TaintSetDiff(taints, nodeTaints)
   // If nothing to add not delete, return true directly.
   if len(taintsToAdd) == 0 && len(taintsToDel) == 0 {
      return nil
   }

   if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, taintsToAdd, taintsToDel, node) {
      return fmt.Errorf("failed to swap taints of node %+v", node)
   }
   return nil
}
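
The nodeConditionToTaintKeyStatusMap used above encodes the condition-to-taint mapping listed before this section's code. A sketch of its contents (the exact declaration sits near the top of node_lifecycle_controller.go and may differ slightly):

// Sketch of the condition -> NoSchedule taint key mapping consumed by doNoScheduleTaintingPass;
// taintKeyToNodeConditionMap is simply the reverse lookup used when filtering existing taints.
var nodeConditionToTaintKeyStatusMap = map[v1.NodeConditionType]map[v1.ConditionStatus]string{
   v1.NodeReady: {
      v1.ConditionFalse:   algorithm.TaintNodeNotReady,    // "node.kubernetes.io/not-ready"
      v1.ConditionUnknown: algorithm.TaintNodeUnreachable, // "node.kubernetes.io/unreachable"
   },
   v1.NodeMemoryPressure:     {v1.ConditionTrue: algorithm.TaintNodeMemoryPressure},     // "node.kubernetes.io/memory-pressure"
   v1.NodeOutOfDisk:          {v1.ConditionTrue: algorithm.TaintNodeOutOfDisk},          // "node.kubernetes.io/out-of-disk"
   v1.NodeDiskPressure:       {v1.ConditionTrue: algorithm.TaintNodeDiskPressure},       // "node.kubernetes.io/disk-pressure"
   v1.NodeNetworkUnavailable: {v1.ConditionTrue: algorithm.TaintNodeNetworkUnavailable}, // "node.kubernetes.io/network-unavailable"
   v1.NodePIDPressure:        {v1.ConditionTrue: algorithm.TaintNodePIDPressure},        // "node.kubernetes.io/pid-pressure"
}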

The NodeLifecycleController.doNoExecuteTaintingPass method

If useTaintBasedEvictions is true (the TaintBasedEvictions feature gate is enabled), NodeLifecycleController starts one goroutine worker running nc.doNoExecuteTaintingPass on a 100ms period. doNoExecuteTaintingPass iterates over the RateLimitedTimedQueues in NodeLifecycleController.zoneNoExecuteTainter and calls RateLimitedTimedQueue.Try on each. Try does the following (the two taint templates used below are sketched after the listing):

  • Call RateLimitedTimedQueue.queue.Head() to take the oldest entry
  • Loop over the following until ok is returned:
    • Call fn with the scheduler.TimedValue; fn's logic is:
      • First get the node object using TimedValue.Value from the nodeLister:
        • If the node is not found, log a warning and return ok
        • If the lookup fails for another reason, log a warning and return not ok (retry after 50ms)
        • If the lookup succeeds, record the evictions_number Prometheus metric
      • Get the node's Ready condition:
        • If the condition is False, set taintToAdd to (Key:"node.kubernetes.io/not-ready", Effect:"NoExecute") and oppositeTaint to (Key:"node.kubernetes.io/unreachable", Effect:"NoExecute")
        • If the condition is Unknown, set taintToAdd to (Key:"node.kubernetes.io/unreachable", Effect:"NoExecute") and oppositeTaint to (Key:"node.kubernetes.io/not-ready", Effect:"NoExecute")
        • If the condition is neither False nor Unknown, return ok directly
      • Call nodeutil.SwapNodeControllerTaint with taintToAdd and oppositeTaint to update the node's taints, and return ok if it succeeds
    • If fn returns ok, call RateLimitedTimedQueue.queue.RemoveFromQueue to remove the entry; if fn returns not ok, set the entry's ProcessAt to now plus the wait returned by fn plus 1, and put the entry back at the head of the RateLimitedTimedQueue
    • Call RateLimitedTimedQueue.queue.Head() again to take the next entry
k8s.io/kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go:543

func (nc *Controller) doNoExecuteTaintingPass() {
   nc.evictorLock.Lock()
   defer nc.evictorLock.Unlock()
   for k := range nc.zoneNoExecuteTainter {
      // Function should return 'false' and a time after which it should be retried, or 'true' if it shouldn't (it succeeded).
      nc.zoneNoExecuteTainter[k].Try(func(value scheduler.TimedValue) (bool, time.Duration) {
         // Get the node object from the nodeLister
         node, err := nc.nodeLister.Get(value.Value)
         if apierrors.IsNotFound(err) {
            glog.Warningf("Node %v no longer present in nodeLister!", value.Value)
            return true, 0
         } else if err != nil {
            glog.Warningf("Failed to get Node %v from the nodeLister: %v", value.Value, err)
            // retry in 50 millisecond
            return false, 50 * time.Millisecond
         } else {
            zone := utilnode.GetZoneKey(node)
            evictionsNumber.WithLabelValues(zone).Inc()
         }
         _, condition := v1node.GetNodeCondition(&node.Status, v1.NodeReady)
         // Because we want to mimic NodeStatus.Condition["Ready"] we make "unreachable" and "not ready" taints mutually exclusive.
         taintToAdd := v1.Taint{}
         oppositeTaint := v1.Taint{}
         // Pick taints according to the node's Ready condition status
         if condition.Status == v1.ConditionFalse {
            taintToAdd = *NotReadyTaintTemplate
            oppositeTaint = *UnreachableTaintTemplate
         } else if condition.Status == v1.ConditionUnknown {
            taintToAdd = *UnreachableTaintTemplate
            oppositeTaint = *NotReadyTaintTemplate
         } else {
            // It seems that the Node is ready again, so there's no need to taint it.
            glog.V(4).Infof("Node %v was in a taint queue, but it's ready now. Ignoring taint request.", value.Value)
            return true, 0
         }

         // Call nodeutil.SwapNodeControllerTaint to update the node's taints
         return nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{&oppositeTaint}, node), 0
      })
   }
}
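
The UnreachableTaintTemplate and NotReadyTaintTemplate referenced above are package-level variables in node_lifecycle_controller.go. Roughly, they are the two NoExecute taints the controller swaps between (a sketch; exact declarations may differ slightly):

// Sketch of the two NoExecute taint templates swapped by doNoExecuteTaintingPass and
// monitorNodeStatus, depending on whether the node is NotReady or Unreachable.
var (
   UnreachableTaintTemplate = &v1.Taint{
      Key:    algorithm.TaintNodeUnreachable, // "node.kubernetes.io/unreachable"
      Effect: v1.TaintEffectNoExecute,
   }
   NotReadyTaintTemplate = &v1.Taint{
      Key:    algorithm.TaintNodeNotReady, // "node.kubernetes.io/not-ready"
      Effect: v1.TaintEffectNoExecute,
   }
)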


k8s.io/kubernetes/pkg/controller/nodelifecycle/scheduler/rate_limited_queue.go:232
func (q *RateLimitedTimedQueue) Try(fn ActionFunc) {
   val, ok := q.queue.Head()
   q.limiterLock.Lock()
   defer q.limiterLock.Unlock()
   for ok {
      // rate limit the queue checking
      if !q.limiter.TryAccept() {
         glog.V(10).Infof("Try rate limited for value: %v", val)
         // Try again later
         break
      }

      now := now()
      if now.Before(val.ProcessAt) {
         break
      }

      if ok, wait := fn(val); !ok {
         val.ProcessAt = now.Add(wait + 1)
         q.queue.Replace(val)
      } else {
         q.queue.RemoveFromQueue(val.Value)
      }
      val, ok = q.queue.Head()
   }
}

The NodeLifecycleController.doEvictionPass method

If useTaintBasedEvictions is false (the TaintBasedEvictions feature gate set to false, or left at its default of false), NodeLifecycleController starts one goroutine worker running nc.doEvictionPass on a 100ms period. doEvictionPass iterates over the RateLimitedTimedQueues in NodeLifecycleController.zonePodEvictor and calls RateLimitedTimedQueue.Try on each. Try's logic was analyzed in the previous step; here we focus on the fn function passed to Try:

  • First get the node object from the nodeLister
  • Then call nodeutil.DeletePods to evict the pods; nodeutil.DeletePods works as follows (SetPodTerminationReason is sketched after the DeletePods listing below):
    • Initialize remaining to false and get all pods on the given node
    • Loop over those pods and for each one:
      • Set the pod's status Reason to "NodeLost" and Message to "Node %v which was running pod %v is unresponsive"; if that update fails with a conflict, append the error to updateErrList and skip the pod
      • Check the pod's DeletionGracePeriodSeconds; if it is not nil the pod is already being deleted, so set remaining to true and skip it
      • Check whether the pod is managed by a DaemonSet; if so, skip it
      • Call the apiserver to delete the pod; if the call fails, return remaining as false together with the error (note that if the node's kubelet is unresponsive, stopped, or the node has lost its network, the apiserver call alone will not actually remove the pod)
      • Set remaining to true
    • After the loop, if updateErrList is non-empty, return remaining as false together with the aggregated updateErrList
  • Check the remaining value returned by nodeutil.DeletePods; if it is true, log an info message
k8s.io/kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go:582

func (nc *Controller) doEvictionPass() {
   nc.evictorLock.Lock()
   defer nc.evictorLock.Unlock()
   for k := range nc.zonePodEvictor {
      // Function should return 'false' and a time after which it should be retried, or 'true' if it shouldn't (it succeeded).
      nc.zonePodEvictor[k].Try(func(value scheduler.TimedValue) (bool, time.Duration) {
         // Get the node object from the nodeLister
         node, err := nc.nodeLister.Get(value.Value)
         if apierrors.IsNotFound(err) {
            glog.Warningf("Node %v no longer present in nodeLister!", value.Value)
         } else if err != nil {
            glog.Warningf("Failed to get Node %v from the nodeLister: %v", value.Value, err)
         } else {
            zone := utilnode.GetZoneKey(node)
            evictionsNumber.WithLabelValues(zone).Inc()
         }
         nodeUID, _ := value.UID.(string)
         // Call nodeutil.DeletePods to evict the pods
         remaining, err := nodeutil.DeletePods(nc.kubeClient, nc.recorder, value.Value, nodeUID, nc.daemonSetStore)
         if err != nil {
            utilruntime.HandleError(fmt.Errorf("unable to evict node %q: %v", value.Value, err))
            return false, 0
         }
         if remaining {
            glog.Infof("Pods awaiting deletion due to Controller eviction")
         }
         return true, 0
      })
   }
}

k8s.io/kubernetes/pkg/controller/util/node/controller_utils.go:55
func DeletePods(kubeClient clientset.Interface, recorder record.EventRecorder, nodeName, nodeUID string, daemonStore extensionslisters.DaemonSetLister) (bool, error) {
   // Initialize remaining to false
   remaining := false
   // Get all pods on the node
   selector := fields.OneTermEqualSelector(api.PodHostField, nodeName).String()
   options := metav1.ListOptions{FieldSelector: selector}
   pods, err := kubeClient.CoreV1().Pods(metav1.NamespaceAll).List(options)
   var updateErrList []error

   if err != nil {
      return remaining, err
   }

   if len(pods.Items) > 0 {
      RecordNodeEvent(recorder, nodeName, nodeUID, v1.EventTypeNormal, "DeletingAllPods", fmt.Sprintf("Deleting all Pods from Node %v.", nodeName))
   }
   
   // Loop over the pods obtained above
   for _, pod := range pods.Items {
      // Defensive check, also needed for tests.
      // Check that pod.Spec.NodeName matches the given nodeName
      if pod.Spec.NodeName != nodeName {
         continue
      }

      // Set reason and message in the pod object.
      // Set the pod's status Reason to "NodeLost" and Message to "Node %v which was running pod %v is unresponsive"; on a conflict error, append to updateErrList and skip the pod
      if _, err = SetPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
         if apierrors.IsConflict(err) {
            updateErrList = append(updateErrList,
               fmt.Errorf("update status failed for pod %q: %v", format.Pod(&pod), err))
            continue
         }
      }
      // if the pod has already been marked for deletion, we still return true that there are remaining pods.
      // If the pod's DeletionGracePeriodSeconds is not nil it is already being deleted, so set remaining to true and skip it
      if pod.DeletionGracePeriodSeconds != nil {
         remaining = true
         continue
      }
      // if the pod is managed by a daemonset, ignore it
      // If the pod is managed by a DaemonSet, skip it
      _, err := daemonStore.GetPodDaemonSets(&pod)
      if err == nil { // No error means at least one daemonset was found
         continue
      }

      glog.V(2).Infof("Starting deletion of pod %v/%v", pod.Namespace, pod.Name)
      recorder.Eventf(&pod, v1.EventTypeNormal, "NodeControllerEviction", "Marking for deletion Pod %s from Node %s", pod.Name, nodeName)
      // Call the apiserver to delete the pod
      if err := kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
         return false, err
      }
      // Mark that pods remain on the node
      remaining = true
   }

   // If updateErrList is non-empty, return remaining as false together with the aggregated errors
   if len(updateErrList) > 0 {
      return false, utilerrors.NewAggregate(updateErrList)
   }
   return remaining, nil
}
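
The SetPodTerminationReason helper called above lives in the same controller_utils.go. Based on the behavior described (Reason "NodeLost" plus the "... is unresponsive" message), it looks roughly like this sketch:

// Sketch of SetPodTerminationReason: mark the pod's status with the NodeLost reason and an
// "unresponsive node" message, then persist it with UpdateStatus. The real helper uses the
// NodeUnreachablePodReason / NodeUnreachablePodMessage constants from pkg/util/node.
func SetPodTerminationReason(kubeClient clientset.Interface, pod *v1.Pod, nodeName string) (*v1.Pod, error) {
   if pod.Status.Reason == "NodeLost" {
      return pod, nil // already marked, nothing to do
   }
   pod.Status.Reason = "NodeLost"
   pod.Status.Message = fmt.Sprintf("Node %v which was running pod %v is unresponsive", nodeName, pod.Name)

   updatedPod, err := kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod)
   if err != nil {
      return nil, err
   }
   return updatedPod, nil
}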

The NodeLifecycleController.monitorNodeStatus method

Finally, NodeLifecycleController.Run starts one goroutine worker that runs nc.monitorNodeStatus() on a period of nc.nodeMonitorPeriod (set with --node-monitor-period, default 5s) to monitor NodeStatus:

  • List the cluster's current nodes (from the local cache) into nodes
  • Call nc.classifyNodes to compare nodes with the controller's cached knownNodeSet and obtain added (in nodes but not in nc.knownNodeSet), deleted (in nc.knownNodeSet but not in nodes), and newZoneRepresentatives (nodes present in both, but whose zone is new and not yet in the controller's cached zoneStates)
  • Call addPodEvictorForNewZone to add each new zone to zonePodEvictor (when useTaintBasedEvictions is false) or zoneNoExecuteTainter (when useTaintBasedEvictions is true)
  • Loop over every node in added:
    • Call addPodEvictorForNewZone to check the node's zone; if the zone is not yet in the controller's cached zoneStates, add it to zonePodEvictor (useTaintBasedEvictions false) or zoneNoExecuteTainter (useTaintBasedEvictions true)
    • Call markNodeAsReachable to remove the unreachable and not-ready taints (useTaintBasedEvictions true), or call cancelPodEviction to remove the node's pod eviction from nc.zonePodEvictor (useTaintBasedEvictions false)
  • Remove the nodes in deleted from nc.knownNodeSet
  • Loop over every node in nodes:
    • Use wait.PollImmediate to call nc.tryUpdateNodeStatus every 20ms to update the node status until the update succeeds or it times out after 100ms, and get back gracePeriod, observedReadyCondition and currentReadyCondition
    • If the node is not a master, append the node's condition to its zone's entry in zoneToNodeConditions
    • If currentReadyCondition is not nil, the node is not brand new, so:
      • If observedReadyCondition.Status is False:
        • If useTaintBasedEvictions is enabled, check whether the node already has the NoExecute taint with key "node.kubernetes.io/unreachable": if it does, remove it and add the NoExecute taint with key "node.kubernetes.io/not-ready"; if it does not, call nc.markNodeForTainting to add the node to its zone's NoExecute taint queue (nc.zoneNoExecuteTainter)
        • If useTaintBasedEvictions is not enabled and the node's Ready condition transitioned more than podEvictionTimeout (default 5m) ago, call nc.evictPods to add the node to the pod eviction queue (nc.zonePodEvictor)
      • If observedReadyCondition.Status is Unknown:
        • If useTaintBasedEvictions is enabled, check whether the node already has the NoExecute taint with key "node.kubernetes.io/not-ready": if it does, remove it and add the NoExecute taint with key "node.kubernetes.io/unreachable"; if it does not, call nc.markNodeForTainting to add the node to its zone's NoExecute taint queue (nc.zoneNoExecuteTainter)
        • If useTaintBasedEvictions is not enabled and the node has not been probed successfully for more than podEvictionTimeout (default 5m), call nc.evictPods to add the node to the pod eviction queue (nc.zonePodEvictor)
      • If observedReadyCondition.Status is True:
        • If useTaintBasedEvictions is enabled, call nc.markNodeAsReachable to remove the "node.kubernetes.io/unreachable" and "node.kubernetes.io/not-ready" NoExecute taints and remove the node from its zone's NoExecute taint queue (nc.zoneNoExecuteTainter)
        • If useTaintBasedEvictions is not enabled, call nc.cancelPodEviction to remove the node from the pod eviction queue (nc.zonePodEvictor)
        • Call nc.markNodeAsNotShutdown to remove the "NoSchedule" taint with key "node.cloudprovider.kubernetes.io/shutdown" from the node
      • If currentReadyCondition.Status is not True while observedReadyCondition.Status is True, call nodeutil.MarkAllPodsNotReady to set the PodReady status of all pods on the node to False
      • If currentReadyCondition.Status is not True and a cloud provider is configured, check whether the node still exists in the cloud provider; if it does not, delete the node from the cluster immediately. Since this involves the cloud provider, this article does not go deeper into that logic
  • Call nc.handleDisruption
k8s.io/kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go:614

func (nc *Controller) monitorNodeStatus() error {
   // We are listing nodes from local cache as we can tolerate some small delays
   // comparing to state from etcd and there is eventual consistency anyway.
   // List the cluster's current nodes (from the local cache) into nodes
   nodes, err := nc.nodeLister.List(labels.Everything())
   if err != nil {
      return err
   }
   // Call nc.classifyNodes to compare nodes with the cached knownNodeSet and get added (in nodes but not in nc.knownNodeSet), deleted (in nc.knownNodeSet but not in nodes) and newZoneRepresentatives (nodes whose zone is not yet in the cached zoneStates)
   added, deleted, newZoneRepresentatives := nc.classifyNodes(nodes)

   // Call addPodEvictorForNewZone to add each new zone to zonePodEvictor (useTaintBasedEvictions false) or zoneNoExecuteTainter (useTaintBasedEvictions true)
   for i := range newZoneRepresentatives {
      nc.addPodEvictorForNewZone(newZoneRepresentatives[i])
   }

   // Loop over the nodes in added: first call addPodEvictorForNewZone to register the node's zone in zonePodEvictor/zoneNoExecuteTainter if it is not yet in the cached zoneStates, then call markNodeAsReachable to remove the unreachable/not-ready taints (useTaintBasedEvictions true) or cancelPodEviction to remove the node's pod eviction from nc.zonePodEvictor (useTaintBasedEvictions false)
   for i := range added {
      glog.V(1).Infof("Controller observed a new Node: %#v", added[i].Name)
      nodeutil.RecordNodeEvent(nc.recorder, added[i].Name, string(added[i].UID), v1.EventTypeNormal, "RegisteredNode", fmt.Sprintf("Registered Node %v in Controller", added[i].Name))
      nc.knownNodeSet[added[i].Name] = added[i]
      nc.addPodEvictorForNewZone(added[i])
      if nc.useTaintBasedEvictions {
         nc.markNodeAsReachable(added[i])
      } else {
         nc.cancelPodEviction(added[i])
      }
   }

   // Remove the nodes in deleted from nc.knownNodeSet
   for i := range deleted {
      glog.V(1).Infof("Controller observed a Node deletion: %v", deleted[i].Name)
      nodeutil.RecordNodeEvent(nc.recorder, deleted[i].Name, string(deleted[i].UID), v1.EventTypeNormal, "RemovingNode", fmt.Sprintf("Removing Node %v from Controller", deleted[i].Name))
      delete(nc.knownNodeSet, deleted[i].Name)
   }

   zoneToNodeConditions := map[string][]*v1.NodeCondition{}
   // Loop over all nodes in nodes
   for i := range nodes {
      var gracePeriod time.Duration
      var observedReadyCondition v1.NodeCondition
      var currentReadyCondition *v1.NodeCondition
      node := nodes[i].DeepCopy()
      // Use wait.PollImmediate to call nc.tryUpdateNodeStatus every 20ms until it succeeds or times out after 100ms, and get back gracePeriod, observedReadyCondition and currentReadyCondition
      if err := wait.PollImmediate(retrySleepTime, retrySleepTime*scheduler.NodeStatusUpdateRetry, func() (bool, error) {
         gracePeriod, observedReadyCondition, currentReadyCondition, err = nc.tryUpdateNodeStatus(node)
         if err == nil {
            return true, nil
         }
         name := node.Name
         node, err = nc.kubeClient.CoreV1().Nodes().Get(name, metav1.GetOptions{})
         if err != nil {
            glog.Errorf("Failed while getting a Node to retry updating NodeStatus. Probably Node %s was deleted.", name)
            return false, err
         }
         return false, nil
      }); err != nil {
         glog.Errorf("Update status of Node '%v' from Controller error: %v. "+
            "Skipping - no pods will be evicted.", node.Name, err)
         continue
      }

      // We do not treat a master node as a part of the cluster for network disruption checking.
      // If the node is not a master, append its condition to its zone's entry in zoneToNodeConditions
      if !system.IsMasterNode(node.Name) {
         zoneToNodeConditions[utilnode.GetZoneKey(node)] = append(zoneToNodeConditions[utilnode.GetZoneKey(node)], currentReadyCondition)
      }

      decisionTimestamp := nc.now()
      // If currentReadyCondition is not nil the node is not brand new, so process it as follows
      if currentReadyCondition != nil {
         // Check eviction timeout against decisionTimestamp
         // If observedReadyCondition.Status is False
         if observedReadyCondition.Status == v1.ConditionFalse {
            // With useTaintBasedEvictions enabled: if the node already has the NoExecute "node.kubernetes.io/unreachable" taint, swap it for the NoExecute "node.kubernetes.io/not-ready" taint; otherwise call nc.markNodeForTainting to add the node to its zone's NoExecute taint queue (nc.zoneNoExecuteTainter)
            if nc.useTaintBasedEvictions {
               // We want to update the taint straight away if Node is already tainted with the UnreachableTaint
               if taintutils.TaintExists(node.Spec.Taints, UnreachableTaintTemplate) {
                  taintToAdd := *NotReadyTaintTemplate
                  if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{UnreachableTaintTemplate}, node) {
                     glog.Errorf("Failed to instantly swap UnreachableTaint to NotReadyTaint. Will try again in the next cycle.")
                  }
               } else if nc.markNodeForTainting(node) {
                  glog.V(2).Infof("Node %v is NotReady as of %v. Adding it to the Taint queue.",
                     node.Name,
                     decisionTimestamp,
                  )
               }
            // Without useTaintBasedEvictions: if the node's Ready condition transitioned more than podEvictionTimeout ago, call nc.evictPods to add the node to the pod eviction queue (nc.zonePodEvictor)
            } else {
               if decisionTimestamp.After(nc.nodeStatusMap[node.Name].readyTransitionTimestamp.Add(nc.podEvictionTimeout)) {
                  if nc.evictPods(node) {
                     glog.V(2).Infof("Node is NotReady. Adding Pods on Node %s to eviction queue: %v is later than %v + %v",
                        node.Name,
                        decisionTimestamp,
                        nc.nodeStatusMap[node.Name].readyTransitionTimestamp,
                        nc.podEvictionTimeout,
                     )
                  }
               }
            }
         }
        
         // Case 2: observedReadyCondition Status is Unknown
         if observedReadyCondition.Status == v1.ConditionUnknown {
            // With useTaintBasedEvictions enabled: if the node already carries the "node.kubernetes.io/not-ready" NoExecute taint, swap it for the "node.kubernetes.io/unreachable" NoExecute taint; otherwise call nc.markNodeForTainting to put the node into its zone's NoExecute taint queue (nc.zoneNoExecuteTainter)
            if nc.useTaintBasedEvictions {
               // We want to update the taint straight away if Node is already tainted with the UnreachableTaint
               if taintutils.TaintExists(node.Spec.Taints, NotReadyTaintTemplate) {
                  taintToAdd := *UnreachableTaintTemplate
                  if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{NotReadyTaintTemplate}, node) {
                     glog.Errorf("Failed to instantly swap UnreachableTaint to NotReadyTaint. Will try again in the next cycle.")
                  }
               } else if nc.markNodeForTainting(node) {
                  glog.V(2).Infof("Node %v is unresponsive as of %v. Adding it to the Taint queue.",
                     node.Name,
                     decisionTimestamp,
                  )
               }
            // Without useTaintBasedEvictions: if the controller last heard from the kubelet (probeTimestamp) more than podEvictionTimeout ago, call nc.evictPods to add the node to the pod eviction queue (nc.zonePodEvictor)
            } else {
               if decisionTimestamp.After(nc.nodeStatusMap[node.Name].probeTimestamp.Add(nc.podEvictionTimeout)) {
                  if nc.evictPods(node) {
                     glog.V(2).Infof("Node is unresponsive. Adding Pods on Node %s to eviction queues: %v is later than %v + %v",
                        node.Name,
                        decisionTimestamp,
                        nc.nodeStatusMap[node.Name].readyTransitionTimestamp,
                        nc.podEvictionTimeout-gracePeriod,
                     )
                  }
               }
            }
         }
         // Case 3: observedReadyCondition Status is True
         if observedReadyCondition.Status == v1.ConditionTrue {
            // With useTaintBasedEvictions enabled: call nc.markNodeAsReachable to remove the "node.kubernetes.io/unreachable" and "node.kubernetes.io/not-ready" NoExecute taints and to drop the node from its zone's NoExecute taint queue (nc.zoneNoExecuteTainter)
            if nc.useTaintBasedEvictions {
               removed, err := nc.markNodeAsReachable(node)
               if err != nil {
                  glog.Errorf("Failed to remove taints from node %v. Will retry in next iteration.", node.Name)
               }
               if removed {
                  glog.V(2).Infof("Node %s is healthy again, removing all taints", node.Name)
               }
            } else {
               // Without useTaintBasedEvictions: call nc.cancelPodEviction to remove the node from the pod eviction queue (nc.zonePodEvictor)
               if nc.cancelPodEviction(node) {
                  glog.V(2).Infof("Node %s is ready again, cancelled pod eviction", node.Name)
               }
            }
            // remove shutdown taint this is needed always depending do we use taintbased or not
            // Call nc.markNodeAsNotShutdown to remove the NoSchedule taint with key "node.cloudprovider.kubernetes.io/shutdown" from the node
            err := nc.markNodeAsNotShutdown(node)
            if err != nil {
               glog.Errorf("Failed to remove taints from node %v. Will retry in next iteration.", node.Name)
            }
         }

         // Report node event.
         // If currentReadyCondition Status is no longer True while observedReadyCondition Status was True, record a NodeNotReady event and call nodeutil.MarkAllPodsNotReady to set the Ready condition of every pod on the node to False
         if currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue {
            nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
            if err = nodeutil.MarkAllPodsNotReady(nc.kubeClient, node); err != nil {
               utilruntime.HandleError(fmt.Errorf("Unable to mark all pods NotReady on node %v: %v", node.Name, err))
            }
         }

         // Check with the cloud provider to see if the node still exists. If it
         // doesn't, delete the node immediately.
         // If currentReadyCondition Status is not True and a cloud provider is configured, check whether the node still exists in the cloud provider; if it does not, delete the node from the cluster immediately
         if currentReadyCondition.Status != v1.ConditionTrue && nc.cloud != nil {
            // check is node shutdowned, if yes do not deleted it. Instead add taint
            shutdown, err := nc.nodeShutdownInCloudProvider(context.TODO(), node)
            if err != nil {
               glog.Errorf("Error determining if node %v shutdown in cloud: %v", node.Name, err)
            }
            // node shutdown
            if shutdown && err == nil {
               err = controller.AddOrUpdateTaintOnNode(nc.kubeClient, node.Name, controller.ShutdownTaint)
               if err != nil {
                  glog.Errorf("Error patching node taints: %v", err)
               }
               continue
            }
            exists, err := nc.nodeExistsInCloudProvider(types.NodeName(node.Name))
            if err != nil {
               glog.Errorf("Error determining if node %v exists in cloud: %v", node.Name, err)
               continue
            }
            if !exists {
               glog.V(2).Infof("Deleting node (no longer present in cloud provider): %s", node.Name)
               nodeutil.RecordNodeEvent(nc.recorder, node.Name, string(node.UID), v1.EventTypeNormal, "DeletingNode", fmt.Sprintf("Deleting Node %v because it's not present according to cloud provider", node.Name))
               go func(nodeName string) {
                  defer utilruntime.HandleCrash()
                  // Kubelet is not reporting and Cloud Provider says node
                  // is gone. Delete it without worrying about grace
                  // periods.
                  if err := nodeutil.ForcefullyDeleteNode(nc.kubeClient, nodeName); err != nil {
                     glog.Errorf("Unable to forcefully delete node %q: %v", nodeName, err)
                  }
               }(node.Name)
            }
         }
      }
   }
   // Finally, call nc.handleDisruption to evaluate the state of each zone
   nc.handleDisruption(zoneToNodeConditions, nodes)

   return nil
}
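The taint swap in the loop above works on two fixed NoExecute taint templates, UnreachableTaintTemplate and NotReadyTaintTemplate. As a reference-only sketch (the keys are the ones named in the comments above; the variable names and exact definition in the 1.12 source may differ slightly), they look roughly like this:

import v1 "k8s.io/api/core/v1"

// Sketch of the two NoExecute taint templates the monitor loop swaps between.
var (
   unreachableTaint = &v1.Taint{
      Key:    "node.kubernetes.io/unreachable", // kubelet stopped reporting status
      Effect: v1.TaintEffectNoExecute,
   }
   notReadyTaint = &v1.Taint{
      Key:    "node.kubernetes.io/not-ready", // kubelet reported NotReady
      Effect: v1.TaintEffectNoExecute,
   }
)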

NodeLifecycleController.tryUpdateNodeStatus() method

monitorNodeStatus() calls tryUpdateNodeStatus() to update the status of every node in the cluster. tryUpdateNodeStatus() implements the following logic:

  • Get the node's current Ready condition into currentReadyCondition
  • If currentReadyCondition is nil, the node has just joined the cluster: set observedReadyCondition to a NodeReady condition with Status Unknown, set gracePeriod to nc.nodeStartupGracePeriod, and record node.Status in nc.nodeStatusMap
  • If currentReadyCondition is not nil, copy it into observedReadyCondition and set gracePeriod to nc.nodeMonitorGracePeriod
  • Look up the node's cached status in nc.nodeStatusMap as savedNodeStatus; if found, extract its NodeReady condition into savedCondition
  • Extract the NodeReady condition of the current node status into observedCondition
  • If any of the following holds, set savedNodeStatus.status to node.Status and both probeTimestamp and readyTransitionTimestamp to the current time
    • the node was not found in nc.nodeStatusMap
    • savedNodeStatus was found, but savedCondition is nil while observedCondition is not nil
    • savedNodeStatus was found, but savedCondition is not nil while observedCondition is nil
  • If savedCondition and observedCondition are both non-nil and their LastHeartbeatTime differs, set savedNodeStatus.status to node.Status and probeTimestamp to the current time; if their LastTransitionTime also differs, set readyTransitionTimestamp to the current time, otherwise keep savedNodeStatus.readyTransitionTimestamp
  • Write savedNodeStatus back to nc.nodeStatusMap
  • Check whether the node's status report has timed out, i.e. whether probeTimestamp + gracePeriod is earlier than the current time; if it has, run the following
    • If currentReadyCondition is nil (the node just joined the cluster), append a NodeReady condition with Status Unknown to the node's status conditions
    • If currentReadyCondition is not nil and observedReadyCondition is not Unknown, the node used to be Ready but has not reported within gracePeriod, i.e. it has been unresponsive for a long time, so set currentReadyCondition to Unknown
    • Also set the node's "OutOfDisk", "MemoryPressure" and "DiskPressure" conditions to Unknown
    • Update the node's status through the API server and store the updated status back into nc.nodeStatusMap
k8s.io/kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go:804

func (nc *Controller) tryUpdateNodeStatus(node *v1.Node) (time.Duration, v1.NodeCondition, *v1.NodeCondition, error) {
   var err error
   var gracePeriod time.Duration
   var observedReadyCondition v1.NodeCondition
   // Get the node's current Ready condition into currentReadyCondition
   _, currentReadyCondition := v1node.GetNodeCondition(&node.Status, v1.NodeReady)
   // If currentReadyCondition is nil, the node has just joined the cluster: set observedReadyCondition to NodeReady with Status Unknown, set gracePeriod to nc.nodeStartupGracePeriod, and record node.Status in nc.nodeStatusMap
   if currentReadyCondition == nil {
      // If ready condition is nil, then kubelet (or nodecontroller) never posted node status.
      // A fake ready condition is created, where LastProbeTime and LastTransitionTime is set
      // to node.CreationTimestamp to avoid handle the corner case.
      observedReadyCondition = v1.NodeCondition{
         Type:               v1.NodeReady,
         Status:             v1.ConditionUnknown,
         LastHeartbeatTime:  node.CreationTimestamp,
         LastTransitionTime: node.CreationTimestamp,
      }
      gracePeriod = nc.nodeStartupGracePeriod
      nc.nodeStatusMap[node.Name] = nodeStatusData{
         status:                   node.Status,
         probeTimestamp:           node.CreationTimestamp,
         readyTransitionTimestamp: node.CreationTimestamp,
      }
   // If currentReadyCondition is not nil, copy it into observedReadyCondition and set gracePeriod to nc.nodeMonitorGracePeriod
   } else {
      // If ready condition is not nil, make a copy of it, since we may modify it in place later.
      observedReadyCondition = *currentReadyCondition
      gracePeriod = nc.nodeMonitorGracePeriod
   }

   // Look up the node's cached status in nc.nodeStatusMap as savedNodeStatus
   savedNodeStatus, found := nc.nodeStatusMap[node.Name]
   // There are following cases to check:
   // - both saved and new status have no Ready Condition set - we leave everything as it is,
   // - saved status have no Ready Condition, but current one does - Controller was restarted with Node data already present in etcd,
   // - saved status have some Ready Condition, but current one does not - it's an error, but we fill it up because that's probably a good thing to do,
   // - both saved and current statuses have Ready Conditions and they have the same LastProbeTime - nothing happened on that Node, it may be
   //   unresponsive, so we leave it as it is,
   // - both saved and current statuses have Ready Conditions, they have different LastProbeTimes, but the same Ready Condition State -
   //   everything's in order, no transition occurred, we update only probeTimestamp,
   // - both saved and current statuses have Ready Conditions, different LastProbeTimes and different Ready Condition State -
   //   Ready Condition changed it state since we last seen it, so we update both probeTimestamp and readyTransitionTimestamp.
   // TODO: things to consider:
   //   - if 'LastProbeTime' have gone back in time its probably an error, currently we ignore it,
   //   - currently only correct Ready State transition outside of Node Controller is marking it ready by Kubelet, we don't check
   //     if that's the case, but it does not seem necessary.
   var savedCondition *v1.NodeCondition
   // If the lookup succeeded, extract the NodeReady condition from savedNodeStatus into savedCondition
   if found {
      _, savedCondition = v1node.GetNodeCondition(&savedNodeStatus.status, v1.NodeReady)
   }
   // Extract the NodeReady condition of the current node status into observedCondition
   _, observedCondition := v1node.GetNodeCondition(&node.Status, v1.NodeReady)
   // If any of the following holds, set savedNodeStatus.status to node.Status and both
   // probeTimestamp and readyTransitionTimestamp to the current time:
   //   (1) the node was not found in nc.nodeStatusMap
   //   (2) savedNodeStatus was found, but savedCondition is nil while observedCondition is not nil
   //   (3) savedNodeStatus was found, but savedCondition is not nil while observedCondition is nil
   if !found {
      glog.Warningf("Missing timestamp for Node %s. Assuming now as a timestamp.", node.Name)
      savedNodeStatus = nodeStatusData{
         status:                   node.Status,
         probeTimestamp:           nc.now(),
         readyTransitionTimestamp: nc.now(),
      }
   } else if savedCondition == nil && observedCondition != nil {
      glog.V(1).Infof("Creating timestamp entry for newly observed Node %s", node.Name)
      savedNodeStatus = nodeStatusData{
         status:                   node.Status,
         probeTimestamp:           nc.now(),
         readyTransitionTimestamp: nc.now(),
      }
   } else if savedCondition != nil && observedCondition == nil {
      glog.Errorf("ReadyCondition was removed from Status of Node %s", node.Name)
      // TODO: figure out what to do in this case. For now we do the same thing as above.
      savedNodeStatus = nodeStatusData{
         status:                   node.Status,
         probeTimestamp:           nc.now(),
         readyTransitionTimestamp: nc.now(),
      }
   // If savedCondition and observedCondition are both non-nil and their LastHeartbeatTime differs, set savedNodeStatus.status to node.Status and probeTimestamp to the current time; if their LastTransitionTime also differs, set readyTransitionTimestamp to the current time, otherwise keep savedNodeStatus.readyTransitionTimestamp
   } else if savedCondition != nil && observedCondition != nil && savedCondition.LastHeartbeatTime != observedCondition.LastHeartbeatTime {
      var transitionTime metav1.Time
      // If ReadyCondition changed since the last time we checked, we update the transition timestamp to "now",
      // otherwise we leave it as it is.
      if savedCondition.LastTransitionTime != observedCondition.LastTransitionTime {
         glog.V(3).Infof("ReadyCondition for Node %s transitioned from %v to %v", node.Name, savedCondition, observedCondition)
         transitionTime = nc.now()
      } else {
         transitionTime = savedNodeStatus.readyTransitionTimestamp
      }
      if glog.V(5) {
         glog.V(5).Infof("Node %s ReadyCondition updated. Updating timestamp: %+v vs %+v.", node.Name, savedNodeStatus.status, node.Status)
      } else {
         glog.V(3).Infof("Node %s ReadyCondition updated. Updating timestamp.", node.Name)
      }
      savedNodeStatus = nodeStatusData{
         status:                   node.Status,
         probeTimestamp:           nc.now(),
         readyTransitionTimestamp: transitionTime,
      }
   }
   // Write savedNodeStatus back to nc.nodeStatusMap
   nc.nodeStatusMap[node.Name] = savedNodeStatus

   // Check whether the node's status report has timed out, i.e. whether probeTimestamp + gracePeriod is earlier than the current time; if it has, run the logic below
   if nc.now().After(savedNodeStatus.probeTimestamp.Add(gracePeriod)) {
      // NodeReady condition was last set longer ago than gracePeriod, so update it to Unknown
      // (regardless of its current value) in the master.
      // If currentReadyCondition is nil (the node just joined the cluster), append a NodeReady condition with Status Unknown to the node's status conditions
      if currentReadyCondition == nil {
         glog.V(2).Infof("node %v is never updated by kubelet", node.Name)
         node.Status.Conditions = append(node.Status.Conditions, v1.NodeCondition{
            Type:               v1.NodeReady,
            Status:             v1.ConditionUnknown,
            Reason:             "NodeStatusNeverUpdated",
            Message:            fmt.Sprintf("Kubelet never posted node status."),
            LastHeartbeatTime:  node.CreationTimestamp,
            LastTransitionTime: nc.now(),
         })
      // If currentReadyCondition is not nil and observedReadyCondition is not Unknown, the node used to be Ready but its last Ready status report was more than gracePeriod ago, i.e. it has been unresponsive for a long time, so set currentReadyCondition to Unknown
      } else {
         glog.V(4).Infof("node %v hasn't been updated for %+v. Last ready condition is: %+v",
            node.Name, nc.now().Time.Sub(savedNodeStatus.probeTimestamp.Time), observedReadyCondition)
         if observedReadyCondition.Status != v1.ConditionUnknown {
            currentReadyCondition.Status = v1.ConditionUnknown
            currentReadyCondition.Reason = "NodeStatusUnknown"
            currentReadyCondition.Message = "Kubelet stopped posting node status."
            // LastProbeTime is the last time we heard from kubelet.
            currentReadyCondition.LastHeartbeatTime = observedReadyCondition.LastHeartbeatTime
            currentReadyCondition.LastTransitionTime = nc.now()
         }
      }

      // remaining node conditions should also be set to Unknown
      remainingNodeConditionTypes := []v1.NodeConditionType{
         v1.NodeOutOfDisk,
         v1.NodeMemoryPressure,
         v1.NodeDiskPressure,
         // We don't change 'NodeNetworkUnavailable' condition, as it's managed on a control plane level.
         // v1.NodeNetworkUnavailable,
      }

      // Also set the node's "OutOfDisk", "MemoryPressure" and "DiskPressure" conditions to Unknown
      nowTimestamp := nc.now()
      for _, nodeConditionType := range remainingNodeConditionTypes {
         _, currentCondition := v1node.GetNodeCondition(&node.Status, nodeConditionType)
         if currentCondition == nil {
            glog.V(2).Infof("Condition %v of node %v was never updated by kubelet", nodeConditionType, node.Name)
            node.Status.Conditions = append(node.Status.Conditions, v1.NodeCondition{
               Type:               nodeConditionType,
               Status:             v1.ConditionUnknown,
               Reason:             "NodeStatusNeverUpdated",
               Message:            "Kubelet never posted node status.",
               LastHeartbeatTime:  node.CreationTimestamp,
               LastTransitionTime: nowTimestamp,
            })
         } else {
            glog.V(4).Infof("node %v hasn't been updated for %+v. Last %v is: %+v",
               node.Name, nc.now().Time.Sub(savedNodeStatus.probeTimestamp.Time), nodeConditionType, currentCondition)
            if currentCondition.Status != v1.ConditionUnknown {
               currentCondition.Status = v1.ConditionUnknown
               currentCondition.Reason = "NodeStatusUnknown"
               currentCondition.Message = "Kubelet stopped posting node status."
               currentCondition.LastTransitionTime = nowTimestamp
            }
         }
      }

      // If the Ready condition changed, update the node's status through the API server and store the updated status back into nc.nodeStatusMap
      _, currentCondition := v1node.GetNodeCondition(&node.Status, v1.NodeReady)
      if !apiequality.Semantic.DeepEqual(currentCondition, &observedReadyCondition) {
         if _, err = nc.kubeClient.CoreV1().Nodes().UpdateStatus(node); err != nil {
            glog.Errorf("Error updating node %s: %v", node.Name, err)
            return gracePeriod, observedReadyCondition, currentReadyCondition, err
         }
         nc.nodeStatusMap[node.Name] = nodeStatusData{
            status:                   node.Status,
            probeTimestamp:           nc.nodeStatusMap[node.Name].probeTimestamp,
            readyTransitionTimestamp: nc.now(),
         }
         return gracePeriod, observedReadyCondition, currentReadyCondition, nil
      }
   }
   
   return gracePeriod, observedReadyCondition, currentReadyCondition, err
}
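tryUpdateNodeStatus() keeps its per-node bookkeeping in nc.nodeStatusMap, a map from node name to nodeStatusData. The following is only a sketch of what that cache entry carries, with field names taken from the usage above (check the 1.12 source for the authoritative definition):

import (
   v1 "k8s.io/api/core/v1"
   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch of the per-node cache entry stored in nc.nodeStatusMap (map[string]nodeStatusData).
type nodeStatusData struct {
   // probeTimestamp is bumped whenever the controller observes a new heartbeat from the kubelet.
   probeTimestamp metav1.Time
   // readyTransitionTimestamp is bumped only when the Ready condition actually changes state.
   readyTransitionTimestamp metav1.Time
   // status is the last NodeStatus observed for this node.
   status v1.NodeStatus
}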

NodeLifecycleController.handleDisruption() method

At the end of monitorNodeStatus(), handleDisruption() is called to evaluate the state of every zone in the cluster. Its logic is as follows:

  • For every zone in zoneToNodeConditions, compute its state; if any zone's state is not stateFullDisruption, set allAreFullyDisrupted to false. Also make sure every zone in zoneToNodeConditions exists in nc.zoneStates, adding any missing zone with state stateInitial
  • Walk the zones cached in nc.zoneStates: if a zone no longer appears in zoneToNodeConditions, delete it; otherwise, if its cached state is not stateFullDisruption, set allWasFullyDisrupted to false (note: the previous step adds zones that are in zoneToNodeConditions but not yet in nc.zoneStates as stateInitial, so a newly added zone also makes allWasFullyDisrupted false)
  • If allAreFullyDisrupted or allWasFullyDisrupted is false, run the following
    • If allAreFullyDisrupted is true, the cluster has switched from "not all zones fully disrupted" to "all zones fully disrupted"; do the following and return
      • With useTaintBasedEvictions enabled, remove the "node.kubernetes.io/unreachable" and "node.kubernetes.io/not-ready" NoExecute taints from every node and remove the nodes from zoneNoExecuteTainter; without it, remove the nodes from nc.zonePodEvictor, cancelling pending pod evictions
      • Set the limiter of every nc.zoneNoExecuteTainter or nc.zonePodEvictor to 0, i.e. stop all evictions
      • Set every zone in nc.zoneStates to stateFullDisruption
    • If allWasFullyDisrupted is true, the cluster has switched from "all zones fully disrupted" to "not all zones fully disrupted"; do the following and return
      • Reset probeTimestamp and readyTransitionTimestamp of every entry in nc.nodeStatusMap to the current time
      • Call nc.setLimiterInZone to set each zone's limiter
    • If neither branch was taken, then neither before nor now are all zones in stateFullDisruption; for every zone in nc.zoneStates whose state changed, call nc.setLimiterInZone to adjust its limiter and update the cached state
k8s.io/kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go:974

func (nc *Controller) handleDisruption(zoneToNodeConditions map[string][]*v1.NodeCondition, nodes []*v1.Node) {
   newZoneStates := map[string]ZoneState{}
   allAreFullyDisrupted := true
   // For every zone in zoneToNodeConditions, compute its state; if any zone's state is not stateFullDisruption, set allAreFullyDisrupted to false. Also make sure every zone exists in nc.zoneStates, adding missing zones with state stateInitial
   for k, v := range zoneToNodeConditions {
      zoneSize.WithLabelValues(k).Set(float64(len(v)))
      unhealthy, newState := nc.computeZoneStateFunc(v)
      zoneHealth.WithLabelValues(k).Set(float64(100*(len(v)-unhealthy)) / float64(len(v)))
      unhealthyNodes.WithLabelValues(k).Set(float64(unhealthy))
      if newState != stateFullDisruption {
         allAreFullyDisrupted = false
      }
      newZoneStates[k] = newState
      if _, had := nc.zoneStates[k]; !had {
         glog.Errorf("Setting initial state for unseen zone: %v", k)
         nc.zoneStates[k] = stateInitial
      }
   }

   // Walk the zones cached in nc.zoneStates: if a zone no longer appears in zoneToNodeConditions, delete it; otherwise, if its cached state is not stateFullDisruption, set allWasFullyDisrupted to false (note: the previous step adds newly seen zones as stateInitial, so a newly added zone also makes allWasFullyDisrupted false)
   allWasFullyDisrupted := true
   for k, v := range nc.zoneStates {
      if _, have := zoneToNodeConditions[k]; !have {
         zoneSize.WithLabelValues(k).Set(0)
         zoneHealth.WithLabelValues(k).Set(100)
         unhealthyNodes.WithLabelValues(k).Set(0)
         delete(nc.zoneStates, k)
         continue
      }
      if v != stateFullDisruption {
         allWasFullyDisrupted = false
         break
      }
   }

   // At least one node was responding in previous pass or in the current pass. Semantics is as follows:
   // - if the new state is "partialDisruption" we call a user defined function that returns a new limiter to use,
   // - if the new state is "normal" we resume normal operation (go back to default limiter settings),
   // - if new state is "fullDisruption" we restore normal eviction rate,
   //   - unless all zones in the cluster are in "fullDisruption" - in that case we stop all evictions.
   // If allAreFullyDisrupted or allWasFullyDisrupted is false, run the logic below
   if !allAreFullyDisrupted || !allWasFullyDisrupted {
      // We're switching to full disruption mode
      // If allAreFullyDisrupted is true, the cluster has switched from "not all zones fully disrupted" to "all zones fully disrupted"; do the following and return:
      //   (1) with useTaintBasedEvictions enabled, remove the "node.kubernetes.io/unreachable" and "node.kubernetes.io/not-ready" NoExecute taints from every node and remove the nodes from zoneNoExecuteTainter; without it, remove the nodes from nc.zonePodEvictor, cancelling pending pod evictions
      //   (2) set the limiter of every nc.zoneNoExecuteTainter or nc.zonePodEvictor to 0, i.e. stop all evictions
      //   (3) set every zone in nc.zoneStates to stateFullDisruption
      if allAreFullyDisrupted {
         glog.V(0).Info("Controller detected that all Nodes are not-Ready. Entering master disruption mode.")
         for i := range nodes {
            if nc.useTaintBasedEvictions {
               _, err := nc.markNodeAsReachable(nodes[i])
               if err != nil {
                  glog.Errorf("Failed to remove taints from Node %v", nodes[i].Name)
               }
            } else {
               nc.cancelPodEviction(nodes[i])
            }
         }
         // We stop all evictions.
         for k := range nc.zoneStates {
            if nc.useTaintBasedEvictions {
               nc.zoneNoExecuteTainter[k].SwapLimiter(0)
            } else {
               nc.zonePodEvictor[k].SwapLimiter(0)
            }
         }
         for k := range nc.zoneStates {
            nc.zoneStates[k] = stateFullDisruption
         }
         // All rate limiters are updated, so we can return early here.
         return
      }
      // We're exiting full disruption mode
      // If allWasFullyDisrupted is true, the cluster has switched from "all zones fully disrupted" to "not all zones fully disrupted"; do the following and return:
      //   (1) reset probeTimestamp and readyTransitionTimestamp of every entry in nc.nodeStatusMap to the current time
      //   (2) call nc.setLimiterInZone to set each zone's limiter
      if allWasFullyDisrupted {
         glog.V(0).Info("Controller detected that some Nodes are Ready. Exiting master disruption mode.")
         // When exiting disruption mode update probe timestamps on all Nodes.
         now := nc.now()
         for i := range nodes {
            v := nc.nodeStatusMap[nodes[i].Name]
            v.probeTimestamp = now
            v.readyTransitionTimestamp = now
            nc.nodeStatusMap[nodes[i].Name] = v
         }
         // We reset all rate limiters to settings appropriate for the given state.
         for k := range nc.zoneStates {
            nc.setLimiterInZone(k, len(zoneToNodeConditions[k]), newZoneStates[k])
            nc.zoneStates[k] = newZoneStates[k]
         }
         return
      }
      // We know that there's at least one not-fully disrupted so,
      // we can use default behavior for rate limiters
      // Reaching this point means that neither before nor now are all zones fully disrupted; for every zone in nc.zoneStates whose state changed, call nc.setLimiterInZone to adjust its limiter and update the cached state
      for k, v := range nc.zoneStates {
         newState := newZoneStates[k]
         if v == newState {
            continue
         }
         glog.V(0).Infof("Controller detected that zone %v is now in state %v.", k, newState)
         nc.setLimiterInZone(k, len(zoneToNodeConditions[k]), newState)
         nc.zoneStates[k] = newState
      }
   }
}
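The zone state used above comes from nc.computeZoneStateFunc, which maps a zone's list of Ready conditions to stateNormal, statePartialDisruption or stateFullDisruption based on --unhealthy-zone-threshold. The following is only a sketch of that decision rule, written in the context of the controller package (the real implementation is Controller.ComputeZoneState in the same file; corner cases may differ):

// Sketch: classify a zone from its nodes' Ready conditions.
// Returns the number of not-ready nodes and the resulting ZoneState.
func computeZoneState(conditions []*v1.NodeCondition, unhealthyZoneThreshold float32) (int, ZoneState) {
   ready, notReady := 0, 0
   for _, c := range conditions {
      if c != nil && c.Status == v1.ConditionTrue {
         ready++
      } else {
         notReady++
      }
   }
   switch {
   case ready == 0 && notReady > 0:
      // No node in the zone is Ready: the whole zone is fully disrupted.
      return notReady, stateFullDisruption
   case notReady > 2 && float32(notReady)/float32(ready+notReady) >= unhealthyZoneThreshold:
      // Enough nodes are NotReady and their ratio crosses the threshold: partial disruption.
      return notReady, statePartialDisruption
   default:
      return notReady, stateNormal
   }
}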

NodeLifecycleController.setLimiterInZone() method

handleDisruption() calls setLimiterInZone() to set the eviction limiter of each zone in the cluster. The limiter is chosen as follows:

  • stateNormal: set the limiter of zoneNoExecuteTainter or zonePodEvictor to nc.evictionLimiterQPS, i.e. the value of --node-eviction-rate
  • statePartialDisruption: set the limiter to 0 if the zone has at most nc.largeClusterThreshold nodes, otherwise to nc.secondaryEvictionLimiterQPS, i.e. the value of --secondary-node-eviction-rate
  • stateFullDisruption: set the limiter to nc.evictionLimiterQPS, i.e. the value of --node-eviction-rate
k8s.io/kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go:1072

func (nc *Controller) setLimiterInZone(zone string, zoneSize int, state ZoneState) {
   switch state {
   // stateNormal: set the limiter of zoneNoExecuteTainter or zonePodEvictor to nc.evictionLimiterQPS, i.e. the value of --node-eviction-rate
   case stateNormal:
      if nc.useTaintBasedEvictions {
         nc.zoneNoExecuteTainter[zone].SwapLimiter(nc.evictionLimiterQPS)
      } else {
         nc.zonePodEvictor[zone].SwapLimiter(nc.evictionLimiterQPS)
      }
   // statePartialDisruption: set the limiter to 0 if the zone has at most nc.largeClusterThreshold nodes, otherwise to nc.secondaryEvictionLimiterQPS, i.e. the value of --secondary-node-eviction-rate
   case statePartialDisruption:
      if nc.useTaintBasedEvictions {
         nc.zoneNoExecuteTainter[zone].SwapLimiter(
            nc.enterPartialDisruptionFunc(zoneSize))
      } else {
         nc.zonePodEvictor[zone].SwapLimiter(
            nc.enterPartialDisruptionFunc(zoneSize))
      }
   // stateFullDisruption: set the limiter to nc.evictionLimiterQPS, i.e. the value of --node-eviction-rate
   case stateFullDisruption:
      if nc.useTaintBasedEvictions {
         nc.zoneNoExecuteTainter[zone].SwapLimiter(
            nc.enterFullDisruptionFunc(zoneSize))
      } else {
         nc.zonePodEvictor[zone].SwapLimiter(
            nc.enterFullDisruptionFunc(zoneSize))
      }
   }
}
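The two callbacks used above, nc.enterPartialDisruptionFunc and nc.enterFullDisruptionFunc, are wired up in NewNodeLifecycleController. Here is a hedged sketch of their behaviour in the controller package context (in the 1.12 source they correspond to ReducedQPSFunc and HealthyQPSFunc; names and receivers here are illustrative):

// Sketch: QPS while a zone is in statePartialDisruption. Zones with at most
// largeClusterThreshold nodes get 0 (evictions stop); larger zones fall back
// to --secondary-node-eviction-rate.
func (nc *Controller) reducedQPSFunc(nodeNum int) float32 {
   if int32(nodeNum) > nc.largeClusterThreshold {
      return nc.secondaryEvictionLimiterQPS
   }
   return 0
}

// Sketch: QPS while a zone is in stateFullDisruption (and for stateNormal),
// i.e. the regular --node-eviction-rate.
func (nc *Controller) healthyQPSFunc(nodeNum int) float32 {
   return nc.evictionLimiterQPS
}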

Summary

This completes the walkthrough of node-lifecycle-controller. The controller watches pod and node add, update and delete events, monitors changes of each node's Ready condition status, updates node taints and evicts the pods on affected nodes. In detail:

  • It starts the NoExecuteTaintManager; on pod and node add, update and delete events the taint manager checks whether a pod tolerates all NoExecute taints of its node and evicts the pod if it does not (requires --enable-taint-manager=true, which is the default)
  • It watches node add and update events and updates the NoSchedule taints on nodes (requires the TaintNodesByCondition feature gate, default true)
  • It monitors changes of each node's Ready condition status, updates the node's NoExecute taints via the zoneNoExecuteTainter queue and lets the taint manager evict the pods on the node (when the TaintBasedEvictions feature gate is true; default false)
  • It monitors changes of each node's Ready condition status and, via the zonePodEvictor queue, evicts the pods on the node by calling the API server's delete endpoint (when the TaintBasedEvictions feature gate is false, the default)
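For example, the last two behaviours are toggled purely by the TaintBasedEvictions feature gate on kube-controller-manager; a minimal way to switch to the taint-driven eviction path (a sketch, not a complete flag set) would be:

kube-controller-manager \
   --enable-taint-manager=true \
   --feature-gates=TaintBasedEvictions=true,TaintNodesByCondition=true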

Reposted from: https://my.oschina.net/u/3797264/blog/2885807
