Deployment Controller

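The deployment controller is one of the many controllers inside kube-controller-manager and is the controller for Deployment objects. It watches three resource types: Deployment, ReplicaSet, and Pod. When any of them changes, the deployment controller reconciles the affected Deployment, which covers scaling, pause and resume, updates, rollbacks, status updates, and cleanup of the old ReplicaSets that the Deployment owns.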

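  • What a Deployment is for
    A Deployment is the Kubernetes object used to deploy stateless applications, and it is the most commonly used workload object.
  • How Deployment, ReplicaSet, and Pod relate
    A Deployment controls ReplicaSets, and a ReplicaSet controls Pods; the controllers then drive each object toward its desired state.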

Deployment and the controller pattern

In K8s, the Pod is the smallest resource unit, and Pod replicas are managed through a ReplicaSet (RS). A Deployment builds a higher-level layer on top of the RS.


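This is the Kubernetes controller pattern: a higher-level resource gains new capabilities by controlling a lower-level one. A Deployment does not manage Pods directly; it controls Pod replicas by managing ReplicaSets. Controlling ReplicaSets is also how a Deployment implements version management: each release corresponds to one revision, each revision has its own RS, and the revision number is recorded in an annotation. Each RS in turn runs Pods according to its pod template and replica count. The Deployment only has to keep its ReplicaSets in the expected state, and each RS keeps its Pods in the expected state.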

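How K8s manages a Deployment

With the Deployment's place in K8s established, we can look at how this resource reaches its desired state.

Kubernetes APIs and controllers are level-triggered, which makes the system self-healing and lets it reconcile periodically.

The term comes from hardware interrupts, which can be either level-triggered or edge-triggered:

  • Level-triggered: the system depends only on the current state. Even if it misses an event (for example because it crashed), it can still respond correctly after recovering by looking at the current state of the signal.
  • Edge-triggered: the system depends on past state as well as the current state. If it misses an event (an "edge"), it has to see that event again before it can recover.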

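Kubernetes implements its level-triggered API as follows: a controller watches the actual state of a resource object, compares it with the object's desired state, and adjusts the actual state until the two match.
A level-triggered API is also called a declarative API. The controller that watches Deployment objects and makes sure they match the desired state is the deployment controller; the corresponding controller for an RS is the rs controller.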
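
To make the level-triggered idea concrete, here is a minimal sketch. It is not Kubernetes code: reconcile and the replica counts are hypothetical stand-ins. The point it illustrates is that the decision uses only the current desired and observed state, so events missed while the controller was down do not matter.

```go
package main

import "fmt"

// reconcile is a level-triggered step: it looks only at the current
// desired and observed replica counts, never at the events that led here.
// It returns how many replicas to create (positive) or delete (negative).
func reconcile(desired, observed int) int {
	return desired - observed
}

func main() {
	desired, observed := 3, 3

	// Two pods disappear while the controller is down. The individual
	// delete events are lost, but the current state is still observable.
	observed -= 2

	delta := reconcile(desired, observed)
	observed += delta
	fmt.Println(delta, observed) // prints "2 3"
}
```

An edge-triggered design would instead need to replay the two missed delete events before it could arrive at the same result.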

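The rough structure and processing flow of the deployment controller are shown in the figure below. The deployment controller registers event handlers for Pod, ReplicaSet, and Deployment objects. When an event is observed through the watch, the corresponding Deployment is put on the queue. The syncDeployment method holds the core reconcile logic: it takes a Deployment off the queue and reconciles it.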


[Figure: deployment controller components and processing flow]

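The analysis of the deployment controller is split into two parts:

  • initialization and startup of the deployment controller;
  • the processing logic of the deployment controller.

Initialization

The entry point is in /kubernetes-master/cmd/kube-controller-manager/app/apps.go: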

func startDeploymentController(ctx context.Context, controllerContext ControllerContext) (controller.Interface, bool, error) {
    dc, err := deployment.NewDeploymentController(
        controllerContext.InformerFactory.Apps().V1().Deployments(),
        controllerContext.InformerFactory.Apps().V1().ReplicaSets(),
        controllerContext.InformerFactory.Core().V1().Pods(),
        controllerContext.ClientBuilder.ClientOrDie("deployment-controller"),
    )
    if err != nil {
        return nil, true, fmt.Errorf("error creating Deployment controller: %v", err)
    }
    go dc.Run(ctx, int(controllerContext.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs))
    return nil, true, nil
}

First, the definition of DeploymentController in K8s:

type DeploymentController struct {
    rsControl     controller.RSControlInterface
    client        clientset.Interface
    eventRecorder record.EventRecorder
    syncHandler func(ctx context.Context, dKey string) error
    enqueueDeployment func(deployment *apps.Deployment)
    dLister appslisters.DeploymentLister
    rsLister appslisters.ReplicaSetLister
    podLister corelisters.PodLister
    dListerSynced cache.InformerSynced
    rsListerSynced cache.InformerSynced
    podListerSynced cache.InformerSynced
    queue workqueue.RateLimitingInterface
}

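It consists of the following parts:

  • rsControl is a helper for working with ReplicaSets, used to adopt and orphan them; client is the client that talks to the APIServer;
  • eventRecorder records events; syncHandler performs the sync for a deployment; enqueueDeployment puts a deployment on the queue;
  • dLister, rsLister, and podLister read the respective resources from the shared informer store; dListerSynced, rsListerSynced, and podListerSynced report whether the shared informer store has completed its initial sync;
  • queue is the workqueue. Whenever a deployment, replicaSet, or pod changes, the corresponding deployment is pushed onto this queue, and syncHandler() processes deployments from the workqueue in one place.

As the code of deployment.NewDeploymentController shows, the deployment controller registers EventHandlers for deployment, replicaset, and pod objects; that is, it listens for events on these objects and enqueues the affected deployment for processing. It also assigns dc.syncDeployment to dc.syncHandler, registering it as the core handler; dc.Run invokes this handler to reconcile deployment objects.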

func NewDeploymentController(dInformer appsinformers.DeploymentInformer, rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
    eventBroadcaster := record.NewBroadcaster()
    eventBroadcaster.StartStructuredLogging(0)
    eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: client.CoreV1().Events("")})

    if client != nil && client.CoreV1().RESTClient().GetRateLimiter() != nil {
        if err := ratelimiter.RegisterMetricAndTrackRateLimiterUsage("deployment_controller", client.CoreV1().RESTClient().GetRateLimiter()); err != nil {
            return nil, err
        }
    }
    dc := &DeploymentController{
        client:        client,
        eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "deployment-controller"}),
        queue:         workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
    }
    dc.rsControl = controller.RealRSControl{
        KubeClient: client,
        Recorder:   dc.eventRecorder,
    }

    dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    dc.addDeployment,
        UpdateFunc: dc.updateDeployment,
        // This will enter the sync loop and no-op, because the deployment has been deleted from the store.
        DeleteFunc: dc.deleteDeployment,
    })
    rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    dc.addReplicaSet,
        UpdateFunc: dc.updateReplicaSet,
        DeleteFunc: dc.deleteReplicaSet,
    })
    podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        DeleteFunc: dc.deletePod,
    })

    dc.syncHandler = dc.syncDeployment
    dc.enqueueDeployment = dc.enqueue

    dc.dLister = dInformer.Lister()
    dc.rsLister = rsInformer.Lister()
    dc.podLister = podInformer.Lister()
    dc.dListerSynced = dInformer.Informer().HasSynced
    dc.rsListerSynced = rsInformer.Informer().HasSynced
    dc.podListerSynced = podInformer.Informer().HasSynced
    return dc, nil
}

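Startup

The key part is the for loop: based on the value of workers (taken from the kcm startup flag concurrent-deployment-syncs), it starts that many goroutines running dc.worker, which in turn calls the core handler dc.syncDeployment described above.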

// Run begins watching and syncing.
func (dc *DeploymentController) Run(ctx context.Context, workers int) {
    defer utilruntime.HandleCrash()
    defer dc.queue.ShutDown()

    klog.InfoS("Starting controller", "controller", "deployment")
    defer klog.InfoS("Shutting down controller", "controller", "deployment")

    if !cache.WaitForNamedCacheSync("deployment", ctx.Done(), dc.dListerSynced, dc.rsListerSynced, dc.podListerSynced) {
        return
    }

    for i := 0; i < workers; i++ {
        go wait.UntilWithContext(ctx, dc.worker, time.Second)
    }

    <-ctx.Done()
}

// worker runs a worker thread that just dequeues items, processes them, and marks them done.
// It enforces that the syncHandler is never invoked concurrently with the same key.
func (dc *DeploymentController) worker(ctx context.Context) {
    for dc.processNextWorkItem(ctx) {
    }
}

func (dc *DeploymentController) processNextWorkItem(ctx context.Context) bool {
    key, quit := dc.queue.Get()
    if quit {
        return false
    }
    defer dc.queue.Done(key)

    err := dc.syncHandler(ctx, key.(string))
    // If err != nil, handleErr tries to requeue the key.
    dc.handleErr(err, key)

    return true
}

  • workers: controllerContext.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs
  • dc.syncHandler: handles resource changes; the same key is never processed concurrently (guaranteed by the implementation of dc.queue)
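
The "same key is never processed concurrently" guarantee can be sketched as follows. This is a simplified, single-goroutine stand-in written for illustration, not the client-go implementation: dedupQueue and its methods are hypothetical names, and locking and blocking are left out. It shows one way a queue can drop duplicates and defer a key that arrives while that key is being processed.

```go
package main

import "fmt"

// dedupQueue is a hypothetical stand-in for the de-duplication rules of a
// controller work queue (no locking, no blocking Get).
type dedupQueue struct {
	order      []string
	dirty      map[string]bool // keys that still need processing
	processing map[string]bool // keys currently handed to a worker
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add enqueues key unless it is already waiting. A key that is being
// processed is only marked dirty and re-queued when Done is called.
func (q *dedupQueue) Add(key string) {
	if q.dirty[key] {
		return
	}
	q.dirty[key] = true
	if q.processing[key] {
		return
	}
	q.order = append(q.order, key)
}

// Get hands the oldest waiting key to a worker.
func (q *dedupQueue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	q.processing[key] = true
	delete(q.dirty, key)
	return key, true
}

// Done marks key as finished and re-queues it if it became dirty meanwhile.
func (q *dedupQueue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.order = append(q.order, key)
	}
}

// Len reports how many keys are waiting to be handed out.
func (q *dedupQueue) Len() int { return len(q.order) }

func main() {
	q := newDedupQueue()
	q.Add("default/web")
	q.Add("default/web") // duplicate while waiting: dropped
	key, _ := q.Get()
	q.Add(key)           // arrives during processing: deferred
	fmt.Println(q.Len()) // prints "0"
	q.Done(key)
	fmt.Println(q.Len()) // prints "1"
}
```

Because a key is handed out again only after Done, two workers can never hold the same deployment key at once, however many events arrive in between.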


DeploymentController.queue

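The work queue. Informer callbacks add items and dc.worker removes them. The implementation is client-go/util/workqueue.rateLimitingType, which supports rate limiting, delayed adds, and de-duplication. It is composed of the following parts:

  • client-go/util/workqueue.delayingType: implements delays. It maintains a min-heap; once an item's delay has elapsed, the item is taken out and added to the work queue
  • client-go/util/workqueue.Type: the work queue itself. It drops an item that is already waiting in the queue and ensures that a given item is being processed by at most one worker at a time
  • client-go/util/workqueue.MaxOfRateLimiter: rate limiting. It combines several limiters and returns the worst-case (longest) delay
  • client-go/util/workqueue.ItemExponentialFailureRateLimiter: delays an item by baseDelay * 2^(number of failures), capped at maxDelay
  • client-go/util/workqueue.BucketRateLimiter: a wrapper around golang.org/x/time/rate.Limiter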
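
The per-item backoff formula and the "worst case wins" rule above can be worked through with a small sketch. The functions backoff and maxOf are hypothetical helpers written for this illustration, and the 5ms base and 1000s cap in main are assumed example values, not something the text above states.

```go
package main

import (
	"fmt"
	"time"
)

// backoff returns baseDelay * 2^failures, capped at maxDelay.
func backoff(baseDelay, maxDelay time.Duration, failures int) time.Duration {
	delay := baseDelay
	// Stop doubling as soon as the cap is reached to avoid overflow.
	for i := 0; i < failures && delay < maxDelay; i++ {
		delay *= 2
	}
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}

// maxOf returns the longest of several delays, which is how a limiter
// built from several limiters can pick the worst-case answer.
func maxOf(delays ...time.Duration) time.Duration {
	var worst time.Duration
	for _, d := range delays {
		if d > worst {
			worst = d
		}
	}
	return worst
}

func main() {
	base, limit := 5*time.Millisecond, 1000*time.Second // assumed example values
	for _, failures := range []int{0, 1, 2, 10, 30} {
		fmt.Println(failures, backoff(base, limit, failures))
	}
	// Combine the per-item backoff with some other limiter's delay.
	fmt.Println(maxOf(backoff(base, limit, 3), 100*time.Millisecond))
}
```

A key that keeps failing is therefore retried less and less often, while a key that succeeds can have its failure count reset and go back to the base delay.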

Enqueueing work

The following events enqueue a Deployment.

  • addDeployment(obj interface{}): a Deployment was created
  • updateDeployment(old, cur interface{}): a Deployment was updated (including its Status)
  • deleteDeployment(obj interface{}): a Deployment was deleted
  • addReplicaSet(obj interface{}): a ReplicaSet was created. If it has an owning Deployment, that Deployment is enqueued; otherwise every Deployment matched through rs.Labels is enqueued
  • updateReplicaSet(old, cur interface{}): a ReplicaSet was updated (a scale operation or a Status update). Depending on old and cur, the owning Deployment is enqueued
  • deleteReplicaSet(obj interface{}): a ReplicaSet was deleted; the owning Deployment is enqueued
  • deletePod(obj interface{}): a Pod was deleted; if the owning Deployment uses the Recreate update strategy, that Deployment is enqueued
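
The addReplicaSet rule (owner first, label match as a fallback) can be sketched like this. The types and the function deploymentsToEnqueue are hypothetical stand-ins that use plain maps instead of the real API objects and selectors; treating an empty selector as matching nothing is an assumption of this sketch.

```go
package main

import "fmt"

// Minimal stand-ins for the real API objects (illustration only).
type deployment struct {
	name     string
	selector map[string]string
}

type replicaSet struct {
	owner  string // name of the controlling Deployment, "" for an orphan
	labels map[string]string
}

// matches reports whether every selector entry is present in labels.
// Assumption: an empty selector matches nothing rather than everything.
func matches(selector, labels map[string]string) bool {
	if len(selector) == 0 {
		return false
	}
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// deploymentsToEnqueue returns the owner if there is one, otherwise every
// Deployment whose selector matches the ReplicaSet's labels.
func deploymentsToEnqueue(rs replicaSet, all []deployment) []string {
	if rs.owner != "" {
		return []string{rs.owner}
	}
	var names []string
	for _, d := range all {
		if matches(d.selector, rs.labels) {
			names = append(names, d.name)
		}
	}
	return names
}

func main() {
	all := []deployment{
		{name: "web", selector: map[string]string{"app": "web"}},
		{name: "api", selector: map[string]string{"app": "api"}},
	}
	owned := replicaSet{owner: "api", labels: map[string]string{"app": "api"}}
	orphan := replicaSet{labels: map[string]string{"app": "web", "tier": "fe"}}
	fmt.Println(deploymentsToEnqueue(owned, all))  // prints "[api]"
	fmt.Println(deploymentsToEnqueue(orphan, all)) // prints "[web]"
}
```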

Event handling

Concepts:

  • old/new replicaset: the code splits the replicasets managed by a deployment into old and new, distinguished by spec.template. The replicaset whose template equals the current deployment's template is the new one (if several match, the one created earliest is used); all others are old.
  • active: a replicaset with spec.replicas > 0 is considered active.
  • saturated: a replicaset is saturated when desired == rs.spec.replicas == d.spec.replicas == rs.status.AvailableReplicas.
  • ready pod
    A pod is ready when, in its .status.conditions, the condition of type Ready has status True.

apiVersion: v1
kind: Pod
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-08-04T08:47:03Z"
    status: "True"
    type: Ready
  ...
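
The readiness rule shown in the YAML above can be expressed as a small check. This is a minimal sketch with a hypothetical condition type standing in for the real pod condition struct; it only mirrors the rule that the condition of type Ready must have status True.

```go
package main

import "fmt"

// condition is a stand-in for one entry of .status.conditions.
type condition struct {
	Type   string
	Status string
}

// isPodReady reports whether the Ready condition exists and is "True".
func isPodReady(conditions []condition) bool {
	for _, c := range conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false // no Ready condition at all: treat as not ready
}

func main() {
	ready := []condition{{Type: "Initialized", Status: "True"}, {Type: "Ready", Status: "True"}}
	notReady := []condition{{Type: "Ready", Status: "False"}}
	fmt.Println(isPodReady(ready), isPodReady(notReady), isPodReady(nil)) // prints "true false false"
}
```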

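When is a container in a pod ready? The kubelet probes the container according to its configured readiness probe and, once the probe succeeds, updates the pod status to mark the container as ready. A YAML example: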

apiVersion: v1
kind: Pod
...
status:
  ...
  containerStatuses:
  - containerID: xxx
    image: xxx
    imageID: xxx
    lastState: {}
    name: test
    ready: true
    ...

Logic

The main logic is concentrated in the sync step (deployment_controller.go:syncDeployment).

func (dc *DeploymentController) syncDeployment(ctx context.Context, key string) error {
    namespace, name, err := cache.SplitMetaNamespaceKey(key)
    if err != nil {
        klog.ErrorS(err, "Failed to split meta namespace cache key", "cacheKey", key)
        return err
    }

    startTime := time.Now()
    klog.V(4).InfoS("Started syncing deployment", "deployment", klog.KRef(namespace, name), "startTime", startTime)
    defer func() {
        klog.V(4).InfoS("Finished syncing deployment", "deployment", klog.KRef(namespace, name), "duration", time.Since(startTime))
    }()

    deployment, err := dc.dLister.Deployments(namespace).Get(name)
    if errors.IsNotFound(err) {
        klog.V(2).InfoS("Deployment has been deleted", "deployment", klog.KRef(namespace, name))
        return nil
    }
    if err != nil {
        return err
    }

    // Deep-copy otherwise we are mutating our cache.
    // TODO: Deep-copy only when needed.
    d := deployment.DeepCopy()

    everything := metav1.LabelSelector{}
    if reflect.DeepEqual(d.Spec.Selector, &everything) {
        dc.eventRecorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all pods. A non-empty selector is required.")
        if d.Status.ObservedGeneration < d.Generation {
            d.Status.ObservedGeneration = d.Generation
            dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(ctx, d, metav1.UpdateOptions{})
        }
        return nil
    }

    // List ReplicaSets owned by this Deployment, while reconciling ControllerRef
    // through adoption/orphaning.
    rsList, err := dc.getReplicaSetsForDeployment(ctx, d)
    if err != nil {
        return err
    }
    // List all Pods owned by this Deployment, grouped by their ReplicaSet.
    // Current uses of the podMap are:
    //
    // * check if a Pod is labeled correctly with the pod-template-hash label.
    // * check that no old Pods are running in the middle of Recreate Deployments.
    podMap, err := dc.getPodMapForDeployment(d, rsList)
    if err != nil {
        return err
    }

    if d.DeletionTimestamp != nil {
        return dc.syncStatusOnly(ctx, d, rsList)
    }

    // Update deployment conditions with an Unknown condition when pausing/resuming
    // a deployment. In this way, we can be sure that we won't timeout when a user
    // resumes a Deployment with a set progressDeadlineSeconds.
    if err = dc.checkPausedConditions(ctx, d); err != nil {
        return err
    }

    if d.Spec.Paused {
        return dc.sync(ctx, d, rsList)
    }

    // rollback is not re-entrant in case the underlying replica sets are updated with a new
    // revision so we should ensure that we won't proceed to update replica sets until we
    // make sure that the deployment has cleaned up its rollback spec in subsequent enqueues.
    if getRollbackTo(d) != nil {
        return dc.rollback(ctx, d, rsList)
    }

    scalingEvent, err := dc.isScalingEvent(ctx, d, rsList)
    if err != nil {
        return err
    }
    if scalingEvent {
        return dc.sync(ctx, d, rsList)
    }

    switch d.Spec.Strategy.Type {
    case apps.RecreateDeploymentStrategyType:
        return dc.rolloutRecreate(ctx, d, rsList, podMap)
    case apps.RollingUpdateDeploymentStrategyType:
        return dc.rolloutRolling(ctx, d, rsList)
    }
    return fmt.Errorf("unexpected deployment strategy type: %s", d.Spec.Strategy.Type)
}

Main logic:

  • record the current time on entry and define a defer function that computes the total execution time, that is, how long one sync of a deployment takes;
  • get the deployment object by its namespace and name;
  • call dc.getReplicaSetsForDeployment: process all replicasets in the deployment's namespace. A replicaset that matches but has no owning deployment is adopted by setting its ownerReferences; one that is owned but no longer matches has the corresponding ownerReferences entry removed. The result is the list of ReplicaSets that match and are owned by the Deployment;
  • call dc.getPodMapForDeployment: using the deployment's selector, get the pods associated with the deployment and group them by the UID of the owning replicaset. The return type is map[types.UID][]*v1.Pod;
  • if the deployment's DeletionTimestamp is set, call dc.syncStatusOnly, which recomputes the deployment's status from its replicasets and updates it, then return without going further;
  • call dc.checkPausedConditions: when the deployment is paused or resumed, update its status with the pause-related condition;
  • if the deployment's .Spec.Paused is true, call dc.sync and return;
  • call getRollbackTo to check whether the deployment's annotations contain the key deprecated.deployment.rollback.to; if the key is present with a non-empty value, call dc.rollback to perform the rollback;
  • call dc.isScalingEvent: if the deployment is being scaled, call dc.sync to handle the scaling and return;
  • check the deployment's update strategy: for Recreate, call dc.rolloutRecreate to perform a recreate update; for RollingUpdate, call dc.rolloutRolling to perform a rolling update.


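As shown, deletion, pause and resume, scaling, and updates of a deployment are all handled in syncDeployment, which ultimately calls syncStatusOnly, sync, rollback, rolloutRecreate, or rolloutRolling. syncStatusOnly and sync both update the Deployment's Status, rollback performs a rollback, and rolloutRecreate and rolloutRolling update the Deployment according to its update strategy. The sections below look at how these operations are implemented.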

syncDeployment also shows the priority of these operations:
delete > pause > rollback > scale > rollout
For example, a pause can be issued during a rollout, and a paused deployment can be deleted directly.
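
The priority order can be read as a chain of early returns. The sketch below is a condensed illustration, not the controller's code: deploymentState and nextAction are hypothetical names, and the returned strings are simply the names of the methods that syncDeployment would call.

```go
package main

import "fmt"

// deploymentState is a stand-in for the facts syncDeployment inspects.
type deploymentState struct {
	Deleting          bool   // DeletionTimestamp is set
	Paused            bool   // .Spec.Paused
	RollbackRequested bool   // rollback annotation is present
	Scaling           bool   // isScalingEvent returned true
	Strategy          string // "Recreate" or "RollingUpdate"
}

// nextAction returns the name of the method that would handle this state,
// following delete > pause > rollback > scale > rollout.
func nextAction(s deploymentState) string {
	switch {
	case s.Deleting:
		return "syncStatusOnly"
	case s.Paused:
		return "sync"
	case s.RollbackRequested:
		return "rollback"
	case s.Scaling:
		return "sync"
	case s.Strategy == "Recreate":
		return "rolloutRecreate"
	case s.Strategy == "RollingUpdate":
		return "rolloutRolling"
	}
	return "error: unexpected strategy"
}

func main() {
	// A paused deployment that is being deleted is handled as a delete.
	fmt.Println(nextAction(deploymentState{Deleting: true, Paused: true})) // prints "syncStatusOnly"
	// While paused, a pending rollback is not executed.
	fmt.Println(nextAction(deploymentState{Paused: true, RollbackRequested: true})) // prints "sync"
	fmt.Println(nextAction(deploymentState{Strategy: "RollingUpdate"}))             // prints "rolloutRolling"
}
```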

getReplicaSetsForDeployment

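dc.getReplicaSetsForDeployment fetches the ReplicaSets in the cluster that relate to the Deployment. A replicaset that matches but has no owning deployment is linked to the deployment by setting its ownerReferences; one that is linked but no longer matches has the corresponding ownerReferences entry removed.

The main steps are:
(1) list all replicasets in the deployment's namespace;
(2) call cm.ClaimReplicaSets to process them and return the list of replicasets that match and are owned by the deployment.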

func (dc *DeploymentController) getReplicaSetsForDeployment(ctx context.Context, d *apps.Deployment) ([]*apps.ReplicaSet, error) {
    // List all ReplicaSets to find those we own but that no longer match our
    // selector. They will be orphaned by ClaimReplicaSets().
    rsList, err := dc.rsLister.ReplicaSets(d.Namespace).List(labels.Everything())
    if err != nil {
        return nil, err
    }
    deploymentSelector, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
    if err != nil {
        return nil, fmt.Errorf("deployment %s/%s has invalid label selector: %v", d.Namespace, d.Name, err)
    }
    // If any adoptions are attempted, we should first recheck for deletion with
    // an uncached quorum read sometime after listing ReplicaSets (see #42639).
    canAdoptFunc := controller.RecheckDeletionTimestamp(func(ctx context.Context) (metav1.Object, error) {
        fresh, err := dc.client.AppsV1().Deployments(d.Namespace).Get(ctx, d.Name, metav1.GetOptions{})
        if err != nil {
            return nil, err
        }
        if fresh.UID != d.UID {
            return nil, fmt.Errorf("original Deployment %v/%v is gone: got uid %v, wanted %v", d.Namespace, d.Name, fresh.UID, d.UID)
        }
        return fresh, nil
    })
    cm := controller.NewReplicaSetControllerRefManager(dc.rsControl, d, deploymentSelector, controllerKind, canAdoptFunc)
    return cm.ClaimReplicaSets(ctx, rsList)
}

ClaimReplicaSets

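ClaimReplicaSets iterates over all replicasets in the deployment's namespace and calls m.ClaimObject on each. m.ClaimObject links a replicaset that matches but has no owning deployment by setting its ownerReferences, and removes the ownerReferences entry from one that is owned but no longer matches.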

// pkg/controller/controller_ref_manager.go
func (m *ReplicaSetControllerRefManager) ClaimReplicaSets(sets []*apps.ReplicaSet) ([]*apps.ReplicaSet, error) {
    var claimed []*apps.ReplicaSet
    var errlist []error

    match := func(obj metav1.Object) bool {
        return m.Selector.Matches(labels.Set(obj.GetLabels()))
    }
    adopt := func(obj metav1.Object) error {
        return m.AdoptReplicaSet(obj.(*apps.ReplicaSet))
    }
    release := func(obj metav1.Object) error {
        return m.ReleaseReplicaSet(obj.(*apps.ReplicaSet))
    }

    for _, rs := range sets {
        ok, err := m.ClaimObject(rs, match, adopt, release)
        if err != nil {
            errlist = append(errlist, err)
            continue
        }
        if ok {
            claimed = append(claimed, rs)
        }
    }
    return claimed, utilerrors.NewAggregate(errlist)
}
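
The adopt/release decision described for m.ClaimObject can be summarized as a small decision table. This is a simplified sketch under stated assumptions: claim and claimResult are hypothetical names, UIDs are plain strings, and the deletion-timestamp checks that a real controller performs before adopting or releasing are left out.

```go
package main

import "fmt"

type claimResult string

const (
	claimKeep    claimResult = "keep"    // already ours and still matching
	claimAdopt   claimResult = "adopt"   // matching orphan: set ownerReferences
	claimRelease claimResult = "release" // ours but no longer matching: remove ownerReferences
	claimIgnore  claimResult = "ignore"  // neither ours nor adoptable
)

// claim decides what to do with one ReplicaSet.
// ownerUID is "" when the ReplicaSet has no controlling owner.
func claim(ownerUID, deploymentUID string, selectorMatches bool) claimResult {
	switch {
	case ownerUID == "" && selectorMatches:
		return claimAdopt
	case ownerUID == "":
		return claimIgnore
	case ownerUID != deploymentUID:
		return claimIgnore // controlled by some other object
	case selectorMatches:
		return claimKeep
	default:
		return claimRelease
	}
}

func main() {
	fmt.Println(claim("", "d-1", true))     // prints "adopt"
	fmt.Println(claim("d-1", "d-1", true))  // prints "keep"
	fmt.Println(claim("d-1", "d-1", false)) // prints "release"
	fmt.Println(claim("d-2", "d-1", true))  // prints "ignore"
}
```

Only the "keep" and "adopt" outcomes put a ReplicaSet into the claimed list that is returned to the caller.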

dc.getPodMapForDeployment

dc.getPodMapForDeployment: using the deployment's Selector, get the pods associated with the deployment and group them by the UID of the owning replicaset. The return type is map[types.UID][]*v1.Pod.

// pkg/controller/deployment/deployment_controller.go
func (dc *DeploymentController) getPodMapForDeployment(d *apps.Deployment, rsList []*apps.ReplicaSet) (map[types.UID][]*v1.Pod, error) {
    // Get all Pods that potentially belong to this Deployment.
    selector, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
    if err != nil {
        return nil, err
    }
    pods, err := dc.podLister.Pods(d.Namespace).List(selector)
    if err != nil {
        return nil, err
    }
    // Group Pods by their controller (if it's in rsList).
    podMap := make(map[types.UID][]*v1.Pod, len(rsList))
    for _, rs := range rsList {
        podMap[rs.UID] = []*v1.Pod{}
    }
    for _, pod := range pods {
        // Do not ignore inactive Pods because Recreate Deployments need to verify that no
        // Pods from older versions are running before spinning up new Pods.
        controllerRef := metav1.GetControllerOf(pod)
        if controllerRef == nil {
            continue
        }
        // Only append if we care about this UID.
        if _, ok := podMap[controllerRef.UID]; ok {
            podMap[controllerRef.UID] = append(podMap[controllerRef.UID], pod)
        }
    }
    return podMap, nil
}

dc.syncStatusOnly

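The first thing syncDeployment handles is deletion. Deletion is initiated by the client and first results in the DeletionTimestamp field being set in the object's metadata. If the deployment's DeletionTimestamp is set, the controller calls dc.syncStatusOnly, which recomputes the deployment's status from its replicasets and updates it, then returns without going further.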

// pkg/controller/deployment/sync.go
func (dc *DeploymentController) syncStatusOnly(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
    newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
    if err != nil {
        return err
    }

    allRSs := append(oldRSs, newRS)
    return dc.syncDeploymentStatus(allRSs, newRS, d)
}

// pkg/controller/deployment/sync.go
func (dc *DeploymentController) syncDeploymentStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, d *apps.Deployment) error {
    newStatus := calculateStatus(allRSs, newRS, d)

    if reflect.DeepEqual(d.Status, newStatus) {
        return nil
    }

    newDeployment := d
    newDeployment.Status = newStatus
    _, err := dc.client.AppsV1().Deployments(newDeployment.Namespace).UpdateStatus(newDeployment)
    return err
}

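syncDeploymentStatus first computes the deployment's current status from newRS and allRSs, then compares it with the status stored on the deployment; if they differ, it updates the deployment with the new status. syncDeploymentStatus is used by several of the operations described later. How the status is computed can be seen in calculateStatus, shown below, which derives the latest status from allRSs and the deployment's state.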

//k8s.io/kubernetes/pkg/controller/deployment/sync.go:483

func calculateStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) apps.DeploymentStatus {
    availableReplicas := deploymentutil.GetAvailableReplicaCountForReplicaSets(allRSs)
    totalReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
    unavailableReplicas := totalReplicas - availableReplicas
 
    if unavailableReplicas < 0 {
        unavailableReplicas = 0
    }
 
    status := apps.DeploymentStatus{
        ObservedGeneration:  deployment.Generation,
        Replicas:            deploymentutil.GetActualReplicaCountForReplicaSets(allRSs),
        UpdatedReplicas:     deploymentutil.GetActualReplicaCountForReplicaSets([]*apps.ReplicaSet{newRS}),
        ReadyReplicas:       deploymentutil.GetReadyReplicaCountForReplicaSets(allRSs),
        AvailableReplicas:   availableReplicas,
        UnavailableReplicas: unavailableReplicas,
        CollisionCount:      deployment.Status.CollisionCount,
    }
 
    conditions := deployment.Status.Conditions
    for i := range conditions {
        status.Conditions = append(status.Conditions, conditions[i])
    }
 
    ......
    return status
}
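
The replica arithmetic in calculateStatus can be checked with a small worked example. The sketch below uses hypothetical stand-in types; it assumes, based on the helper names in the code above, that the total used for UnavailableReplicas sums the desired (spec) replicas while Replicas sums the observed (status) replicas.

```go
package main

import "fmt"

// rsCounts is a stand-in for the fields of one ReplicaSet used here.
type rsCounts struct {
	SpecReplicas   int32 // desired replicas
	StatusReplicas int32 // observed replicas
	Ready          int32
	Available      int32
}

// statusCounts is a stand-in for the numeric part of DeploymentStatus.
type statusCounts struct {
	Replicas, Updated, Ready, Available, Unavailable int32
}

// summarize aggregates the counters of all ReplicaSets of a deployment.
func summarize(all []rsCounts, newRS rsCounts) statusCounts {
	var s statusCounts
	var desiredTotal int32
	for _, rs := range all {
		desiredTotal += rs.SpecReplicas
		s.Replicas += rs.StatusReplicas
		s.Ready += rs.Ready
		s.Available += rs.Available
	}
	s.Updated = newRS.StatusReplicas
	// Unavailable is never reported as negative.
	s.Unavailable = desiredTotal - s.Available
	if s.Unavailable < 0 {
		s.Unavailable = 0
	}
	return s
}

func main() {
	oldRS := rsCounts{SpecReplicas: 2, StatusReplicas: 2, Ready: 2, Available: 2}
	newRS := rsCounts{SpecReplicas: 3, StatusReplicas: 3, Ready: 2, Available: 1}
	fmt.Printf("%+v\n", summarize([]rsCounts{oldRS, newRS}, newRS))
	// prints "{Replicas:5 Updated:3 Ready:4 Available:3 Unavailable:2}"
}
```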

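As the code above shows, when a deployment is deleted the controller only checks whether metadata.DeletionTimestamp is set and then performs one status sync; nothing here deletes the deployment, rs, or pod objects. The actual deletion does not happen here. It is carried out by the garbage collector in kube-controller-manager (the garbagecollector controller), which will be covered in a later article. In addition, a deletion has to specify a propagation option (orphan, background, or foreground) that determines how the object is deleted.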

dc.rollback

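Every Deployment in kubernetes has the notion of a revision. Its .spec.revisionHistoryLimit field sets how many old revisions to keep (default 10), and each revision corresponds to one rs. If you see many 0/0 rs objects in the cluster, do not delete them: they are the deployment's historical revisions, and deleting them makes it impossible to roll back to those revisions. Once the number of historical rs objects exceeds the limit, the deployment controller cleans them up automatically.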
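
The automatic cleanup can be pictured as "keep the newest N old revisions, delete the rest if they are empty". The sketch below is a simplified illustration with hypothetical names (oldRS, toCleanup); ordering by revision number and skipping ReplicaSets that still have replicas are assumptions of this sketch, and a real controller applies further safety checks.

```go
package main

import (
	"fmt"
	"sort"
)

// oldRS is a stand-in for an old ReplicaSet of a deployment.
type oldRS struct {
	Name     string
	Revision int64
	Replicas int32
}

// toCleanup returns the names of the old ReplicaSets that exceed the
// history limit, oldest revision first. ReplicaSets that still have
// replicas are never selected.
func toCleanup(old []oldRS, historyLimit int) []string {
	excess := len(old) - historyLimit
	if excess <= 0 {
		return nil
	}
	sorted := append([]oldRS(nil), old...) // do not reorder the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Revision < sorted[j].Revision })

	var names []string
	for _, rs := range sorted[:excess] {
		if rs.Replicas != 0 {
			continue // still running pods: keep it
		}
		names = append(names, rs.Name)
	}
	return names
}

func main() {
	old := []oldRS{
		{Name: "web-c", Revision: 3},
		{Name: "web-a", Revision: 1},
		{Name: "web-d", Revision: 4},
		{Name: "web-b", Revision: 2, Replicas: 1},
	}
	fmt.Println(toCleanup(old, 2)) // prints "[web-a]"
}
```

In the example, web-b is among the two oldest revisions but is skipped because it still has a replica.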

When a rollback is triggered from the client, the controller calls getRollbackTo to detect it and then calls rollback to carry it out.

// pkg/controller/deployment/deployment_controller.go
    if getRollbackTo(d) != nil {
        return dc.rollback(ctx, d, rsList)
    }

It first calls getRollbackTo to check whether the deployment's annotations contain the key deprecated.deployment.rollback.to; if the key is present with a non-empty value, it calls dc.rollback to perform the rollback.

// pkg/controller/deployment/rollback.go
func getRollbackTo(d *apps.Deployment) *extensions.RollbackConfig {
    // Extract the annotation used for round-tripping the deprecated RollbackTo field.
    revision := d.Annotations[apps.DeprecatedRollbackTo]
    if revision == "" {
        return nil
    }
    revision64, err := strconv.ParseInt(revision, 10, 64)
    if err != nil {
        // If it's invalid, ignore it.
        return nil
    }
    return &extensions.RollbackConfig{
        Revision: revision64,
    }
}

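The main logic of dc.rollback:

  • get the list of all replicasets that match and are owned by the deployment;
  • get the Revision to roll back to;
  • iterate over that list and compare each Revision with the target; on a match, call dc.rollbackToTemplate to perform the rollback (essentially, update the deployment's .Spec.Template from the replicaset of that Revision).

// pkg/controller/deployment/rollback.go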
func (dc *DeploymentController) rollback(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
    newRS, allOldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
    if err != nil {
        return err
    }

    allRSs := append(allOldRSs, newRS)
    rollbackTo := getRollbackTo(d)
    // If rollback revision is 0, rollback to the last revision
    if rollbackTo.Revision == 0 {
        if rollbackTo.Revision = deploymentutil.LastRevision(allRSs); rollbackTo.Revision == 0 {
            // If we still can't find the last revision, gives up rollback
            dc.emitRollbackWarningEvent(d, deploymentutil.RollbackRevisionNotFound, "Unable to find last revision.")
            // Gives up rollback
            return dc.updateDeploymentAndClearRollbackTo(d)
        }
    }
    for _, rs := range allRSs {
        v, err := deploymentutil.Revision(rs)
        if err != nil {
            klog.V(4).Infof("Unable to extract revision from deployment's replica set %q: %v", rs.Name, err)
            continue
        }
        if v == rollbackTo.Revision {
            klog.V(4).Infof("Found replica set %q with desired revision %d", rs.Name, v)
            // rollback by copying podTemplate.Spec from the replica set
            // revision number will be incremented during the next getAllReplicaSetsAndSyncRevision call
            // no-op if the spec matches current deployment's podTemplate.Spec
            performedRollback, err := dc.rollbackToTemplate(d, rs)
            if performedRollback && err == nil {
                dc.emitRollbackNormalEvent(d, fmt.Sprintf("Rolled back deployment %q to revision %d", d.Name, rollbackTo.Revision))
            }
            return err
        }
    }
    dc.emitRollbackWarningEvent(d, deploymentutil.RollbackRevisionNotFound, "Unable to find the revision to rollback to.")
    // Gives up rollback
    return dc.updateDeploymentAndClearRollbackTo(d)
}

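rollbackToTemplate checks whether deployment.Spec.Template and rs.Spec.Template are equal. If they are, no rollback is needed; otherwise it replaces deployment.Spec.Template with rs.Spec.Template. In both cases it then updates the deployment and clears the rollback marker.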

//k8s.io/kubernetes/pkg/controller/deployment/rollback.go:75

func (dc *DeploymentController) rollbackToTemplate(d *apps.Deployment, rs *apps.ReplicaSet) (bool, error) {
    performedRollback := false
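    // 1. Check whether d.Spec.Template and rs.Spec.Template differ.
    if !deploymentutil.EqualIgnoreHash(&d.Spec.Template, &rs.Spec.Template) {
        // 2. Replace d.Spec.Template with the template of the target rs.
        deploymentutil.SetFromReplicaSetTemplate(d, rs.Spec.Template)

        // 3. Copy the annotations of the target rs back to the deployment.
        deploymentutil.SetDeploymentAnnotationsTo(d, rs)
        performedRollback = true
    } else {
        // The templates are identical: nothing to roll back, only emit a warning event.
        eventMsg := fmt.Sprintf("The rollback revision contains the same template as current deployment %q", d.Name)
        dc.emitRollbackWarningEvent(d, deploymentutil.RollbackTemplateUnchanged, eventMsg)
    }

    // 4. Update the deployment and clear the rollback marker.
    return performedRollback, dc.updateDeploymentAndClearRollbackTo(d)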
}

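A rollback therefore amounts to finding the rs for the requested revision and replacing deployment.Spec.Template with rs.Spec.Template; driving the replicaSet and pods to the desired state then completes the rollback. In the latest versions, specifying the rollback revision through an annotation is being deprecated.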
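
The revision lookup in dc.rollback, including the "revision 0 means the previous revision" rule, can be sketched on plain numbers. lastRevision and resolveRollbackTarget are hypothetical helpers for illustration; taking the second-highest revision as "the last revision" is this sketch's reading of that rule.

```go
package main

import "fmt"

// lastRevision returns the second-highest revision, i.e. the one that was
// current before the latest rollout, or 0 if there is none.
func lastRevision(revisions []int64) int64 {
	var highest, second int64
	for _, v := range revisions {
		if v >= highest {
			second, highest = highest, v
		} else if v > second {
			second = v
		}
	}
	return second
}

// resolveRollbackTarget maps the requested revision to an existing one.
// A request for revision 0 means "roll back to the previous revision".
// The boolean is false when no suitable revision exists.
func resolveRollbackTarget(requested int64, revisions []int64) (int64, bool) {
	if requested == 0 {
		requested = lastRevision(revisions)
		if requested == 0 {
			return 0, false
		}
	}
	for _, v := range revisions {
		if v == requested {
			return v, true
		}
	}
	return 0, false
}

func main() {
	revisions := []int64{1, 3, 2}
	fmt.Println(resolveRollbackTarget(0, revisions)) // prints "2 true"
	fmt.Println(resolveRollbackTarget(1, revisions)) // prints "1 true"
	fmt.Println(resolveRollbackTarget(7, revisions)) // prints "0 false"
}
```

When the lookup fails, the controller gives up on the rollback and clears the rollback marker, as the rollback code above does by emitting a warning event.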

dc.sync

Next, dc.sync. syncDeployment calls dc.sync and then returns immediately in two cases:

  • the deployment's .Spec.Paused is true;
  • dc.isScalingEvent reports that the deployment is being scaled.

func (dc *DeploymentController) syncDeployment(key string) error {
    ......
    // pause
    if d.Spec.Paused {
        return dc.sync(d, rsList)
    }

    if getRollbackTo(d) != nil {
        return dc.rollback(d, rsList)
    }

    // scale
    scalingEvent, err := dc.isScalingEvent(d, rsList)
    if err != nil {
        return err
    }
    if scalingEvent {
        return dc.sync(d, rsList)
    }
    ......
}

About the Paused field
When a deployment's .Spec.Paused is true the deployment is paused; false means it is in its normal state. While a deployment is paused, no change to its PodTemplateSpec triggers an update; the update is triggered only once .Spec.Paused is set back to false.

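The main logic of dc.sync:

  • call dc.getAllReplicaSetsAndSyncRevision to get the newest replicaset and the list of old replicasets;
  • call dc.scale, which decides whether scaling is needed and performs it if so;
  • when the deployment's .Spec.Paused is true and no rollback is pending, call dc.cleanupDeployment, which deletes the oldest old replicasets according to the configured revision history limit (.Spec.RevisionHistoryLimit) and the replicasets' creation time;
  • call dc.syncDeploymentStatus to compute and update the deployment's status.

// pkg/controller/deployment/sync.go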
// sync is responsible for reconciling deployments on scaling events or when they
// are paused.
func (dc *DeploymentController) sync(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
    newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
    if err != nil {
        return err
    }
    if err := dc.scale(d, newRS, oldRSs); err != nil {
        // If we get an error while trying to scale, the deployment will be requeued
        // so we can abort this resync
        return err
    }

    // Clean up the deployment when it's paused and no rollback is in flight.
    if d.Spec.Paused && getRollbackTo(d) == nil {
        if err := dc.cleanupDeployment(oldRSs, d); err != nil {
            return err
        }
    }

    allRSs := append(oldRSs, newRS)
    return dc.syncDeploymentStatus(allRSs, newRS, d)
}

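As mentioned above, the operations within one syncLoop of the deployment controller have priorities: pause > rollback > scale > rollout. Pause and resume only take effect during a rollout. Looking at the source, the scale branch of syncDeployment is not reached while the deployment is paused, but pause and scale are both handled by the sync method, and sync first checks whether scaling is complete. In other words, a pause does not stop every operation immediately. For example, if a pause is issued right after a rolling update starts, the first cycle of the rolling update does not stop at once; the deployment becomes paused only after that first cycle has finished. The rolling update section below analyzes an example in detail, and the scale operation is also analyzed below.
The syncDeploymentStatus method and its related code were explained in the deletion section above and are not analyzed again here.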

dc.scale

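dc.scale handles scaling a deployment up and down. Its main logic is:

  • call deploymentutil.FindActiveOrLatest to check whether at most one replicaset is active. If so, take that replicaset (or the newest one when none is active) and compare its replica count with the deployment's desired count; return if they are equal, otherwise call dc.scaleReplicaSetAndRecordEvent to set its replica count to the deployment's desired count;
  • when the newest replicaset already has the deployment's desired replica count and some old replicasets still have replicas, call dc.scaleReplicaSetAndRecordEvent to scale each of those old replicasets down to 0, then return;
  • when the newest replicaset does not yet have the desired replica count, some old replicasets still have replicas, and the update strategy is rolling update, the deployment is probably in the middle of a rolling update. The new and old replicasets are then scaled proportionally to keep the rolling update stable; the comments in the code below outline the steps.

func (dc *DeploymentController) scale(......) error {
    // 1. If only one rs is active, scale it to the deployment's full replica count;
    // if no rs is active, the newest rs is scaled instead.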
    if activeOrLatest := deploymentutil.FindActiveOrLatest(newRS, oldRSs); activeOrLatest != nil {
        if *(activeOrLatest.Spec.Replicas) == *(deployment.Spec.Replicas) {
            return nil
        }
        // 2. Scale the rs to deployment.Spec.Replicas, updating its annotations and recording an event on the deployment.
        _, _, err := dc.scaleReplicaSetAndRecordEvent(activeOrLatest, *(deployment.Spec.Replicas), deployment)
        return err
    }
 
    // 3. If IsSaturated reports that the new rs has reached the deployment's desired
    // replica count, scale every old rs that still holds replicas down to 0.
    if deploymentutil.IsSaturated(deployment, newRS) {
        for _, old := range controller.FilterActiveReplicaSets(oldRSs) {
            if _, _, err := dc.scaleReplicaSetAndRecordEvent(old, 0, deployment); err != nil {
                return err
            }
        }
        return nil
    }
 
    // 4. At this point the new rs has not reached the desired state and several rs objects are active.
    // With a rolling update strategy, each active rs is scaled up or down proportionally.
    if deploymentutil.IsRollingUpdate(deployment) {
        allRSs := controller.FilterActiveReplicaSets(append(oldRSs, newRS))
        allRSsReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
 
        allowedSize := int32(0)
 
        // 5. Compute the maximum number of pods that may exist.
        if *(deployment.Spec.Replicas) > 0 {
            allowedSize = *(deployment.Spec.Replicas) + deploymentutil.MaxSurge(*deployment)
        }
 
        // 6. Compute how many pods need to be added (negative means removed).
        deploymentReplicasToAdd := allowedSize - allRSsReplicas
 
        // 7. The rs list is sorted by size, largest first. On a tie, newer rs objects come
        // first when scaling up (deploymentReplicasToAdd > 0) and older rs objects come
        // first when scaling down (deploymentReplicasToAdd < 0).
        var scalingOperation string
        switch {
        case deploymentReplicasToAdd > 0:
            sort.Sort(controller.ReplicaSetsBySizeNewer(allRSs))
            scalingOperation = "up"
 
        case deploymentReplicasToAdd < 0:
            sort.Sort(controller.ReplicaSetsBySizeOlder(allRSs))
            scalingOperation = "down"
        }
        deploymentReplicasAdded := int32(0)
        nameToSize := make(map[string]int32)
 
        // 8. Iterate over all rs objects and compute the replica count each one should be scaled to.
        for i := range allRSs {
            rs := allRSs[i]
 
            if deploymentReplicasToAdd != 0 {
                // 9. Call GetProportion to estimate how many replicas this rs should gain or lose.
                proportion := deploymentutil.GetProportion(rs, *deployment, deploymentReplicasToAdd, deploymentReplicasAdded)
 
                nameToSize[rs.Name] = *(rs.Spec.Replicas) + proportion
                deploymentReplicasAdded += proportion
            } else {
                nameToSize[rs.Name] = *(rs.Spec.Replicas)
            }
        }
 
        // 10. Walk all RSs again: the first RS in the sorted list absorbs whatever the
        // proportional loop above left undistributed, then every rs.Spec.Replicas is updated
        for i := range allRSs {
            rs := allRSs[i]
 
            // 11. Add or remove the leftover on the first RS, never going below 0
            if i == 0 && deploymentReplicasToAdd != 0 {
                leftover := deploymentReplicasToAdd - deploymentReplicasAdded
                nameToSize[rs.Name] = nameToSize[rs.Name] + leftover
                if nameToSize[rs.Name] < 0 {
                    nameToSize[rs.Name] = 0
                }
            }
 
            // 12. Scale the RS
            if _, _, err := dc.scaleReplicaSet(rs, nameToSize[rs.Name], deployment, scalingOperation); err != nil {
                return err
            }
        }
    }
    return nil
}
  • deploymentutil.FindActiveOrLatest: if exactly one RS is active, return it; if no RS is active, return the latest RS.
    That RS's rs.Spec.Replicas is then scaled to deployment.Spec.Replicas.
  • deploymentutil.IsSaturated: checks whether newRS satisfies rs.Spec.Replicas == deployment.Spec.Replicas && desired == deployment.Spec.Replicas && rs.Status.AvailableReplicas == deployment.Spec.Replicas, where desired is read from the desired-replicas annotation on the RS.
    If it does, all active oldRSs are scaled to 0.
  • deploymentutil.IsRollingUpdate: the update strategy is apps.RollingUpdateDeploymentStrategyType and several RSs are active at this point.
  1. allRSsReplicas = sum of rs.Spec.Replicas over all active RSs
  2. deploymentReplicasToAdd = deployment.Spec.Replicas + maxSurge - allRSsReplicas
  3. Scale all active RSs proportionally
  4. deploymentReplicasAdded = running total of proportion
  5. Compute the proportion of each RS
  • rsFraction = round(rs.Spec.Replicas * (d.Spec.Replicas + MaxSurge(d)) / annotatedReplicas) - rs.Spec.Replicas. If the annotation MaxReplicasAnnotation(rs) exists (it records oldD.Spec.Replicas + MaxSurge(oldD)), then annotatedReplicas = MaxReplicasAnnotation(rs); otherwise annotatedReplicas = d.Status.Replicas
  • If deploymentReplicasToAdd > 0, return min(rsFraction, deploymentReplicasToAdd - deploymentReplicasAdded)
  • If deploymentReplicasToAdd < 0, return max(rsFraction, deploymentReplicasToAdd - deploymentReplicasAdded)
    rs.Spec.Replicas = rs.Spec.Replicas + proportion

"Compute the proportion of each RS" is implemented by GetProportion, which estimates how many replicas an RS should gain or lose. Its logic is shown below:

//k8s.io/kubernetes/pkg/controller/deployment/util/deployment_util.go:466
func GetProportion(rs *apps.ReplicaSet, d apps.Deployment, deploymentReplicasToAdd, deploymentReplicasAdded int32) int32 {
    if rs == nil || *(rs.Spec.Replicas) == 0 || deploymentReplicasToAdd == 0 || deploymentReplicasToAdd == deploymentReplicasAdded {
        return int32(0)
    }
 
    // Call getReplicaSetFraction
    rsFraction := getReplicaSetFraction(*rs, d)
    allowed := deploymentReplicasToAdd - deploymentReplicasAdded
 
    if deploymentReplicasToAdd > 0 {
        return integer.Int32Min(rsFraction, allowed)
    }
    return integer.Int32Max(rsFraction, allowed)
}
 
func getReplicaSetFraction(rs apps.ReplicaSet, d apps.Deployment) int32 {
    if *(d.Spec.Replicas) == int32(0) {
        return -*(rs.Spec.Replicas)
    }
 
    deploymentReplicas := *(d.Spec.Replicas) + MaxSurge(d)
    annotatedReplicas, ok := getMaxReplicasAnnotation(&rs)
    if !ok {
        annotatedReplicas = d.Status.Replicas
    }
 
    // Formula for newRSsize
    newRSsize := (float64(*(rs.Spec.Replicas) * deploymentReplicas)) / float64(annotatedReplicas)
 
    // Return the difference between the rounded new size and the current size
    return integer.RoundToInt32(newRSsize) - *(rs.Spec.Replicas)
}
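
The proportional arithmetic is easier to follow with concrete numbers. The following is a minimal sketch, not the controller code: it redoes the same calculation with plain int32 values, and the names fraction, proportion and distribute are hypothetical. It assumes the sizes are already sorted the way the controller sorts them (largest first) and that annotated is the value of the max-replicas annotation.

```go
package main

import (
	"fmt"
	"math"
)

// fraction follows the arithmetic of getReplicaSetFraction with plain integers.
// annotated stands for the max-replicas annotation (old replicas + old maxSurge).
func fraction(rsSize, desired, maxSurge, annotated int32) int32 {
	if desired == 0 {
		return -rsSize
	}
	newSize := float64(rsSize*(desired+maxSurge)) / float64(annotated)
	return int32(math.Round(newSize)) - rsSize
}

// proportion follows GetProportion: the fraction is clamped to what is
// still left to add (or remove).
func proportion(rsSize, frac, toAdd, added int32) int32 {
	if rsSize == 0 || toAdd == 0 || toAdd == added {
		return 0
	}
	allowed := toAdd - added
	if toAdd > 0 {
		if frac < allowed {
			return frac
		}
		return allowed
	}
	if frac > allowed {
		return frac
	}
	return allowed
}

// distribute returns the new size of every ReplicaSet.
func distribute(sizes []int32, desired, maxSurge, annotated int32) []int32 {
	var total, allowedSize, added int32
	for _, s := range sizes {
		total += s
	}
	if desired > 0 {
		allowedSize = desired + maxSurge
	}
	toAdd := allowedSize - total
	out := make([]int32, len(sizes))
	for i, s := range sizes {
		p := proportion(s, fraction(s, desired, maxSurge, annotated), toAdd, added)
		out[i] = s + p
		added += p
	}
	if len(out) > 0 && toAdd != 0 {
		out[0] += toAdd - added // the leftover goes to the first ReplicaSet
		if out[0] < 0 {
			out[0] = 0
		}
	}
	return out
}

func main() {
	// Mid-rollout with sizes 8 and 5 (annotation 13), then scaled from 10 to 20 replicas.
	fmt.Println(distribute([]int32{8, 5}, 20, 3, 13)) // [14 9]
	// The reverse: sizes 14 and 9 (annotation 23), scaled from 20 back to 10.
	fmt.Println(distribute([]int32{14, 9}, 10, 3, 23)) // [8 5]
}
```

In both directions the two ReplicaSets keep roughly the same share of the total, which is the point of scaling proportionally instead of touching only one of them.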


dc.cleanupDeployment

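When all pods of the deployment are updated and available and no old pod is still running, dc.cleanupDeployment is called. Based on the deployment's revision history limit (.Spec.RevisionHistoryLimit) and the creation time of each replicaset, it deletes the oldest replicasets.
It is also called when d.Spec.Paused && getRollbackTo(d) == nil: the oldest inactive RSs are deleted and the newest d.Spec.RevisionHistoryLimit RSs are kept.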

// pkg/controller/deployment/sync.go
func (dc *DeploymentController) cleanupDeployment(oldRSs []*apps.ReplicaSet, deployment *apps.Deployment) error {
    if !deploymentutil.HasRevisionHistoryLimit(deployment) {
        return nil
    }

    // Avoid deleting replica set with deletion timestamp set
    aliveFilter := func(rs *apps.ReplicaSet) bool {
        return rs != nil && rs.ObjectMeta.DeletionTimestamp == nil
    }
    cleanableRSes := controller.FilterReplicaSets(oldRSs, aliveFilter)

    diff := int32(len(cleanableRSes)) - *deployment.Spec.RevisionHistoryLimit
    if diff <= 0 {
        return nil
    }

    sort.Sort(controller.ReplicaSetsByCreationTimestamp(cleanableRSes))
    klog.V(4).Infof("Looking to cleanup old replica sets for deployment %q", deployment.Name)

    for i := int32(0); i < diff; i++ {
        rs := cleanableRSes[i]
        // Avoid delete replica set with non-zero replica counts
        if rs.Status.Replicas != 0 || *(rs.Spec.Replicas) != 0 || rs.Generation > rs.Status.ObservedGeneration || rs.DeletionTimestamp != nil {
            continue
        }
        klog.V(4).Infof("Trying to cleanup replica set %q for deployment %q", rs.Name, deployment.Name)
        if err := dc.client.AppsV1().ReplicaSets(rs.Namespace).Delete(rs.Name, nil); err != nil && !errors.IsNotFound(err) {
            // Return error instead of aggregating and continuing DELETEs on the theory
            // that we may be overloading the api server.
            return err
        }
    }

    return nil
}
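
The selection rule can be tried out without a cluster. This is a minimal sketch under stated assumptions: oldRS and pickForCleanup are hypothetical names, creation time is reduced to an integer, and the spec/status/generation checks are collapsed into a single Replicas field.

```go
package main

import (
	"fmt"
	"sort"
)

// oldRS is a hypothetical, simplified stand-in for an old ReplicaSet.
type oldRS struct {
	Name     string
	Created  int   // creation order; smaller means older
	Replicas int32 // spec and status replicas collapsed into one number
}

// pickForCleanup returns the names that would be deleted for a given history limit.
func pickForCleanup(old []oldRS, historyLimit int) []string {
	diff := len(old) - historyLimit
	if diff <= 0 {
		return nil
	}
	sorted := append([]oldRS(nil), old...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Created < sorted[j].Created })

	var doomed []string
	for _, rs := range sorted[:diff] {
		if rs.Replicas != 0 {
			continue // still has pods: skipped, and the window is not extended
		}
		doomed = append(doomed, rs.Name)
	}
	return doomed
}

func main() {
	old := []oldRS{{"rs-c", 3, 0}, {"rs-a", 1, 0}, {"rs-d", 4, 0}, {"rs-b", 2, 1}}
	fmt.Println(pickForCleanup(old, 2)) // [rs-a]
}
```

Only the diff oldest entries are examined. An entry that is skipped because it still has replicas is not replaced by a newer one, so a single pass can delete fewer replicasets than diff.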

Sample

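Rolling update example
The code above is dry, and reading the source alone does not give the full picture of a rolling update, so here is an example.

Create an nginx-deployment with 10 replicas. Once all 10 pods are up, the output looks like this: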

$ kubectl create -f nginx-dep.yaml
 
$ kubectl get rs
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-68b649bd8b   10        10        10      72m

Then update the image of nginx-deployment. The rolling update strategy is used by default:

$ kubectl set image deploy/nginx-deployment nginx-deployment=nginx:1.9.3
According to the source code, the controller now computes maxSurge, maxUnavailable and maxAvailable for this deployment, which are 3, 2 and 13 respectively:

// rounded up to 3
maxSurge = replicas * deployment.spec.strategy.rollingUpdate.maxSurge (25%) = 2.5

// rounded down to 2
maxUnavailable = replicas * deployment.spec.strategy.rollingUpdate.maxUnavailable (25%) = 2.5

maxAvailable = replicas (10) + maxSurge (3) = 13
As the code above shows, an update first creates newRS and then sets its replicas. The value is computed in NewRSNewReplicas and comes out as 3 here. The controller then updates the deployment's annotations and emits events, which completes this syncLoop. In the next syncLoop the replicas of all RSs add up to the maximum of 10 + 3 = 13, so oldRSs must now be scaled down. The scale-down amount comes from the following formula:

// 13 = 10 + 3
allPodsCount := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
 
// 8 = 10 - 2
minAvailable := *(deployment.Spec.Replicas) - maxUnavailable
 
// ???
newRSUnavailablePodCount := *(newRS.Spec.Replicas) - newRS.Status.AvailableReplicas
 
// 13 - 8 - ???
maxScaledDown := allPodsCount - minAvailable - newRSUnavailablePodCount

allPodsCount is the sum of replicas over allRSs, which is 13 here, and minAvailable is 8. newRSUnavailablePodCount is not known in advance but lies in [0,3]. Assume the three pods of newRS are still in ContainerCreating, so newRSUnavailablePodCount is 3. The formula then gives maxScaledDown = 13 - 8 - 3 = 2, so oldRS is scaled down by 2 pods and its replicas becomes 8, which completes this syncLoop. In the next syncLoop the scale-up step computes scaleUpCount = maxTotalPods - currentPodCount = 13 - (3 + 8) = 2, so the replicas of newRS is increased by 2. This repeats until newRS replicas reaches 10 and oldRSs replicas reaches 0.

You can watch this example with kubectl get rs -w. The output is:

$ kubectl get  rs -w
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-68b649bd8b   10        0         0       0s
nginx-deployment-68b649bd8b   10        10        0       0s
nginx-deployment-68b649bd8b   10        10        10      13s
 
nginx-deployment-689bff574f   3         0         0       0s
 
nginx-deployment-68b649bd8b   8         10        10      14s
 
nginx-deployment-689bff574f   3         0         0       0s
nginx-deployment-689bff574f   3         3         3       1s
 
nginx-deployment-689bff574f   5         3         0       0s
 
nginx-deployment-68b649bd8b   8         8         8       14s
 
nginx-deployment-689bff574f   5         3         0       0s
nginx-deployment-689bff574f   5         5         0       0s
 
nginx-deployment-689bff574f   5         5         5       6s
......
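
The same sequence can be reproduced with a small simulation. This is a rough sketch, not the controller: it models one old and one new ReplicaSet with plain counters, and it assumes that removed old pods disappear at once and that pending new pods all become available whenever a pass has nothing to do. The names state, syncOnce and simulate are hypothetical.

```go
package main

import "fmt"

// Values from the example above: 10 replicas, maxSurge 3, maxUnavailable 2.
const (
	replicas       int32 = 10
	maxSurge       int32 = 3
	maxUnavailable int32 = 2
)

// state holds desired (spec) and available pod counts for the old and new ReplicaSet.
type state struct {
	oldSpec, oldAvail, newSpec, newAvail int32
}

func min32(a, b int32) int32 {
	if a < b {
		return a
	}
	return b
}

// syncOnce performs at most one scaling action, like a single pass of
// rolloutRolling, and reports whether anything changed.
func syncOnce(s *state) bool {
	total := s.oldSpec + s.newSpec
	// Scale up the new ReplicaSet while the surge budget has room.
	if s.newSpec < replicas && total < replicas+maxSurge {
		s.newSpec += min32(replicas+maxSurge-total, replicas-s.newSpec)
		return true
	}
	if s.oldSpec == 0 {
		return false
	}
	// Scale down the old ReplicaSet without dropping below minAvailable.
	minAvailable := replicas - maxUnavailable
	maxScaledDown := total - minAvailable - (s.newSpec - s.newAvail)
	available := s.oldAvail + s.newAvail
	if maxScaledDown <= 0 || available <= minAvailable {
		return false
	}
	n := min32(s.oldSpec, available-minAvailable)
	s.oldSpec -= n
	s.oldAvail -= n // assumption: removed pods were available and vanish at once
	return true
}

// simulate runs sync passes until the rollout finishes and records every state.
func simulate() []state {
	s := state{oldSpec: replicas, oldAvail: replicas}
	history := []state{s}
	for i := 0; i < 100 && !(s.oldSpec == 0 && s.newAvail == replicas); i++ {
		if !syncOnce(&s) {
			s.newAvail = s.newSpec // assumption: pending new pods become available now
		}
		history = append(history, s)
	}
	return history
}

func main() {
	for _, s := range simulate() {
		fmt.Printf("old=%d new=%d available=%d\n", s.oldSpec, s.newSpec, s.oldAvail+s.newAvail)
	}
}
```

Under these assumptions the first steps are new=3, old=8, new=5, matching the watch output, and the total never exceeds 13 while the available count never drops below 8.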

dc.rolloutRecreate

Recreate, the other update strategy of a deployment, is blunt: the deployment first scales all old RSs down to 0, waits until all their pods are deleted, and only then creates the new RS.

func (dc *DeploymentController) syncDeployment(key string) error {
    ......
    switch d.Spec.Strategy.Type {
    case apps.RecreateDeploymentStrategyType:
        return dc.rolloutRecreate(d, rsList, podMap)
    case apps.RollingUpdateDeploymentStrategyType:
        return dc.rolloutRolling(d, rsList)
    }
    ......
}

syncDeployment checks the deployment's update strategy .Spec.Strategy.Type and calls dc.rolloutRecreate when it is Recreate.

Main logic of dc.rolloutRecreate:

  • Call dc.getAllReplicaSetsAndSyncRevision to get the newest replicaset and the list of old replicasets;
  • Call dc.scaleDownOldReplicaSetsForRecreate to scale the old replicasets down to 0 replicas. If any old replicaset had to be scaled down, call dc.syncRolloutStatus to update the deployment status and return;
  • Call oldPodsRunning to check whether any pod of the deployment is still running (a pod whose pod.Status.Phase is Failed or Succeeded is not running). If so, call dc.syncRolloutStatus to update the deployment status and return;
  • If the new replicaset has not been created yet, call dc.getAllReplicaSetsAndSyncRevision to create it (note: the newly created replicaset has 0 replicas);
  • Call dc.scaleUpNewReplicaSetForRecreate to scale up the newly created replicaset to the deployment's desired replica count (.Spec.Replicas);
  • Call util.DeploymentComplete to check whether all pods of the deployment are updated and available and no old pod is running. If so, call dc.cleanupDeployment, which deletes the oldest replicasets based on the deployment's revision history limit (.Spec.RevisionHistoryLimit) and each replicaset's creation time;
  • Call dc.syncRolloutStatus to update the deployment status.
// pkg/controller/deployment/recreate.go
// rolloutRecreate implements the logic for recreating a replica set.
func (dc *DeploymentController) rolloutRecreate(d *apps.Deployment, rsList []*apps.ReplicaSet, podMap map[types.UID][]*v1.Pod) error {
    // Don't create a new RS if not already existed, so that we avoid scaling up before scaling down.
    newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
    if err != nil {
        return err
    }
    allRSs := append(oldRSs, newRS)
    activeOldRSs := controller.FilterActiveReplicaSets(oldRSs)

    // scale down old replica sets.
    scaledDown, err := dc.scaleDownOldReplicaSetsForRecreate(activeOldRSs, d)
    if err != nil {
        return err
    }
    if scaledDown {
        // Update DeploymentStatus.
        return dc.syncRolloutStatus(allRSs, newRS, d)
    }

    // Do not process a deployment when it has old pods running.
    if oldPodsRunning(newRS, oldRSs, podMap) {
        return dc.syncRolloutStatus(allRSs, newRS, d)
    }

    // If we need to create a new RS, create it now.
    if newRS == nil {
        newRS, oldRSs, err = dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
        if err != nil {
            return err
        }
        allRSs = append(oldRSs, newRS)
    }

    // scale up new replica set.
    if _, err := dc.scaleUpNewReplicaSetForRecreate(newRS, d); err != nil {
        return err
    }

    if util.DeploymentComplete(d, &d.Status) {
        if err := dc.cleanupDeployment(oldRSs, d); err != nil {
            return err
        }
    }

    // Sync deployment status.
    return dc.syncRolloutStatus(allRSs, newRS, d)
}

dc.scaleDownOldReplicaSetsForRecreate: sets Spec.Replicas = 0 on all old ReplicaSets
dc.scaleUpNewReplicaSetForRecreate: the new RS is created and scaled up only after all pods managed by the old ReplicaSets have stopped

dc.getAllReplicaSetsAndSyncRevision

dc.getAllReplicaSetsAndSyncRevision collects all old replicasets and the newest replicaset, and returns them.

Which replicaset counts as the newest? The one whose pod template matches the deployment's.
Which replicasets count as old? All the others.

// pkg/controller/deployment/sync.go
func (dc *DeploymentController) getAllReplicaSetsAndSyncRevision(d *apps.Deployment, rsList []*apps.ReplicaSet, createIfNotExisted bool) (*apps.ReplicaSet, []*apps.ReplicaSet, error) {
    _, allOldRSs := deploymentutil.FindOldReplicaSets(d, rsList)

    // Get new replica set with the updated revision number
    newRS, err := dc.getNewReplicaSet(d, rsList, allOldRSs, createIfNotExisted)
    if err != nil {
        return nil, nil, err
    }

    return newRS, allOldRSs, nil
}

Whether a deployment already has a newRS is decided in deploymentutil.FindNewReplicaSet, which compares rs.Spec.Template with deployment.Spec.Template while ignoring the hash label. The operations above use this method several times, so it is described here.

dc.getAllReplicaSetsAndSyncRevision() --> dc.getNewReplicaSet() --> deploymentutil.FindNewReplicaSet() --> EqualIgnoreHash()
EqualIgnoreHash is shown below:

//k8s.io/kubernetes/pkg/controller/deployment/util/deployment_util.go:633

func EqualIgnoreHash(template1, template2 *v1.PodTemplateSpec) bool {
    t1Copy := template1.DeepCopy()
    t2Copy := template2.DeepCopy()
    // Remove hash labels from template.Labels before comparing
    delete(t1Copy.Labels, apps.DefaultDeploymentUniqueLabelKey)
    delete(t2Copy.Labels, apps.DefaultDeploymentUniqueLabelKey)
    return apiequality.Semantic.DeepEqual(t1Copy, t2Copy)
}
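
To see why the hash label has to be stripped before comparing, here is a minimal sketch with a heavily reduced template type. podTemplate, withoutHash and equalIgnoreHash are hypothetical names, reflect.DeepEqual stands in for apiequality.Semantic.DeepEqual, and the image tag nginx:1.7.9 is made up for the illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// hashLabel is the label key the controller adds to pod templates.
const hashLabel = "pod-template-hash"

// podTemplate is a hypothetical, heavily reduced pod template.
type podTemplate struct {
	Labels map[string]string
	Image  string
}

// withoutHash returns a copy of the template with the hash label removed.
func withoutHash(t podTemplate) podTemplate {
	labels := make(map[string]string, len(t.Labels))
	for k, v := range t.Labels {
		if k != hashLabel {
			labels[k] = v
		}
	}
	return podTemplate{Labels: labels, Image: t.Image}
}

// equalIgnoreHash compares two templates on everything except the hash label.
func equalIgnoreHash(a, b podTemplate) bool {
	return reflect.DeepEqual(withoutHash(a), withoutHash(b))
}

func main() {
	deploy := podTemplate{Labels: map[string]string{"app": "nginx"}, Image: "nginx:1.9.3"}
	current := podTemplate{Labels: map[string]string{"app": "nginx", hashLabel: "689bff574f"}, Image: "nginx:1.9.3"}
	stale := podTemplate{Labels: map[string]string{"app": "nginx", hashLabel: "68b649bd8b"}, Image: "nginx:1.7.9"}

	fmt.Println(equalIgnoreHash(deploy, current)) // true
	fmt.Println(equalIgnoreHash(deploy, stale))   // false
}
```

The deployment's own template never carries the hash label while the replicaset's template does, so a direct comparison would never match.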

dc.syncRolloutStatus

syncRolloutStatus computes the new status of the deployment and writes it back.

// pkg/controller/deployment/progress.go
func (dc *DeploymentController) syncRolloutStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, d *apps.Deployment) error {
    newStatus := calculateStatus(allRSs, newRS, d)
    ...
}

oldPodsRunning

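oldPodsRunning first returns true if the old replicasets still report actual replicas. Otherwise it walks all pods of the deployment, picks those that belong to old replicasets, and checks whether every pod.Status.Phase is Failed or Succeeded. If so, no old pod is running any more and it returns false.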

// pkg/controller/deployment/recreate.go
func oldPodsRunning(newRS *apps.ReplicaSet, oldRSs []*apps.ReplicaSet, podMap map[types.UID][]*v1.Pod) bool {
    if oldPods := util.GetActualReplicaCountForReplicaSets(oldRSs); oldPods > 0 {
        return true
    }
    for rsUID, podList := range podMap {
        // If the pods belong to the new ReplicaSet, ignore.
        if newRS != nil && newRS.UID == rsUID {
            continue
        }
        for _, pod := range podList {
            switch pod.Status.Phase {
            case v1.PodFailed, v1.PodSucceeded:
                // Don't count pods in terminal state.
                continue
            case v1.PodUnknown:
                // This happens in situation like when the node is temporarily disconnected from the cluster.
                // If we can't be sure that the pod is not running, we have to count it.
                return true
            default:
                // Pod is not in terminal phase.
                return true
            }
        }
    }
    return false
}
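
The phase check can be exercised in isolation. This is a minimal sketch that keeps only the pod loop: pods are reduced to their phase strings, grouped by the name of the owning replicaset instead of its UID, and the early return on the old replicasets' status is left out. The function name anyOldPodRunning is hypothetical.

```go
package main

import "fmt"

// Pod phases as plain strings.
const (
	phasePending   = "Pending"
	phaseRunning   = "Running"
	phaseSucceeded = "Succeeded"
	phaseFailed    = "Failed"
	phaseUnknown   = "Unknown"
)

// anyOldPodRunning reports whether a pod that does not belong to newRS
// is in a non-terminal phase. Unknown counts as running.
func anyOldPodRunning(newRS string, podsByRS map[string][]string) bool {
	for rs, phases := range podsByRS {
		if rs == newRS {
			continue // pods of the new ReplicaSet are ignored
		}
		for _, phase := range phases {
			if phase != phaseFailed && phase != phaseSucceeded {
				return true
			}
		}
	}
	return false
}

func main() {
	pods := map[string][]string{
		"old": {phaseFailed, phaseSucceeded},
		"new": {phaseRunning, phasePending},
	}
	fmt.Println(anyOldPodRunning("new", pods)) // false

	pods["old"] = append(pods["old"], phaseUnknown)
	fmt.Println(anyOldPodRunning("new", pods)) // true
}
```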

dc.getAllReplicaSetsAndSyncRevision

dc.getAllReplicaSetsAndSyncRevision returns the newest replicaset and the list of old replicasets. When createIfNotExisted is true and the new replicaset does not exist, the call to dc.getNewReplicaSet creates it (the new replicaset is created with 0 replicas).


dc.scaleDownOldReplicaSetsForRecreate

Walks all old replicasets and calls dc.scaleReplicaSetAndRecordEvent to scale each one down to 0 replicas.

// pkg/controller/deployment/recreate.go
func (dc *DeploymentController) scaleDownOldReplicaSetsForRecreate(oldRSs []*apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
    scaled := false
    for i := range oldRSs {
        rs := oldRSs[i]
        // Scaling not required.
        if *(rs.Spec.Replicas) == 0 {
            continue
        }
        scaledRS, updatedRS, err := dc.scaleReplicaSetAndRecordEvent(rs, 0, deployment)
        if err != nil {
            return false, err
        }
        if scaledRS {
            oldRSs[i] = updatedRS
            scaled = true
        }
    }
    return scaled, nil
}

dc.scaleUpNewReplicaSetForRecreate

Calls dc.scaleReplicaSetAndRecordEvent to set the replica count of the newest replicaset to the deployment's desired replica count.

// pkg/controller/deployment/recreate.go
func (dc *DeploymentController) scaleUpNewReplicaSetForRecreate(newRS *apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
    scaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, *(deployment.Spec.Replicas), deployment)
    return scaled, err
}

dc.rolloutRolling

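syncDeployment checks the deployment's update strategy .Spec.Strategy.Type and calls dc.rolloutRolling when it is RollingUpdate.

Main logic of dc.rolloutRolling:

  • Call dc.getAllReplicaSetsAndSyncRevision to get the newest replicaset and the list of old replicasets. If the new replicaset does not exist, it is created (with 0 replicas);
  • Call dc.reconcileNewReplicaSet to reconcile the new replicaset: based on .Spec.Strategy.RollingUpdate.MaxSurge and the number of existing pods, decide whether to scale it up and by how much;
  • If the new replicaset's replica count was changed during reconciliation, call dc.syncRolloutStatus to update the deployment status and return;
  • Call dc.reconcileOldReplicaSets: based on .Spec.Strategy.RollingUpdate.MaxUnavailable, the number of existing Available pods and the number of available pods owned by the new replicaset, decide whether to scale down the old replicasets and by how much;
  • If an old replicaset's replica count was changed during reconciliation, call dc.syncRolloutStatus to update the deployment status and return;
  • Call util.DeploymentComplete to check whether all pods of the deployment are updated and available and no old pod is running. If so, call dc.cleanupDeployment, which deletes the oldest replicasets based on the deployment's revision history limit (.Spec.RevisionHistoryLimit) and each replicaset's creation time;
  • Call dc.syncRolloutStatus to update the deployment status.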
// pkg/controller/deployment/rolling.go
// rolloutRolling implements the logic for rolling a new replica set.
func (dc *DeploymentController) rolloutRolling(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
    newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
    if err != nil {
        return err
    }
    allRSs := append(oldRSs, newRS)

    // Scale up, if we can.
    scaledUp, err := dc.reconcileNewReplicaSet(allRSs, newRS, d)
    if err != nil {
        return err
    }
    if scaledUp {
        // Update DeploymentStatus
        return dc.syncRolloutStatus(allRSs, newRS, d)
    }

    // Scale down, if we can.
    scaledDown, err := dc.reconcileOldReplicaSets(allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, d)
    if err != nil {
        return err
    }
    if scaledDown {
        // Update DeploymentStatus
        return dc.syncRolloutStatus(allRSs, newRS, d)
    }

    if deploymentutil.DeploymentComplete(d, &d.Status) {
        if err := dc.cleanupDeployment(oldRSs, d); err != nil {
            return err
        }
    }

    // Sync deployment status
    return dc.syncRolloutStatus(allRSs, newRS, d)
}

dc.reconcileNewReplicaSet

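dc.reconcileNewReplicaSet reconciles the new replicaset: based on the deployment's rolling update configuration and the number of existing pods, it decides whether to scale up the new replicaset.

Main logic:

  • If the new replicaset's replica count equals the deployment's declared replica count, nothing needs reconciling and it returns;
  • If the new replicaset's replica count is larger than the deployment's declared replica count, call dc.scaleReplicaSetAndRecordEvent to scale the replicaset down to the declared count, then return;
  • If the new replicaset's replica count is smaller than the deployment's declared replica count, call deploymentutil.NewRSNewReplicas, which uses .Spec.Strategy.RollingUpdate.MaxSurge to compute how many replicas the new replicaset should have, then call dc.scaleReplicaSetAndRecordEvent to update the replica count.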
// pkg/controller/deployment/rolling.go
func (dc *DeploymentController) reconcileNewReplicaSet(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
    if *(newRS.Spec.Replicas) == *(deployment.Spec.Replicas) {
        // Scaling not required.
        return false, nil
    }
    if *(newRS.Spec.Replicas) > *(deployment.Spec.Replicas) {
        // Scale down.
        scaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, *(deployment.Spec.Replicas), deployment)
        return scaled, err
    }
    newReplicasCount, err := deploymentutil.NewRSNewReplicas(deployment, allRSs, newRS)
    if err != nil {
        return false, err
    }
    scaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, newReplicasCount, deployment)
    return scaled, err
}

NewRSNewReplicas

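When the deployment uses the rolling update strategy, NewRSNewReplicas reads .Spec.Strategy.RollingUpdate.MaxSurge and calls intstrutil.GetValueFromIntOrPercent to compute maxSurge (the maximum number of pods allowed above the deployment's declared replica count during a rolling update). From maxSurge and the number of existing pods it then computes how many replicas the new replicaset should have.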

// pkg/controller/deployment/util/deployment_util.go
func NewRSNewReplicas(deployment *apps.Deployment, allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet) (int32, error) {
    switch deployment.Spec.Strategy.Type {
    case apps.RollingUpdateDeploymentStrategyType:
        // Check if we can scale up.
        maxSurge, err := intstrutil.GetValueFromIntOrPercent(deployment.Spec.Strategy.RollingUpdate.MaxSurge, int(*(deployment.Spec.Replicas)), true)
        if err != nil {
            return 0, err
        }
        // Find the total number of pods
        currentPodCount := GetReplicaCountForReplicaSets(allRSs)
        maxTotalPods := *(deployment.Spec.Replicas) + int32(maxSurge)
        if currentPodCount >= maxTotalPods {
            // Cannot scale up.
            return *(newRS.Spec.Replicas), nil
        }
        // Scale up.
        scaleUpCount := maxTotalPods - currentPodCount
        // Do not exceed the number of desired replicas.
        scaleUpCount = int32(integer.IntMin(int(scaleUpCount), int(*(deployment.Spec.Replicas)-*(newRS.Spec.Replicas))))
        return *(newRS.Spec.Replicas) + scaleUpCount, nil
    case apps.RecreateDeploymentStrategyType:
        return *(deployment.Spec.Replicas), nil
    default:
        return 0, fmt.Errorf("deployment type %v isn't supported", deployment.Spec.Strategy.Type)
    }
}
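
The RollingUpdate branch boils down to two limits: the surge budget and the deployment's own replica count. A minimal sketch with plain integers follows; the function name newRSReplicas is hypothetical and maxSurge is assumed to be already resolved to an absolute number.

```go
package main

import "fmt"

// newRSReplicas returns the replica count the new ReplicaSet should get.
// currentTotal is the sum of spec replicas over all ReplicaSets.
func newRSReplicas(desired, maxSurge, currentTotal, newRS int32) int32 {
	maxTotal := desired + maxSurge
	if currentTotal >= maxTotal {
		return newRS // surge budget used up, cannot scale up
	}
	scaleUp := maxTotal - currentTotal
	if room := desired - newRS; scaleUp > room {
		scaleUp = room // never exceed the desired replica count
	}
	return newRS + scaleUp
}

func main() {
	fmt.Println(newRSReplicas(10, 3, 10, 0)) // 3: fresh rollout, old RS still at 10
	fmt.Println(newRSReplicas(10, 3, 11, 3)) // 5: old RS at 8, new RS at 3
	fmt.Println(newRSReplicas(10, 3, 13, 5)) // 5: budget exhausted
	fmt.Println(newRSReplicas(10, 3, 8, 5))  // 10: capped by the desired count
}
```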

intstrutil.GetValueFromIntOrPercent

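Computing maxSurge is straightforward. When maxSurge is a percentage, the roundUp argument is true, so the formula is maxSurge = ⌈deployment.Spec.Strategy.RollingUpdate.MaxSurge * deployment.Spec.Replicas / 100⌉ (rounded up);
when maxSurge is not a percentage, its value is returned as is.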

// staging/src/k8s.io/apimachinery/pkg/util/intstr/intstr.go
func GetValueFromIntOrPercent(intOrPercent *IntOrString, total int, roundUp bool) (int, error) {
    if intOrPercent == nil {
        return 0, errors.New("nil value for IntOrString")
    }
    value, isPercent, err := getIntOrPercentValue(intOrPercent)
    if err != nil {
        return 0, fmt.Errorf("invalid value for IntOrString: %v", err)
    }
    if isPercent {
        if roundUp {
            value = int(math.Ceil(float64(value) * (float64(total)) / 100))
        } else {
            value = int(math.Floor(float64(value) * (float64(total)) / 100))
        }
    }
    return value, nil
}

dc.reconcileOldReplicaSets

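dc.reconcileOldReplicaSets reconciles the old replicasets: based on .Spec.Strategy.RollingUpdate.MaxUnavailable and the number of existing Available pods, it decides whether to scale down the old replicasets.

Main logic:

  • Get the total replica count of the old replicasets. If it is 0, the old replicasets cannot be scaled down any further, reconciliation is done and it returns;
  • Call deploymentutil.MaxUnavailable to compute maxUnavailable, the maximum number of unavailable pods (note: when MaxUnavailable and MaxSurge both resolve to 0, this returns 1, because a rolling update cannot proceed when both are 0);
  • From maxUnavailable, the deployment's desired replica count, the new replicaset's desired replica count and the new replicaset's Available replica count, compute maxScaledDown, the maximum number of replicas that may be removed. If maxScaledDown is less than or equal to 0, the old replicasets cannot be scaled down for now and it returns;
  • Call dc.cleanupUnhealthyReplicas, which sorts the replicasets by creation time and first scales down Unhealthy replicas (such as not-ready, unscheduled or pending pods); the details are not covered here;
  • Call dc.scaleDownOldReplicaSetsForRollingUpdate, which uses .Spec.Strategy.RollingUpdate.MaxUnavailable to compute how many replicas the old replicasets should have and calls dc.scaleReplicaSetAndRecordEvent to scale them down (so both dc.cleanupUnhealthyReplicas and dc.scaleDownOldReplicaSetsForRollingUpdate may scale down old replicasets);
  • Return true if more than 0 replicas were removed, otherwise false.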
// pkg/controller/deployment/rolling.go
func (dc *DeploymentController) reconcileOldReplicaSets(allRSs []*apps.ReplicaSet, oldRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
    oldPodsCount := deploymentutil.GetReplicaCountForReplicaSets(oldRSs)
    if oldPodsCount == 0 {
        // Can't scale down further
        return false, nil
    }

    allPodsCount := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
    klog.V(4).Infof("New replica set %s/%s has %d available pods.", newRS.Namespace, newRS.Name, newRS.Status.AvailableReplicas)
    maxUnavailable := deploymentutil.MaxUnavailable(*deployment)
    
    minAvailable := *(deployment.Spec.Replicas) - maxUnavailable
    newRSUnavailablePodCount := *(newRS.Spec.Replicas) - newRS.Status.AvailableReplicas
    maxScaledDown := allPodsCount - minAvailable - newRSUnavailablePodCount
    if maxScaledDown <= 0 {
        return false, nil
    }

    // Clean up unhealthy replicas first, otherwise unhealthy replicas will block deployment
    // and cause timeout. See https://github.com/kubernetes/kubernetes/issues/16737
    oldRSs, cleanupCount, err := dc.cleanupUnhealthyReplicas(oldRSs, deployment, maxScaledDown)
    if err != nil {
        return false, nil
    }
    klog.V(4).Infof("Cleaned up unhealthy replicas from old RSes by %d", cleanupCount)

    // Scale down old replica sets, need check maxUnavailable to ensure we can scale down
    allRSs = append(oldRSs, newRS)
    scaledDownCount, err := dc.scaleDownOldReplicaSetsForRollingUpdate(allRSs, oldRSs, deployment)
    if err != nil {
        return false, nil
    }
    klog.V(4).Infof("Scaled down old RSes of deployment %s by %d", deployment.Name, scaledDownCount)

    totalScaledDown := cleanupCount + scaledDownCount
    return totalScaledDown > 0, nil
}
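
The maxScaledDown gate is plain arithmetic, so it can be checked with the numbers from the earlier example. A minimal sketch; the function name maxScaledDown is borrowed from the local variable and the helper itself is hypothetical.

```go
package main

import "fmt"

// maxScaledDown returns how many old pods may be removed right now.
// A result <= 0 means the old ReplicaSets must not be scaled down yet.
func maxScaledDown(allPods, desired, maxUnavailable, newSpec, newAvailable int32) int32 {
	minAvailable := desired - maxUnavailable
	newUnavailable := newSpec - newAvailable
	return allPods - minAvailable - newUnavailable
}

func main() {
	// 10 old + 3 new pods, none of the new pods available yet.
	fmt.Println(maxScaledDown(13, 10, 2, 3, 0)) // 2
	// Same pods, but all 3 new pods are available.
	fmt.Println(maxScaledDown(13, 10, 2, 3, 3)) // 5
	// 8 old + 5 new pods, new pods still starting: blocked.
	fmt.Println(maxScaledDown(13, 10, 2, 5, 0)) // 0
}
```

Unavailable pods of the new replicaset are subtracted so that they are not counted as removable capacity. Without that term the old replicasets could keep shrinking while the new pods are still failing to start.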

dc.scaleDownOldReplicaSetsForRollingUpdate

dc.scaleDownOldReplicaSetsForRollingUpdate scales down the old replicasets. Main logic:

  • From .Spec.Strategy.RollingUpdate.MaxUnavailable and the number of existing Available pods, compute totalScaleDownCount, the number of replicas to remove now;
  • Sort the old replicasets by creation time;
  • Walk the old replicasets and scale them down until the total to remove is reached.
// pkg/controller/deployment/rolling.go
func (dc *DeploymentController) scaleDownOldReplicaSetsForRollingUpdate(allRSs []*apps.ReplicaSet, oldRSs []*apps.ReplicaSet, deployment *apps.Deployment) (int32, error) {
    maxUnavailable := deploymentutil.MaxUnavailable(*deployment)

    // Check if we can scale down.
    minAvailable := *(deployment.Spec.Replicas) - maxUnavailable
    // Find the number of available pods.
    availablePodCount := deploymentutil.GetAvailableReplicaCountForReplicaSets(allRSs)
    if availablePodCount <= minAvailable {
        // Cannot scale down.
        return 0, nil
    }
    klog.V(4).Infof("Found %d available pods in deployment %s, scaling down old RSes", availablePodCount, deployment.Name)

    sort.Sort(controller.ReplicaSetsByCreationTimestamp(oldRSs))

    totalScaledDown := int32(0)
    totalScaleDownCount := availablePodCount - minAvailable
    for _, targetRS := range oldRSs {
        if totalScaledDown >= totalScaleDownCount {
            // No further scaling required.
            break
        }
        if *(targetRS.Spec.Replicas) == 0 {
            // cannot scale down this ReplicaSet.
            continue
        }
        // Scale down.
        scaleDownCount := int32(integer.IntMin(int(*(targetRS.Spec.Replicas)), int(totalScaleDownCount-totalScaledDown)))
        newReplicasCount := *(targetRS.Spec.Replicas) - scaleDownCount
        if newReplicasCount > *(targetRS.Spec.Replicas) {
            return 0, fmt.Errorf("when scaling down old RS, got invalid request to scale down %s/%s %d -> %d", targetRS.Namespace, targetRS.Name, *(targetRS.Spec.Replicas), newReplicasCount)
        }
        _, _, err := dc.scaleReplicaSetAndRecordEvent(targetRS, newReplicasCount, deployment)
        if err != nil {
            return totalScaledDown, err
        }

        totalScaledDown += scaleDownCount
    }

    return totalScaledDown, nil
}
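
How the scale-down budget is spread over several old replicasets can be shown with a slice of sizes. A minimal sketch, assuming the slice is already sorted oldest first; the function name scaleDownOldestFirst is hypothetical.

```go
package main

import "fmt"

// scaleDownOldestFirst removes available-minAvailable replicas in total,
// taking them from the oldest ReplicaSets first, and returns the new sizes.
func scaleDownOldestFirst(oldSizes []int32, available, minAvailable int32) []int32 {
	out := append([]int32(nil), oldSizes...)
	if available <= minAvailable {
		return out // cannot scale down
	}
	remaining := available - minAvailable
	for i := range out {
		if remaining == 0 {
			break
		}
		n := out[i]
		if n > remaining {
			n = remaining
		}
		out[i] -= n
		remaining -= n
	}
	return out
}

func main() {
	// Three old ReplicaSets, 12 available pods, at least 8 must stay available.
	fmt.Println(scaleDownOldestFirst([]int32{2, 4, 3}, 12, 8)) // [0 2 3]
	// Already at the minimum: nothing changes.
	fmt.Println(scaleDownOldestFirst([]int32{2, 4, 3}, 8, 8)) // [2 4 3]
}
```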

Summary
The deployment controller is one of the many controllers in kube-controller-manager and is the controller for deployment objects. It watches deployments, replicasets and pods; when any of them changes, it reconciles the corresponding deployment, which covers scaling, pause and resume, updates, rollbacks, status updates and cleanup of the deployment's old replicasets.

All of these operations are invoked from syncDeployment, the controller's core processing method.

For updates, the deployment controller calls rolloutRecreate or rolloutRolling depending on the configured strategy, Recreate or RollingUpdate.

The analysis above also shows that the deployment controller does not delete deployment objects, and apart from removing surplus replicasets beyond the revision history limit it does not delete replicaset objects either (all other replicaset deletions are done by the garbagecollector controller).

How the deployment controller creates a replicaset

Whether the deployment uses the Recreate or the RollingUpdate strategy, dc.rolloutRecreate and dc.rolloutRolling both check whether the newest replicaset exists and create it if it does not.

After a deployment is created, the deployment controller receives its add event and reconciles it. On the first pass through dc.rolloutRecreate or dc.rolloutRolling the deployment owns no replicaset yet, so a new one is created.

deployment Recreate update flow

  • First scale down the old replicasets to 0 replicas;
  • Wait until all pods of the old replicasets are not running (a pod whose pod.Status.Phase is Failed or Succeeded is not running);
  • Then create the new replicaset (note: the newly created replicaset has 0 replicas);
  • Then scale up the newly created replicaset to the deployment's desired replica count;
  • Finally wait until all pods of the deployment belong to the newest replicaset, the pod count equals the desired replica count and all pods are Available; the update is then complete.
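
The ordering of the Recreate flow can be written as a small decision function. This is a simplified sketch, not the controller's code: recreateState and nextAction are hypothetical names, and creating and scaling the new replicaset are shown as separate steps to match the list above.

```go
package main

import "fmt"

// recreateState is a hypothetical snapshot of a Recreate rollout.
type recreateState struct {
	OldSpec        int32 // sum of spec replicas over the old ReplicaSets
	OldPodsRunning bool  // some old pod is neither Failed nor Succeeded
	NewExists      bool  // the new ReplicaSet has been created
	NewSpec        int32 // spec replicas of the new ReplicaSet
	Desired        int32 // deployment .Spec.Replicas
}

// nextAction returns the step the flow above would take for a given state.
func nextAction(s recreateState) string {
	switch {
	case s.OldSpec > 0:
		return "scale old ReplicaSets to 0"
	case s.OldPodsRunning:
		return "wait for old pods to stop"
	case !s.NewExists:
		return "create new ReplicaSet"
	case s.NewSpec != s.Desired:
		return "scale up new ReplicaSet"
	default:
		return "wait for pods to become available"
	}
}

func main() {
	fmt.Println(nextAction(recreateState{OldSpec: 10, OldPodsRunning: true, Desired: 10}))
	fmt.Println(nextAction(recreateState{OldPodsRunning: true, Desired: 10}))
	fmt.Println(nextAction(recreateState{Desired: 10}))
	fmt.Println(nextAction(recreateState{NewExists: true, Desired: 10}))
	fmt.Println(nextAction(recreateState{NewExists: true, NewSpec: 10, Desired: 10}))
}
```

Because nothing new is started until the old pods have stopped, a Recreate rollout always has a window with no serving pods, which is the price for never running two versions side by side.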

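deployment RollingUpdate update flow

  • Based on .Spec.Strategy.RollingUpdate.MaxSurge and the number of existing pods, decide whether to scale up the new replicaset and by how much;
  • Based on .Spec.Strategy.RollingUpdate.MaxUnavailable, the number of existing Available pods and the number of available pods owned by the new replicaset, decide whether to scale down the old replicasets and by how much;
  • Repeat these steps until all pods of the deployment belong to the newest replicaset, the pod count equals the desired replica count and all pods are Available; the rolling update is then complete.

Rate control in deployment rolling updates

Start with the two key parameters of the rolling update configuration:

  • .Spec.Strategy.RollingUpdate.MaxUnavailable: the upper bound on the number of unavailable pods during the update. It can be an absolute number (for example 5) or a percentage of the deployment's desired replica count (for example 10%); a percentage is multiplied by the desired replica count and rounded down. It cannot be 0 if maxSurge is 0. The default is 25%. The smaller the value, the more stable the service and the smoother the update.
  • .Spec.Strategy.RollingUpdate.MaxSurge: the number of pods that may be created above the desired pod count. It can be an absolute number (for example 5) or a percentage of the deployment's desired replica count (for example 10%); a percentage is multiplied by the desired replica count and rounded up. It cannot be 0 if MaxUnavailable is 0. The default is 25%. The larger the value, the faster the update.
Mnemonic: with percentages, maxSurge rounds up and maxUnavailable rounds down.

Note: MaxUnavailable and MaxSurge cannot both be configured as 0, but both may still resolve to 0 after the calculation. To keep the rolling update moving in that case, the deployment controller uses 1 for MaxUnavailable during the rolling update.

For example, take a deployment with 2 desired replicas, MaxSurge 0 and MaxUnavailable 1%. MaxUnavailable is a percentage and rounds down to 0, so MaxSurge and MaxUnavailable are both 0 and the controller uses 1 for MaxUnavailable. Once the rolling update starts, the old replicaset is immediately scaled down to 1 and the new replicaset is scaled up to 1. When the new pod is Available, the old replicaset is scaled down to 0 and the new replicaset is scaled up to 2. At least 1 pod is Available at any time during the rolling update.

As another example, take a deployment with 2 desired replicas, MaxSurge 1% and MaxUnavailable 0. MaxSurge rounds up to 1. Once the rolling update starts, the new replicaset is immediately scaled up to 1. When the new pod is Available, the old replicaset is scaled down to 1 and the new replicaset is scaled up to 2. When that pod is Available, the old replicaset is scaled down to 0. At least 2 pods are Available at any time during the rolling update.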
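
The rounding rules and the fallback to 1 fit in a few lines. This is a minimal sketch that accepts percentages only and uses integer arithmetic; the name resolveFenceposts is modeled on the upstream helper as I recall it and should be treated as an assumption, as should the signature.

```go
package main

import "fmt"

// resolveFenceposts turns percentage settings into absolute pod counts.
// surgePercent is rounded up, unavailablePercent is rounded down, and if
// both come out as 0 the unavailable count is forced to 1.
func resolveFenceposts(desired, surgePercent, unavailablePercent int) (surge, unavailable int) {
	surge = (surgePercent*desired + 99) / 100        // ceiling for non-negative inputs
	unavailable = unavailablePercent * desired / 100 // floor for non-negative inputs
	if surge == 0 && unavailable == 0 {
		unavailable = 1 // otherwise the rolling update could never make progress
	}
	return surge, unavailable
}

func main() {
	fmt.Println(resolveFenceposts(10, 25, 25)) // 3 2
	fmt.Println(resolveFenceposts(2, 0, 1))    // 0 1
	fmt.Println(resolveFenceposts(2, 1, 0))    // 1 0
}
```

The second and third calls correspond to the two examples above: with surge 0 the rollout has to remove an old pod first, and with unavailable 0 it has to add a new pod first.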

More examples:

// 2 desired, max unavailable 1%, surge 0% - should scale old(-1), then new(+1), then old(-1), then new(+1)
// 1 desired, max unavailable 1%, surge 0% - should scale old(-1), then new(+1)
// 2 desired, max unavailable 25%, surge 1% - should scale new(+1), then old(-1), then new(+1), then old(-1)
// 1 desired, max unavailable 25%, surge 1% - should scale new(+1), then old(-1)
// 2 desired, max unavailable 0%, surge 1% - should scale new(+1), then old(-1), then new(+1), then old(-1)
// 1 desired, max unavailable 0%, surge 1% - should scale new(+1), then old(-1)
