[kubeflow] controller-runtime Source Code Walkthrough

[TODO] Restructure this article's narrative around the official controller-runtime documentation.

In the previous article, [kubeflow] 从零搭建training-operator项目, we built a bare-bones training-operator project from scratch; the only piece left unfinished was the controller's Reconcile logic. This time we take TFJob's Reconcile function as the entry point and investigate how training-operator actually works. Before doing that, we need to understand how controller-runtime itself works.

controller-runtime Source Code Analysis

controller-runtime is a controller framework maintained by the community. Together with tooling such as kubebuilder, it lets developers focus solely on implementing the Reconcile function, which is very convenient. The figures below do not depict controller-runtime itself, but they come close: a Worker corresponds to a reconciler, which consumes reconcile.Requests popped from the work queue, and "Readonly" refers to the listers such as podLister and serviceLister.

[Figure 1]

[Figure 2]
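
To make that division of labor concrete, here is a minimal sketch of what a controller-runtime user typically writes: only the Reconcile body is custom, while the cache, informers, work queue, and workers are all handled by the framework. The PodReconciler name and the Pod target are illustrative assumptions, not anything from training-operator.

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodReconciler is a hypothetical reconciler, used only for illustration.
type PodReconciler struct {
	client.Client
}

// Reconcile is the only piece of logic the user must supply.
func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		// The object may already be gone; that is not an error for us.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ... compare the pod's actual state with the desired state and fix drift ...
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	// The builder wires up Watch/Start for us under the hood.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		Complete(&PodReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}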

Controller

The controller-runtime version discussed below is v0.15.0; other versions may differ slightly. After downloading training-operator, run go mod tidy to fetch the dependencies. Open the project in VS Code, find the controller-related code, and Ctrl+click to jump into the source. My analysis of controller-runtime below largely follows the article operator:controller-runtime 原理之控制器, so the overall flow is much the same.

Let's start with the definition of the Controller interface, in controller-runtime@v0.15.0/pkg/controller/controller.go. The core is these four methods. Note that reconcile.Reconciler is itself an interface containing a single Reconcile() definition, which is what the user implements.

// Controller implements a Kubernetes API.  A Controller manages a work queue fed reconcile.Requests
// from source.Sources.  Work is performed through the reconcile.Reconciler for each enqueued item.
// Work typically is reads and writes Kubernetes objects to make the system state match the state specified
// in the object Spec.
type Controller interface {
	// Reconciler is called to reconcile an object by Namespace/Name
	reconcile.Reconciler

	// Watch takes events provided by a Source and uses the EventHandler to
	// enqueue reconcile.Requests in response to the events.
	//
	// Watch may be provided one or more Predicates to filter events before
	// they are given to the EventHandler.  Events will be passed to the
	// EventHandler if all provided Predicates evaluate to true.
	Watch(src source.Source, eventhandler handler.EventHandler, predicates ...predicate.Predicate) error

	// Start starts the controller.  Start blocks until the context is closed or a
	// controller has an error starting.
	Start(ctx context.Context) error

	// GetLogger returns this controller logger prefilled with basic information.
	GetLogger() logr.Logger
}

Now the concrete implementation, in controller-runtime@v0.15.0/pkg/internal/controller/controller.go. MakeQueue initializes the rate-limited work queue Queue. The Do field is the reconciler, which will eventually run the user's Reconcile code. mu is a mutex that synchronizes the controller's setup, so Watch and Start cannot race. Started records whether the controller is running. startWatches stores all the watchDescription objects; each watchDescription bundles a src, a handler, and predicates.

// Controller implements controller.Controller.
type Controller struct {
	// Name is used to uniquely identify a Controller in tracing, logging and monitoring.  Name is required.
	Name string

	// MaxConcurrentReconciles is the maximum number of concurrent Reconciles which can be run. Defaults to 1.
	MaxConcurrentReconciles int

	// Reconciler is a function that can be called at any time with the Name / Namespace of an object and
	// ensures that the state of the system matches the state specified in the object.
	// Defaults to the DefaultReconcileFunc.
	Do reconcile.Reconciler

	// MakeQueue constructs the queue for this controller once the controller is ready to start.
	// This exists because the standard Kubernetes workqueues start themselves immediately, which
	// leads to goroutine leaks if something calls controller.New repeatedly.
	MakeQueue func() workqueue.RateLimitingInterface

	// Queue is an listeningQueue that listens for events from Informers and adds object keys to
	// the Queue for processing
	Queue workqueue.RateLimitingInterface

	// mu is used to synchronize Controller setup
	mu sync.Mutex

	// Started is true if the Controller has been Started
	Started bool

	// ctx is the context that was passed to Start() and used when starting watches.
	//
	// According to the docs, contexts should not be stored in a struct: https://golang.org/pkg/context,
	// while we usually always strive to follow best practices, we consider this a legacy case and it should
	// undergo a major refactoring and redesign to allow for context to not be stored in a struct.
	ctx context.Context

	// CacheSyncTimeout refers to the time limit set on waiting for cache to sync
	// Defaults to 2 minutes if not set.
	CacheSyncTimeout time.Duration

	// startWatches maintains a list of sources, handlers, and predicates to start when the controller is started.
	startWatches []watchDescription

	// LogConstructor is used to construct a logger to then log messages to users during reconciliation,
	// or for example when a watch is started.
	// Note: LogConstructor has to be able to handle nil requests as we are also using it
	// outside the context of a reconciliation.
	LogConstructor func(request *reconcile.Request) logr.Logger

	// RecoverPanic indicates whether the panic caused by reconcile should be recovered.
	RecoverPanic *bool

	// LeaderElected indicates whether the controller is leader elected or always running.
	LeaderElected *bool
}

// watchDescription contains all the information necessary to start a watch.
type watchDescription struct {
	src        source.Source
	handler    handler.EventHandler
	predicates []predicate.Predicate
}

Controller.Watch

Now look at the Watch implementation, in controller-runtime@v0.15.0/pkg/internal/controller/controller.go. If the controller has not started yet, the watch is merely recorded in startWatches; otherwise Watch delegates to Source.Start. What is src? It is the object we want to observe: we want to be notified of create/update/delete events on src and have the eventHandler react accordingly. Note that the work queue Queue is not yet initialized when Watch runs, because src.Start only uses the Cache to initialize an informer and register event handlers, as we will see shortly.

// Watch implements controller.Controller.
func (c *Controller) Watch(src source.Source, evthdler handler.EventHandler, prct ...predicate.Predicate) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	// Controller hasn't started yet, store the watches locally and return.
	//
	// These watches are going to be held on the controller struct until the manager or user calls Start(...).
	if !c.Started {
		c.startWatches = append(c.startWatches, watchDescription{src: src, handler: evthdler, predicates: prct})
		return nil
	}

	c.LogConstructor(nil).Info("Starting EventSource", "source", src)
	return src.Start(c.ctx, evthdler, c.Queue, prct...)
}

Source is also an interface; its definition is in controller-runtime@v0.15.0/pkg/source/source.go, and it contains a single Start function.

// Source is a source of events (e.g. Create, Update, Delete operations on Kubernetes Objects, Webhook callbacks, etc)
// which should be processed by event.EventHandlers to enqueue reconcile.Requests.
//
// * Use Kind for events originating in the cluster (e.g. Pod Create, Pod Update, Deployment Update).
//
// * Use Channel for events originating outside the cluster (e.g. GitHub Webhook callback, Polling external urls).
//
// Users may build their own Source implementations.
type Source interface {
	// Start is internal and should be called only by the Controller to register an EventHandler with the Informer
	// to enqueue reconcile.Requests.
	Start(context.Context, handler.EventHandler, workqueue.RateLimitingInterface, ...predicate.Predicate) error
}
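
As the comment says, Kind serves events originating inside the cluster, while Channel serves external ones. As a quick sketch of the latter (assuming the v0.15 API; the events channel and the controller variable c are made up), an external event can be fed into the same machinery like this:

// Assumed imports: controller-runtime's pkg/source, pkg/handler, pkg/event.
// Events pushed into this channel flow through the same EventHandler path
// and end up as reconcile.Requests on the controller's work queue.
events := make(chan event.GenericEvent)

// c is a controller.Controller that has already been created.
if err := c.Watch(
	&source.Channel{Source: events},
	&handler.EnqueueRequestForObject{},
); err != nil {
	return err
}

// Somewhere else, e.g. in a webhook callback:
// events <- event.GenericEvent{Object: someClientObject}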

Next, let's see how training-operator calls Watch, in pkg/controller.v1/tensorflow/tfjob_controller.go. Note that:

  • source.Kind(...) constructs a Kind struct as the argument; Kind is an implementation of the Source interface. kubeflowv1.TFJob{} is the resource we care about, and Kind wraps it.
  • handler.EnqueueRequestForObject{} is a concrete implementation of the EventHandler interface, with Create, Delete, Update, and other methods; more on it later.
  • predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()} is a user-supplied predicate that decides whether an event is worth enqueuing at all; also covered later.
	// using onOwnerCreateFunc is easier to set defaults
	if err = c.Watch(source.Kind(mgr.GetCache(), &kubeflowv1.TFJob{}), &handler.EnqueueRequestForObject{},
		predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()},
	); err != nil {
		return err
	}

Now the Kind implementation, in controller-runtime@v0.15.0/pkg/internal/source/kind.go. Type is the concrete type we want to watch, and Kind wraps it, adding a Cache (supplied by the manager) that can hand out informers. After all, Kind has to implement a function as important as Start, so it naturally needs the right tooling.

// Kind is used to provide a source of events originating inside the cluster from Watches (e.g. Pod Create).
type Kind struct {
	// Type is the type of object to watch.  e.g. &v1.Pod{}
	Type client.Object

	// Cache used to watch APIs
	Cache cache.Cache

	// started may contain an error if one was encountered during startup. If its closed and does not
	// contain an error, startup and syncing finished.
	started     chan error
	startCancel func()
}

Putting the pieces together, Controller.Watch ultimately calls Kind.Start, whose implementation is below. The core is calling i.AddEventHandler to watch for changes to the resource and dispatch them to the handler. Anyone who has used informers will recognize AddEventHandler; note that the informer's add/update/delete callbacks fire after the fact, i.e. by the time they run, the resource has already been created, updated, or deleted. i is an informer, obtained via Kind.Cache.GetInformer.

// Start is internal and should be called only by the Controller to register an EventHandler with the Informer
// to enqueue reconcile.Requests.
func (ks *Kind) Start(ctx context.Context, handler handler.EventHandler, queue workqueue.RateLimitingInterface,
	// ...

	// cache.GetInformer will block until its context is cancelled if the cache was already started and it can not
	// sync that informer (most commonly due to RBAC issues).
	ctx, ks.startCancel = context.WithCancel(ctx)
	ks.started = make(chan error)
	go func() {
		var (
			i       cache.Informer
			lastErr error
		)

		// ...
		i, lastErr = ks.Cache.GetInformer(ctx, ks.Type)
		// ...

		_, err := i.AddEventHandler(NewEventHandler(ctx, queue, handler, prct).HandlerFuncs())
		// ...
	}()

	return nil
}
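
The same two steps can be reproduced by hand against the manager's cache. A sketch under assumptions (the Pod type and handler body are illustrative; toolscache is "k8s.io/client-go/tools/cache"); note that in v0.15 AddEventHandler returns a registration handle plus an error, which is why the source above discards the first return value:

// Obtain (and lazily create) the shared informer for Pods from the cache.
inf, err := mgr.GetCache().GetInformer(ctx, &corev1.Pod{})
if err != nil {
	return err
}
// Register plain client-go style callbacks; they fire after the fact.
if _, err := inf.AddEventHandler(toolscache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		// the object already exists in the cluster at this point
	},
}); err != nil {
	return err
}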

NewEventHandler constructs an EventHandler whose members are an event-processing handler, the rate-limited work queue, and a set of predicates. Careful: this EventHandler is a struct, while its handler.EventHandler member is an interface; despite the identical names they are different things, so don't mix them up. handler is the handler.EnqueueRequestForObject{} mentioned earlier, and predicates is the predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()} mentioned earlier; both are supplied by the user. OnAdd first runs the predicates, and only if all of them pass does it call the handler's Create function, which pushes a reconcile.Request onto the queue. OnUpdate and OnDelete follow the same logic.

// EventHandler adapts a handler.EventHandler interface to a cache.ResourceEventHandler interface.
type EventHandler struct {
	// ctx stores the context that created the event handler
	// that is used to propagate cancellation signals to each handler function.
	ctx context.Context

	handler    handler.EventHandler
	queue      workqueue.RateLimitingInterface
	predicates []predicate.Predicate
}

// HandlerFuncs converts EventHandler to a ResourceEventHandlerFuncs
// TODO: switch to ResourceEventHandlerDetailedFuncs with client-go 1.27
func (e *EventHandler) HandlerFuncs() cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		AddFunc:    e.OnAdd,
		UpdateFunc: e.OnUpdate,
		DeleteFunc: e.OnDelete,
	}
}

// OnAdd creates CreateEvent and calls Create on EventHandler.
func (e *EventHandler) OnAdd(obj interface{}) {
	c := event.CreateEvent{}

	// ...

	for _, p := range e.predicates {
		if !p.Create(c) {
			return
		}
	}

	// Invoke create handler
	ctx, cancel := context.WithCancel(e.ctx)
	defer cancel()
	e.handler.Create(ctx, c, e.queue)
}

Now see how EnqueueRequestForObject implements Create, in controller-runtime@v0.15.0/pkg/handler/enqueue.go. The logic is as simple as the name suggests: take the object's own namespace and name, wrap them in a reconcile.Request, and push it onto the work queue. Its counterpart, enqueueRequestForOwner, instead enqueues the namespace and name of the object's owner. Watches on pods and services need enqueueRequestForOwner, because pods and services carry an ownerReference field that points at their owner (the TFJob); the watch on TFJob itself uses EnqueueRequestForObject. That is exactly how tfjob_controller.go uses them.

// EnqueueRequestForObject enqueues a Request containing the Name and Namespace of the object that is the source of the Event.
// (e.g. the created / deleted / updated objects Name and Namespace).  handler.EnqueueRequestForObject is used by almost all
// Controllers that have associated Resources (e.g. CRDs) to reconcile the associated Resource.
type EnqueueRequestForObject struct{}

// Create implements EventHandler.
func (e *EnqueueRequestForObject) Create(ctx context.Context, evt event.CreateEvent, q workqueue.RateLimitingInterface) {
	if evt.Object == nil {
		enqueueLog.Error(nil, "CreateEvent received with no metadata", "event", evt)
		return
	}
	q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
		Name:      evt.Object.GetName(),
		Namespace: evt.Object.GetNamespace(),
	}})
}
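
For completeness, here is a sketch of the owner-based variant. In v0.15, handler.EnqueueRequestForOwner is a constructor function; the call below mirrors the pattern training-operator uses for its Pod and Service watches, but the exact arguments are my assumptions, not a copy of its code:

// Watch Pods, but enqueue the owning TFJob's namespace/name instead of the Pod's.
if err = c.Watch(
	source.Kind(mgr.GetCache(), &corev1.Pod{}),
	handler.EnqueueRequestForOwner(
		mgr.GetScheme(), mgr.GetRESTMapper(),
		&kubeflowv1.TFJob{},
		handler.OnlyControllerOwner(), // only follow the controller ownerReference
	),
); err != nil {
	return err
}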

Now look at the predicate training-operator supplies, predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()}. The logic is straightforward: the predicate always returns true, but when the object is a TFJob it additionally applies scheme defaults and updates the job's conditions. Since the informer only notifies after the resource has already been created, the status can immediately be marked JobCreated.

// onOwnerCreateFunc modify creation condition.
func (r *TFJobReconciler) onOwnerCreateFunc() func(event.CreateEvent) bool {
	return func(e event.CreateEvent) bool {
		tfJob, ok := e.Object.(*kubeflowv1.TFJob)
		if !ok {
			return true
		}

		r.Scheme.Default(tfJob)
		msg := fmt.Sprintf("TFJob %s is created.", e.Object.GetName())
		logrus.Info(msg)
		trainingoperatorcommon.CreatedJobsCounterInc(tfJob.Namespace, r.GetFrameworkName())
		commonutil.UpdateJobConditions(&tfJob.Status, kubeflowv1.JobCreated, corev1.ConditionTrue, commonutil.NewReason(kubeflowv1.TFJobKind, commonutil.JobCreatedReason), msg)
		return true
	}
}
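
Predicates are the natural place for cheap event filtering. A common example (a sketch, not from training-operator): drop Update events in which the spec did not change, by comparing metadata.generation, which the API server bumps only on spec changes:

// Only let Update events through when the spec (generation) changed.
specChanged := predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
	},
}
// controller-runtime ships an equivalent built-in: predicate.GenerationChangedPredicate{}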

Controller.Start

That wraps up Watch: in one sentence, it registers informers. Now for Start, located in controller-runtime@v0.15.0/pkg/internal/controller/controller.go. It first takes the lock to synchronize setup and rejects a second Start of the same controller. Queue is initialized via MakeQueue (it was not initialized during Watch). Then src.Start is invoked for every entry in startWatches; for watches registered before the controller started this is the first and only call, since Controller.Watch only calls src.Start itself when the controller is already running. Finally, up to MaxConcurrentReconciles goroutines run processNextWorkItem concurrently, consuming reconcile.Requests from the Queue.

// Start implements controller.Controller.
func (c *Controller) Start(ctx context.Context) error {
	// use an IIFE to get proper lock handling
	// but lock outside to get proper handling of the queue shutdown
	c.mu.Lock()
	if c.Started {
		return errors.New("controller was started more than once. This is likely to be caused by being added to a manager multiple times")
	}

	c.initMetrics()

	// Set the internal context.
	c.ctx = ctx

	c.Queue = c.MakeQueue()
	go func() {
		<-ctx.Done()
		c.Queue.ShutDown()
	}()

	wg := &sync.WaitGroup{}
	err := func() error {
		defer c.mu.Unlock()

		// TODO(pwittrock): Reconsider HandleCrash
		defer utilruntime.HandleCrash()

		// NB(directxman12): launch the sources *before* trying to wait for the
		// caches to sync so that they have a chance to register their intended
		// caches.
		for _, watch := range c.startWatches {
			c.LogConstructor(nil).Info("Starting EventSource", "source", fmt.Sprintf("%s", watch.src))

			if err := watch.src.Start(ctx, watch.handler, c.Queue, watch.predicates...); err != nil {
				return err
			}
		}

		// Start the SharedIndexInformer factories to begin populating the SharedIndexInformer caches
		c.LogConstructor(nil).Info("Starting Controller")

		for _, watch := range c.startWatches {
			syncingSource, ok := watch.src.(source.SyncingSource)
			if !ok {
				continue
			}

			if err := func() error {
				// use a context with timeout for launching sources and syncing caches.
				sourceStartCtx, cancel := context.WithTimeout(ctx, c.CacheSyncTimeout)
				defer cancel()

				// WaitForSync waits for a definitive timeout, and returns if there
				// is an error or a timeout
				if err := syncingSource.WaitForSync(sourceStartCtx); err != nil {
					err := fmt.Errorf("failed to wait for %s caches to sync: %w", c.Name, err)
					c.LogConstructor(nil).Error(err, "Could not wait for Cache to sync")
					return err
				}

				return nil
			}(); err != nil {
				return err
			}
		}

		// All the watches have been started, we can reset the local slice.
		//
		// We should never hold watches more than necessary, each watch source can hold a backing cache,
		// which won't be garbage collected if we hold a reference to it.
		c.startWatches = nil

		// Launch workers to process resources
		c.LogConstructor(nil).Info("Starting workers", "worker count", c.MaxConcurrentReconciles)
		wg.Add(c.MaxConcurrentReconciles)
		for i := 0; i < c.MaxConcurrentReconciles; i++ {
			go func() {
				defer wg.Done()
				// Run a worker thread that just dequeues items, processes them, and marks them done.
				// It enforces that the reconcileHandler is never invoked concurrently with the same object.
				for c.processNextWorkItem(ctx) {
				}
			}()
		}

		c.Started = true
		return nil
	}()
	if err != nil {
		return err
	}

	<-ctx.Done()
	c.LogConstructor(nil).Info("Shutdown signal received, waiting for all workers to finish")
	wg.Wait()
	c.LogConstructor(nil).Info("All workers finished")
	return nil
}
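
The worker count is usually set when building the controller: MaxConcurrentReconciles goes into controller.Options. A minimal sketch (the value 4 is arbitrary, and r is assumed to be your reconciler):

// import "sigs.k8s.io/controller-runtime/pkg/controller"

// Run up to 4 reconcile workers concurrently; the workqueue still guarantees
// that the same key is never processed by two workers at once.
err := ctrl.NewControllerManagedBy(mgr).
	For(&kubeflowv1.TFJob{}).
	WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
	Complete(r)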

The code of processNextWorkItem is below; it essentially delegates to reconcileHandler.

// processNextWorkItem will read a single work item off the workqueue and
// attempt to process it, by calling the reconcileHandler.
func (c *Controller) processNextWorkItem(ctx context.Context) bool {
	obj, shutdown := c.Queue.Get()
	if shutdown {
		// Stop working
		return false
	}

	// We call Done here so the workqueue knows we have finished
	// processing this item. We also must remember to call Forget if we
	// do not want this work item being re-queued. For example, we do
	// not call Forget if a transient error occurs, instead the item is
	// put back on the workqueue and attempted again after a back-off
	// period.
	defer c.Queue.Done(obj)

	ctrlmetrics.ActiveWorkers.WithLabelValues(c.Name).Add(1)
	defer ctrlmetrics.ActiveWorkers.WithLabelValues(c.Name).Add(-1)

	c.reconcileHandler(ctx, obj)
	return true
}

The code of reconcileHandler is below; it in turn calls the Reconcile function.

func (c *Controller) reconcileHandler(ctx context.Context, obj interface{}) {
	// Update metrics after processing each item
	reconcileStartTS := time.Now()
	defer func() {
		c.updateMetrics(time.Since(reconcileStartTS))
	}()

	// Make sure that the object is a valid request.
	req, ok := obj.(reconcile.Request)
	// ...

	log := c.LogConstructor(&req)
	reconcileID := uuid.NewUUID()

	log = log.WithValues("reconcileID", reconcileID)
	ctx = logf.IntoContext(ctx, log)
	ctx = addReconcileID(ctx, reconcileID)

	// RunInformersAndControllers the syncHandler, passing it the Namespace/Name string of the
	// resource to be synced.
	result, err := c.Reconcile(ctx, req)
	// ...
}
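
The elided part after c.Reconcile decides what happens to the request based on (result, err). Treat the following as a rough paraphrase of the v0.15 behavior rather than the verbatim source (metrics calls omitted):

// Sketch of the elided branches in reconcileHandler:
switch {
case err != nil:
	c.Queue.AddRateLimited(req) // transient failure: retry with backoff
case result.RequeueAfter > 0:
	// Forget resets the rate limiter for this key, then requeue after the delay.
	c.Queue.Forget(obj)
	c.Queue.AddAfter(req, result.RequeueAfter)
case result.Requeue:
	c.Queue.AddRateLimited(req)
default:
	c.Queue.Forget(obj) // success: drop the key from the rate limiter
}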

Finally, c.Do.Reconcile(ctx, req) is the user-written Reconcile function.

// Reconcile implements reconcile.Reconciler.
func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (_ reconcile.Result, err error) {
	defer func() {
		if r := recover(); r != nil {
			if c.RecoverPanic != nil && *c.RecoverPanic {
				for _, fn := range utilruntime.PanicHandlers {
					fn(r)
				}
				err = fmt.Errorf("panic: %v [recovered]", r)
				return
			}

			log := logf.FromContext(ctx)
			log.Info(fmt.Sprintf("Observed a panic in reconciler: %v", r))
			panic(r)
		}
	}()
	return c.Do.Reconcile(ctx, req)
}
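
From the user's side, the returned (Result, error) pair is the contract that drives the queue behavior sketched above. A schematic example (not training-operator's actual body):

func (r *TFJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// return ctrl.Result{}, err            -> retried with rate-limited backoff
	// return ctrl.Result{Requeue: true}    -> retried with rate-limited backoff
	// return ctrl.Result{RequeueAfter: t}  -> re-enqueued after duration t
	// return ctrl.Result{}, nil            -> done; the rate limiter forgets the key
	return ctrl.Result{}, nil
}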
