大家好,我是南哥,今天和大家一起阅读 k8s endpoint 控制器源码。如果本文对你有一些帮助,请帮忙转发一下!
## EndpointSubset
EndpointSubset 是一组具有公共端口集的地址,展开后的端点集合是 Addresses (Pod IP 地址) 与 Ports (端口名称和端口号) 的笛卡尔积。
下面是一个典型的 EndpointSubset 示例:
Name: "test",
Subsets: [
{
Addresses: [
{
"ip": "10.10.1.1"
},
{
"ip": "10.10.2.2"
}
],
Ports: [
{
"name": "a",
"port": 8675
},
{
"name": "b",
"port": 309
}
]
}
]
```
将上面的 Subset 转换为对应的端点集合:
```
a: [ 10.10.1.1:8675, 10.10.2.2:8675 ]
b: [ 10.10.1.1:309, 10.10.2.2:309 ]
```
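为了更直观地理解这里的"笛卡尔积",下面给出一段独立的演示代码(非 k8s 源码,Address/Port/Subset 是为演示精简定义的类型),把上面的 Subset 展开成端点列表:

```go
package main

import "fmt"

// 仅用于演示的精简结构,字段含义对应 v1.EndpointSubset
type Address struct{ IP string }
type Port struct {
	Name string
	Port int
}
type Subset struct {
	Addresses []Address
	Ports     []Port
}

func main() {
	subset := Subset{
		Addresses: []Address{{IP: "10.10.1.1"}, {IP: "10.10.2.2"}},
		Ports:     []Port{{Name: "a", Port: 8675}, {Name: "b", Port: 309}},
	}

	// Addresses x Ports 的笛卡尔积,就是该 Subset 所表示的全部端点
	for _, p := range subset.Ports {
		for _, a := range subset.Addresses {
			fmt.Printf("%s: %s:%d\n", p.Name, a.IP, p.Port)
		}
	}
}
```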
## EndPointController
首先来看看 Endpoints 控制器对象,该对象是实现 Endpoints 功能的核心对象。
```go
// Controller manages selector-based service endpoints.
type Controller struct {
client clientset.Interface
eventBroadcaster record.EventBroadcaster
eventRecorder record.EventRecorder
// serviceLister is able to list/get services and is populated by the shared informer passed to
// NewEndpointController.
serviceLister corelisters.ServiceLister
// servicesSynced returns true if the service shared informer has been synced at least once.
// Added as a member to the struct to allow injection for testing.
servicesSynced cache.InformerSynced
// podLister is able to list/get pods and is populated by the shared informer passed to
// NewEndpointController.
podLister corelisters.PodLister
// podsSynced returns true if the pod shared informer has been synced at least once.
// Added as a member to the struct to allow injection for testing.
podsSynced cache.InformerSynced
// endpointsLister is able to list/get endpoints and is populated by the shared informer passed to
// NewEndpointController.
endpointsLister corelisters.EndpointsLister
// endpointsSynced returns true if the endpoints shared informer has been synced at least once.
// Added as a member to the struct to allow injection for testing.
endpointsSynced cache.InformerSynced
// Services that need to be updated. A channel is inappropriate here,
// because it allows services with lots of pods to be serviced much
// more often than services with few pods; it also would cause a
// service that's inserted multiple times to be processed more than
// necessary.
queue workqueue.TypedRateLimitingInterface[string]
// workerLoopPeriod is the time between worker runs. The workers process the queue of service and pod changes.
workerLoopPeriod time.Duration
// triggerTimeTracker is an util used to compute and export the EndpointsLastChangeTriggerTime
// annotation.
triggerTimeTracker *endpointsliceutil.TriggerTimeTracker
endpointUpdatesBatchPeriod time.Duration
}
```
### 初始化
NewEndpointController 方法用于 EndPoint 控制器对象的初始化工作,并返回一个实例化对象。控制器对象同时订阅了 Service、Pod、Endpoints 三种资源的变更事件。
```go
// NewEndpointController returns a new *Controller.
func NewEndpointController(ctx context.Context, podInformer coreinformers.PodInformer, serviceInformer coreinformers.ServiceInformer,
endpointsInformer coreinformers.EndpointsInformer, client clientset.Interface, endpointUpdatesBatchPeriod time.Duration) *Controller {
broadcaster := record.NewBroadcaster(record.WithContext(ctx))
recorder := broadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "endpoint-controller"})
e := &Controller{
client: client,
queue: workqueue.NewTypedRateLimitingQueueWithConfig(
workqueue.DefaultTypedControllerRateLimiter[string](),
workqueue.TypedRateLimitingQueueConfig[string]{
Name: "endpoint",
},
),
workerLoopPeriod: time.Second,
}
serviceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: e.onServiceUpdate,
UpdateFunc: func(old, cur interface{}) {
e.onServiceUpdate(cur)
},
DeleteFunc: e.onServiceDelete,
})
e.serviceLister = serviceInformer.Lister()
e.servicesSynced = serviceInformer.Informer().HasSynced
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: e.addPod,
UpdateFunc: e.updatePod,
DeleteFunc: e.deletePod,
})
e.podLister = podInformer.Lister()
e.podsSynced = podInformer.Informer().HasSynced
endpointsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
DeleteFunc: e.onEndpointsDelete,
})
e.endpointsLister = endpointsInformer.Lister()
e.endpointsSynced = endpointsInformer.Informer().HasSynced
e.triggerTimeTracker = endpointsliceutil.NewTriggerTimeTracker()
e.eventBroadcaster = broadcaster
e.eventRecorder = recorder
e.endpointUpdatesBatchPeriod = endpointUpdatesBatchPeriod
return e
}
```
### 启动控制器
根据控制器的初始化方法 NewEndpointController 的调用链路,可以找到控制器开始启动和执行的地方。
```go
// cmd/kube-controller-manager/app/core.go
func startEndpointsController(ctx context.Context, controllerContext ControllerContext, controllerName string) (controller.Interface, bool, error) {
go endpointcontroller.NewEndpointController(
ctx,
controllerContext.InformerFactory.Core().V1().Pods(),
controllerContext.InformerFactory.Core().V1().Services(),
controllerContext.InformerFactory.Core().V1().Endpoints(),
controllerContext.ClientBuilder.ClientOrDie("endpoint-controller"),
controllerContext.ComponentConfig.EndpointController.EndpointUpdatesBatchPeriod.Duration,
).Run(ctx, int(controllerContext.ComponentConfig.EndpointController.ConcurrentEndpointSyncs))
return nil, true, nil
}
```
### 具体逻辑方法
Controller.Run 方法执行控制器具体的启动逻辑。
```go
// Run will not return until stopCh is closed. workers determines how many
// endpoints will be handled in parallel.
func (e *Controller) Run(ctx context.Context, workers int) {
defer utilruntime.HandleCrash()
// Start events processing pipeline.
e.eventBroadcaster.StartStructuredLogging(3)
e.eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: e.client.CoreV1().Events("")})
defer e.eventBroadcaster.Shutdown()
defer e.queue.ShutDown()
logger := klog.FromContext(ctx)
logger.Info("Starting endpoint controller")
defer logger.Info("Shutting down endpoint controller")
if !cache.WaitForNamedCacheSync("endpoint", ctx.Done(), e.podsSynced, e.servicesSynced, e.endpointsSynced) {
return
}
for i := 0; i < workers; i++ {
go wait.UntilWithContext(ctx, e.worker, e.workerLoopPeriod)
}
go func() {
defer utilruntime.HandleCrash()
e.checkLeftoverEndpoints()
}()
<-ctx.Done()
}
```
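Run 中还异步调用了一次 checkLeftoverEndpoints:在控制器启动时,把集群里已存在的所有 Endpoints 对应的 key 全部入队一次,用于发现并清理那些 Service 已被删除、但 Endpoints 仍然残留的对象。其大致逻辑如下(示意,细节以所用版本源码为准):

```go
func (e *Controller) checkLeftoverEndpoints() {
	// 列出集群中当前所有的 Endpoints
	list, err := e.endpointsLister.List(labels.Everything())
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("unable to list endpoints: %v", err))
		return
	}
	for _, ep := range list {
		// 跳过带 leader 选举注解的 Endpoints,它们不由本控制器管理
		if _, ok := ep.Annotations[resourcelock.LeaderElectionRecordAnnotationKey]; ok {
			continue
		}
		key, err := controller.KeyFunc(ep)
		if err != nil {
			utilruntime.HandleError(fmt.Errorf("unable to get key for endpoint %#v", ep))
			continue
		}
		// 入队后由 syncService 决定是更新还是删除
		e.queue.Add(key)
	}
}
```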
e.worker 方法本质上就是一个无限循环轮询器,不断从队列中取出 Service 的 key ,然后执行对应的同步操作。
```go
// worker runs a worker thread that just dequeues items, processes them, and
// marks them done. You may run as many of these in parallel as you wish; the
// workqueue guarantees that they will not end up processing the same service
// at the same time.
func (e *Controller) worker(ctx context.Context) {
for e.processNextWorkItem(ctx) {
}
}
func (e *Controller) processNextWorkItem(ctx context.Context) bool {
eKey, quit := e.queue.Get()
if quit {
return false
}
defer e.queue.Done(eKey)
logger := klog.FromContext(ctx)
err := e.syncService(ctx, eKey)
e.handleErr(logger, err, eKey)
return true
}
```
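processNextWorkItem 中调用的 handleErr 负责失败重试:同步出错时按限速策略重新入队,超过最大重试次数后放弃。其逻辑大致如下(示意,非源码原文,maxRetries 为控制器内部定义的常量):

```go
func (e *Controller) handleErr(logger klog.Logger, err error, key string) {
	if err == nil {
		// 同步成功,清除该 key 的失败计数
		e.queue.Forget(key)
		return
	}

	ns, name, keyErr := cache.SplitMetaNamespaceKey(key)
	if keyErr != nil {
		logger.Error(err, "Failed to split meta namespace cache key", "key", key)
	}

	if e.queue.NumRequeues(key) < maxRetries {
		// 未超过最大重试次数,按限速(指数退避)重新入队
		logger.V(2).Info("Error syncing endpoints, retrying", "service", klog.KRef(ns, name), "err", err)
		e.queue.AddRateLimited(key)
		return
	}

	// 重试次数耗尽,放弃该 key
	logger.Info("Dropping service out of the queue", "service", klog.KRef(ns, name), "err", err)
	e.queue.Forget(key)
	utilruntime.HandleError(err)
}
```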
### syncService
Controller 的回调处理方法是 syncService 方法,该方法是 EndPoint 控制器操作的核心方法。通过方法的命名可以知道,EndPoint 控制器主要关注的对象是 Service。
```go
func (e *Controller) syncService(ctx context.Context, key string) error {
startTime := time.Now()
logger := klog.FromContext(ctx)
// 通过 key 解析出 Service 对象对应的 命名空间和名称
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
return err
}
defer func() {
logger.V(4).Info("Finished syncing service endpoints", "service", klog.KRef(namespace, name), "startTime", time.Since(startTime))
}()
// 获取 Service 对象
service, err := e.serviceLister.Services(namespace).Get(name)
if err != nil {
if !errors.IsNotFound(err) {
return err
}
// Delete the corresponding endpoint, as the service has been deleted.
// TODO: Please note that this will delete an endpoint when a
// service is deleted. However, if we're down at the time when
// the service is deleted, we will miss that deletion, so this
// doesn't completely solve the problem. See #6877.
err = e.client.CoreV1().Endpoints(namespace).Delete(ctx, name, metav1.DeleteOptions{})
if err != nil && !errors.IsNotFound(err) {
return err
}
e.triggerTimeTracker.DeleteService(namespace, name)
return nil
}
// Service 类型为 ExternalName
// 直接返回
if service.Spec.Type == v1.ServiceTypeExternalName {
// services with Type ExternalName receive no endpoints from this controller;
// Ref: https://issues.k8s.io/105986
return nil
}
// Service 的标签选择器为 nil
// 这种情况下关联不到 EndPoint 对象
// 直接返回
if service.Spec.Selector == nil {
// services without a selector receive no endpoints from this controller;
// these services will receive the endpoints that are created out-of-band via the REST API.
return nil
}
logger.V(5).Info("About to update endpoints for service", "service", klog.KRef(namespace, name))
// 获取 Service 的标签选择器关联的 Pod 列表
pods, err := e.podLister.Pods(service.Namespace).List(labels.Set(service.Spec.Selector).AsSelectorPreValidated())
if err != nil {
// Since we're getting stuff from a local cache, it is
// basically impossible to get this error.
return err
}
// We call ComputeEndpointLastChangeTriggerTime here to make sure that the
// state of the trigger time tracker gets updated even if the sync turns out
// to be no-op and we don't update the endpoints object.
endpointsLastChangeTriggerTime := e.triggerTimeTracker.
ComputeEndpointLastChangeTriggerTime(namespace, service, pods)
// 初始化端点集合对象
subsets := []v1.EndpointSubset{}
// 初始化已就绪的 EndPoint 对象计数
var totalReadyEps int
// 初始化未就绪的 EndPoint 对象计数
var totalNotReadyEps int
// 遍历 Pod 列表
for _, pod := range pods {
// ShouldPodBeInEndpoints 返回 false (即 Pod 不应加入 Endpoints) 的几种情况:
// 1. pod 处于终止状态 (phase == v1.PodFailed || phase == v1.PodSucceeded)
// 2. pod IP 还未分配
// 3. pod 正在被删除且 includeTerminating 为 false
if !endpointsliceutil.ShouldPodBeInEndpoints(pod, service.Spec.PublishNotReadyAddresses) {
logger.V(5).Info("Pod is not included on endpoints for Service", "pod", klog.KObj(pod), "service", klog.KObj(service))
continue
}
// 实例化一个 EndpointAddress 对象
ep, err := podToEndpointAddressForService(service, pod)
if err != nil {
// this will happen, if the cluster runs with some nodes configured as dual stack and some as not
// such as the case of an upgrade..
logger.V(2).Info("Failed to find endpoint for service with ClusterIP on pod with error", "service", klog.KObj(service), "clusterIP", service.Spec.ClusterIP, "pod", klog.KObj(pod), "error", err)
continue
}
epa := *ep
if endpointsliceutil.ShouldSetHostname(pod, service) {
epa.Hostname = pod.Spec.Hostname
}
// Allow headless service not to have ports.
if len(service.Spec.Ports) == 0 {
if service.Spec.ClusterIP == api.ClusterIPNone {
// 构建一个新的对象添加到 subset中,这里 ports 为空数组
subsets, totalReadyEps, totalNotReadyEps = addEndpointSubset(logger, subsets, pod, epa, nil, service.Spec.PublishNotReadyAddresses)
// No need to repack subsets for headless service without ports.
}
} else {
for i := range service.Spec.Ports {
servicePort := &service.Spec.Ports[i]
portNum, err := podutil.FindPort(pod, servicePort)
if err != nil {
logger.V(4).Info("Failed to find port for service", "service", klog.KObj(service), "error", err)
continue
}
// 根据 Service 端口对象 + 端口号构建一个对象
epp := endpointPortFromServicePort(servicePort, portNum)
var readyEps, notReadyEps int
// 将构建好的对象追加到端点集合里
subsets, readyEps, notReadyEps = addEndpointSubset(logger, subsets, pod, epa, epp, service.Spec.PublishNotReadyAddresses)
// 累加已就绪的 EndPoint 对象计数
totalReadyEps = totalReadyEps + readyEps
// 累加未就绪的 EndPoint 对象计数
totalNotReadyEps = totalNotReadyEps + notReadyEps
}
}
}
// 计算并确定最后的 EndPoint 对象集合 (新的 EndPoint Set)
subsets = endpoints.RepackSubsets(subsets)
// 通过 informer 获取 Service 对象对应的 EndPoint Set
// 也就是当前的 EndPoint Set (旧的 EndPoint Set)
// See if there's actually an update here.
currentEndpoints, err := e.endpointsLister.Endpoints(service.Namespace).Get(service.Name)
if err != nil {
if !errors.IsNotFound(err) {
return err
}
currentEndpoints = &v1.Endpoints{
ObjectMeta: metav1.ObjectMeta{
Name: service.Name,
Labels: service.Labels,
},
}
}
// 如果当前 Endpoints 对象的资源版本号为空 (说明该对象还不存在),就需要创建新的 Endpoints
createEndpoints := len(currentEndpoints.ResourceVersion) == 0
// Compare the sorted subsets and labels
// Remove the HeadlessService label from the endpoints if it exists,
// as this won't be set on the service itself
// and will cause a false negative in this diff check.
// But first check if it has that label to avoid expensive copies.
compareLabels := currentEndpoints.Labels
if _, ok := currentEndpoints.Labels[v1.IsHeadlessService]; ok {
compareLabels = utillabels.CloneAndRemoveLabel(currentEndpoints.Labels, v1.IsHeadlessService)
}
// When comparing the subsets, we ignore the difference in ResourceVersion of Pod to avoid unnecessary Endpoints
// updates caused by Pod updates that we don't care, e.g. annotation update.
// 对新的和旧的 Subsets 进行排序 + 比较操作
// 如果不需要创建新的 Endpoints 对象,并且新旧 Subsets、Labels、容量注解比较后没有任何差异
// 说明无需更新,直接返回即可
if !createEndpoints &&
endpointSubsetsEqualIgnoreResourceVersion(currentEndpoints.Subsets, subsets) &&
apiequality.Semantic.DeepEqual(compareLabels, service.Labels) &&
capacityAnnotationSetCorrectly(currentEndpoints.Annotations, currentEndpoints.Subsets) {
logger.V(5).Info("endpoints are equal, skipping update", "service", klog.KObj(service))
return nil
}
// 深度拷贝当前的 EndPoint Set
// 重新设置相关的 (最新) 属性
newEndpoints := currentEndpoints.DeepCopy()
newEndpoints.Subsets = subsets
newEndpoints.Labels = service.Labels
if newEndpoints.Annotations == nil {
newEndpoints.Annotations = make(map[string]string)
}
if !endpointsLastChangeTriggerTime.IsZero() {
newEndpoints.Annotations[v1.EndpointsLastChangeTriggerTime] =
endpointsLastChangeTriggerTime.UTC().Format(time.RFC3339Nano)
} else { // No new trigger time, clear the annotation.
delete(newEndpoints.Annotations, v1.EndpointsLastChangeTriggerTime)
}
if truncateEndpoints(newEndpoints) {
newEndpoints.Annotations[v1.EndpointsOverCapacity] = truncated
} else {
delete(newEndpoints.Annotations, v1.EndpointsOverCapacity)
}
if newEndpoints.Labels == nil {
newEndpoints.Labels = make(map[string]string)
}
if !helper.IsServiceIPSet(service) {
newEndpoints.Labels = utillabels.CloneAndAddLabel(newEndpoints.Labels, v1.IsHeadlessService, "")
} else {
newEndpoints.Labels = utillabels.CloneAndRemoveLabel(newEndpoints.Labels, v1.IsHeadlessService)
}
logger.V(4).Info("Update endpoints", "service", klog.KObj(service), "readyEndpoints", totalReadyEps, "notreadyEndpoints", totalNotReadyEps)
if createEndpoints {
// No previous endpoints, create them
// 创建新的 EndPoints
_, err = e.client.CoreV1().Endpoints(service.Namespace).Create(ctx, newEndpoints, metav1.CreateOptions{})
} else {
// Pre-existing
// 更新已有 EndPoints
_, err = e.client.CoreV1().Endpoints(service.Namespace).Update(ctx, newEndpoints, metav1.UpdateOptions{})
}
if err != nil {
if createEndpoints && errors.IsForbidden(err) {
// A request is forbidden primarily for two reasons:
// 1. namespace is terminating, endpoint creation is not allowed by default.
// 2. policy is misconfigured, in which case no service would function anywhere.
// Given the frequency of 1, we log at a lower level.
logger.V(5).Info("Forbidden from creating endpoints", "error", err)
// If the namespace is terminating, creates will continue to fail. Simply drop the item.
if errors.HasStatusCause(err, v1.NamespaceTerminatingCause) {
return nil
}
}
if createEndpoints {
e.eventRecorder.Eventf(newEndpoints, v1.EventTypeWarning, "FailedToCreateEndpoint", "Failed to create endpoint for service %v/%v: %v", service.Namespace, service.Name, err)
} else {
e.eventRecorder.Eventf(newEndpoints, v1.EventTypeWarning, "FailedToUpdateEndpoint", "Failed to update endpoint %v/%v: %v", service.Namespace, service.Name, err)
}
return err
}
return nil
}
```
通过 `Controller.syncService` 方法的源代码,我们可以看到,Endpoints 控制器每次同步 Service 时,都会执行如下操作:
1. 根据参数 key 获取指定的 Service 对象(key 的格式与解析见下方示例)
2. 获取 Service 对象的标签选择器关联的 Pod 列表
3. 通过 Service 和 Pod 列表计算出最新的 EndPoint 对象 (新) 集合
4. 通过 informer 获取 Service 对象对应的 EndPoint 对象 (旧) 集合
5. 如果新集合与旧集合对比,没有任何差异,说明不需要更新,直接退出方法即可
6. 根据 Service 资源版本号确定 EndPoints 对象的操作 (创建或更新) 并执行
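第 1 步里的 key 就是入队时由 <namespace>/<name> 拼出来的字符串,可以用下面这个独立的小示例体会它的解析方式:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

func main() {
	// workqueue 中的 key 形如 "<namespace>/<name>"
	namespace, name, err := cache.SplitMetaNamespaceKey("default/nginx")
	fmt.Println(namespace, name, err) // 输出: default nginx <nil>
}
```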
# 4.11 endpointSlice controller
EndpointSlice 是什么?相比于我们熟知的 endpoint ,有什么区别?
这里我们可以查看官方文档:
[https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/0752-endpointslices](https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/0752-endpointslices)
**使用 Endpoints API 时,一个服务只有一个 Endpoints 资源**。这意味着它需要为支撑该服务的每个 Pod 都存储 IP 地址和端口(网络端点),导致这个 API 对象可能非常庞大。让问题更严重的是,kube-proxy 运行在每个节点上,并且会监视 Endpoints 资源的任何更新:只要 Endpoints 资源中哪怕只有一个网络端点发生了更改,整个对象就必须重新发送给每一个 kube-proxy 实例。
Endpoints API的另一个限制是它限制了可以为服务跟踪的网络端点的数量。**存储在etcd中的对象的默认大小限制为1.5MB。在某些情况下,可能会将Endpoints资源限制为5,000个Pod IP。**对于大多数没有超过5000个pod的用户而言,这不是问题,但是对于服务接近此大小的用户而言,这将成为一个重大问题。
为了说明这些问题的严重程度,举一个简单的例子。考虑一个具有 5,000 个 Pod 的服务,它的 Endpoints 资源可能达到 1.5MB。如果该列表中哪怕只有一个网络端点发生了更改,完整的 Endpoints 资源就需要分发到集群中的每个节点。在具有 3,000 个节点的大型集群中,这会成为一个很大的问题:每次更新都要跨集群发送 4.5GB 数据(1.5MB Endpoints × 3,000 个节点),这几乎足以装满一张 DVD,而且每次端点变更都会发生一次。想象一下,如果一次滚动更新导致全部 5,000 个 Pod 都被替换,那么传输的数据量将超过 22TB(相当于约 5,000 张 DVD 的容量)。
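按上面的假设粗略估算一下:单次变更需要分发 1.5MB × 3,000 节点 = 4.5GB;滚动更新会依次替换全部 5,000 个 Pod,至少触发 5,000 次变更,总传输量约为 4.5GB × 5,000 ≈ 22.5TB。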
## 使用EndpointSlice API拆分端点
EndpointSlice API旨在通过类似于分片的方法来解决此问题。我们没有使用单个Endpoints资源跟踪服务的所有Pod IP,而是将它们拆分为多个较小的EndpointSlice。
考虑一个示例:某个服务由 15 个 Pod 支撑,对应地会有一个跟踪所有端点的 Endpoints 资源。如果将 EndpointSlice 配置为每个最多存储 5 个端点,最终将得到 3 个不同的 EndpointSlice。
默认情况下,每个 EndpointSlice 最多存储 100 个端点,可以通过 kube-controller-manager 的 --max-endpoints-per-slice 标志进行配置。
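下面用一段独立的小示例体会这种"按上限分片"的思路(仅作演示,splitIntoSlices 是虚构的函数,真实的 reconciler 还要考虑复用已有 slice、按端口分组等):

```go
package main

import "fmt"

// 把 N 个端点按 maxPerSlice 切分成若干组,模拟 EndpointSlice 的分片方式
func splitIntoSlices(endpoints []string, maxPerSlice int) [][]string {
	var slices [][]string
	for start := 0; start < len(endpoints); start += maxPerSlice {
		end := start + maxPerSlice
		if end > len(endpoints) {
			end = len(endpoints)
		}
		slices = append(slices, endpoints[start:end])
	}
	return slices
}

func main() {
	// 对应上文的例子:15 个端点、每个 slice 最多 5 个
	eps := make([]string, 15)
	for i := range eps {
		eps[i] = fmt.Sprintf("10.0.0.%d", i+1)
	}
	for i, s := range splitIntoSlices(eps, 5) {
		fmt.Printf("slice-%d: %d 个端点\n", i, len(s))
	}
	// 输出: slice-0、slice-1、slice-2 各 5 个端点
}
```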
### 入口函数
入口函数位于 cmd/kube-controller-manager/app/discovery.go
```go
func startEndpointSliceController(ctx context.Context, controllerContext ControllerContext, controllerName string) (controller.Interface, bool, error) {
go endpointslicecontroller.NewController(
ctx,
controllerContext.InformerFactory.Core().V1().Pods(),
controllerContext.InformerFactory.Core().V1().Services(),
controllerContext.InformerFactory.Core().V1().Nodes(),
controllerContext.InformerFactory.Discovery().V1().EndpointSlices(),
controllerContext.ComponentConfig.EndpointSliceController.MaxEndpointsPerSlice,
controllerContext.ClientBuilder.ClientOrDie("endpointslice-controller"),
controllerContext.ComponentConfig.EndpointSliceController.EndpointUpdatesBatchPeriod.Duration,
).Run(ctx, int(controllerContext.ComponentConfig.EndpointSliceController.ConcurrentServiceEndpointSyncs))
return nil, true, nil
}
```
### 构造函数
* maxEndpointsPerSlice 每组切片的最大 endpoint 数量。
* triggerTimeTracker 计算 service 和 pods 最后一次更新的时间并存到缓存,返回两者中较新的那个时间。
* reconciler 控制器的核心逻辑所在
* features.TopologyAwareHints 是否开启拓扑感知提示特性(就近路由)。比如节点 A、B 属于同一区域,C、D 属于另一个区域,pod 在 A、B、C、D 节点上各有一个,查看 A、B 节点上的 ipvs 规则会发现,通往该 service 的流量的 ipvs 后端只有 A、B 节点上的 pod IP,C、D 同理。可以参考这篇文章,讲得很直白:[Kubernetes Service 开启拓扑感知(就近访问)能力](https://blog.csdn.net/shida_csdn/article/details/124285905)。
```go
// NewController creates and initializes a new Controller
func NewController(ctx context.Context, podInformer coreinformers.PodInformer,
serviceInformer coreinformers.ServiceInformer,
nodeInformer coreinformers.NodeInformer,
endpointSliceInformer discoveryinformers.EndpointSliceInformer,
maxEndpointsPerSlice int32,
client clientset.Interface,
endpointUpdatesBatchPeriod time.Duration,
) *Controller {
broadcaster := record.NewBroadcaster(record.WithContext(ctx))
recorder := broadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "endpoint-slice-controller"})
endpointslicemetrics.RegisterMetrics()
c := &Controller{
client: client,
// This is similar to the DefaultControllerRateLimiter, just with a
// significantly higher default backoff (1s vs 5ms). This controller
// processes events that can require significant EndpointSlice changes,
// such as an update to a Service or Deployment. A more significant
// rate limit back off here helps ensure that the Controller does not
// overwhelm the API Server.
queue: workqueue.NewTypedRateLimitingQueueWithConfig(
workqueue.NewTypedMaxOfRateLimiter(
workqueue.NewTypedItemExponentialFailureRateLimiter[string](defaultSyncBackOff, maxSyncBackOff),
// 10 qps, 100 bucket size. This is only for retry speed and its
// only the overall factor (not per item).
&workqueue.TypedBucketRateLimiter[string]{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
),
workqueue.TypedRateLimitingQueueConfig[string]{
Name: "endpoint_slice",
},
),
workerLoopPeriod: time.Second,
}
serviceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: c.onServiceUpdate,
UpdateFunc: func(old, cur interface{}) {
c.onServiceUpdate(cur)
},
DeleteFunc: c.onServiceDelete,
})
c.serviceLister = serviceInformer.Lister()
c.servicesSynced = serviceInformer.Informer().HasSynced
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: c.addPod,
UpdateFunc: c.updatePod,
DeleteFunc: c.deletePod,
})
c.podLister = podInformer.Lister()
c.podsSynced = podInformer.Informer().HasSynced
c.nodeLister = nodeInformer.Lister()
c.nodesSynced = nodeInformer.Informer().HasSynced
logger := klog.FromContext(ctx)
endpointSliceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: c.onEndpointSliceAdd,
UpdateFunc: func(oldObj, newObj interface{}) {
c.onEndpointSliceUpdate(logger, oldObj, newObj)
},
DeleteFunc: c.onEndpointSliceDelete,
})
c.endpointSliceLister = endpointSliceInformer.Lister()
c.endpointSlicesSynced = endpointSliceInformer.Informer().HasSynced
c.endpointSliceTracker = endpointsliceutil.NewEndpointSliceTracker()
c.maxEndpointsPerSlice = maxEndpointsPerSlice
c.triggerTimeTracker = endpointsliceutil.NewTriggerTimeTracker()
c.eventBroadcaster = broadcaster
c.eventRecorder = recorder
c.endpointUpdatesBatchPeriod = endpointUpdatesBatchPeriod
if utilfeature.DefaultFeatureGate.Enabled(features.TopologyAwareHints) {
nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
c.addNode(logger, obj)
},
UpdateFunc: func(oldObj, newObj interface{}) {
c.updateNode(logger, oldObj, newObj)
},
DeleteFunc: func(obj interface{}) {
c.deleteNode(logger, obj)
},
})
c.topologyCache = topologycache.NewTopologyCache()
}
c.reconciler = endpointslicerec.NewReconciler(
c.client,
c.nodeLister,
c.maxEndpointsPerSlice,
c.endpointSliceTracker,
c.topologyCache,
c.eventRecorder,
controllerName,
endpointslicerec.WithTrafficDistributionEnabled(utilfeature.DefaultFeatureGate.Enabled(features.ServiceTrafficDistribution)),
)
return c
}
```
### 监听
> 监听 service pod node endpointSlice 对象。
#### service 对象
* AddFunc
onServiceUpdate 缓存 service Selector ,并加入令牌桶队列。
* UpdateFunc
onServiceUpdate 缓存 service Selector ,并加入令牌桶队列。
* DeleteFunc
onServiceDelete 删除缓存的 service Selector ,并加入令牌桶队列。
#### pod 对象
* AddFunc
addPod
根据 pod 反查出它所属的 service ,并把对应的 service key 延迟加入队列(示意代码见本小节末尾)。
* UpdateFunc
updatePod 同上。
* DeleteFunc
deletePod
如果 pod 对象不为 nil ,调用 addPod 事件函数处理。
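上面三个 pod 事件最终都汇聚到 addPod 的处理逻辑,大致实现如下(示意,非源码原文;GetPodServiceMemberships 的返回类型以及队列字段名在不同版本中略有差异):

```go
// 示意:pod 事件被转换成它所属 service 的 key,延迟一段时间后入队
func (c *Controller) addPod(obj interface{}) {
	pod := obj.(*v1.Pod)
	// 找出 label selector 能匹配到该 pod 的所有 service,返回它们的 key 集合
	services, err := endpointsliceutil.GetPodServiceMemberships(c.serviceLister, pod)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("unable to get pod %s/%s's service memberships: %v", pod.Namespace, pod.Name, err))
		return
	}
	for key := range services {
		// 延迟 endpointUpdatesBatchPeriod 再入队,把同一批 pod 变更合并成一次同步
		c.serviceQueue.AddAfter(key, c.endpointUpdatesBatchPeriod)
	}
}
```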
#### node 对象
只有启用了 TopologyAwareHints 特性,才有对应的监听事件。
* addNode
调用 c.checkNodeTopologyDistribution() 检查节点拓扑分布情况。
* updateNode
检查节点状态,调用 c.checkNodeTopologyDistribution() 检查节点拓扑分布情况。
* deleteNode
调用 c.checkNodeTopologyDistribution() 检查节点拓扑分布情况。
#### endpointSlice 对象
* AddFunc
onEndpointSliceAdd
调用 c.queueServiceForEndpointSlice() 接口,获取 service 唯一 key ,并计算更新延迟,按照延迟时间加入到延迟队列。
* UpdateFunc
onEndpointSliceUpdate
最终调用 c.queueServiceForEndpointSlice() 接口,获取 service 唯一 key ,并计算更新延迟,按照延迟时间加入到延迟队列。
* DeleteFunc
onEndpointSliceDelete
判断该对象是否本来就预期要被删除,如果不是,则调用 c.queueServiceForEndpointSlice() 接口,获取 service 唯一 key ,并计算更新延迟,按照延迟时间加入到延迟队列(延迟的计算方式见下方示意)。
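上面反复出现的"计算更新延迟再入队",大致逻辑如下(示意,非源码原文;endpointSliceChangeMinSyncDelay 为控制器内部常量,队列字段名在不同版本中可能是 queue 或 serviceQueue):

```go
// 示意:为 EndpointSlice 事件对应的 Service 计算延迟后入队,
// 通过 AddAfter 把短时间内的多次变更合并成一次同步,减少写 API 的次数。
func (c *Controller) queueServiceForEndpointSlice(endpointSlice *discovery.EndpointSlice) {
	key, err := endpointslicerec.ServiceControllerKey(endpointSlice)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("couldn't get key for EndpointSlice %+v: %v", endpointSlice, err))
		return
	}

	// 取 endpointUpdatesBatchPeriod 与内置最小同步延迟中的较大者作为延迟时间
	delay := endpointSliceChangeMinSyncDelay
	if c.endpointUpdatesBatchPeriod > delay {
		delay = c.endpointUpdatesBatchPeriod
	}
	c.serviceQueue.AddAfter(key, delay)
}
```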
### syncService
核心逻辑入口 syncService ,实际最终调用的是 r.finalize() 函数。
```go
// serviceQueueWorker runs a worker thread that just dequeues items, processes
// them, and marks them done. You may run as many of these in parallel as you
// wish; the workqueue guarantees that they will not end up processing the same
// service at the same time
func (c *Controller) serviceQueueWorker(logger klog.Logger) {
for c.processNextServiceWorkItem(logger) {
}
}
func (c *Controller) processNextServiceWorkItem(logger klog.Logger) bool {
cKey, quit := c.serviceQueue.Get()
if quit {
return false
}
defer c.serviceQueue.Done(cKey)
err := c.syncService(logger, cKey)
c.handleErr(logger, err, cKey)
return true
}
```
#### syncService
* 获取 service 对象。
* 根据 service 的标签选择器获取 pods(这里获取到的 pods 就是后续计算 slicesToCreate 的依据)。
* 根据 service 命名空间和标签获取 apiserver 已有的所有关联的 endpointSlices 。
* 过滤掉被标记为删除的 endpointSlice 。
* 实际最终调用 c.reconciler.reconcile() 。
```go
func (c *Controller) syncService(logger klog.Logger, key string) error {
startTime := time.Now()
defer func() {
logger.V(4).Info("Finished syncing service endpoint slices", "key", key, "elapsedTime", time.Since(startTime))
}()
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
return err
}
service, err := c.serviceLister.Services(namespace).Get(name)
if err != nil {
if !apierrors.IsNotFound(err) {
return err
}
c.triggerTimeTracker.DeleteService(namespace, name)
c.reconciler.DeleteService(namespace, name)
c.endpointSliceTracker.DeleteService(namespace, name)
// The service has been deleted, return nil so that it won't be retried.
return nil
}
if service.Spec.Type == v1.ServiceTypeExternalName {
// services with Type ExternalName receive no endpoints from this controller;
// Ref: https://issues.k8s.io/105986
return nil
}
if service.Spec.Selector == nil {
// services without a selector receive no endpoint slices from this controller;
// these services will receive endpoint slices that are created out-of-band via the REST API.
return nil
}
logger.V(5).Info("About to update endpoint slices for service", "key", key)
podLabelSelector := labels.Set(service.Spec.Selector).AsSelectorPreValidated()
pods, err := c.podLister.Pods(service.Namespace).List(podLabelSelector)
if err != nil {
// Since we're getting stuff from a local cache, it is basically
// impossible to get this error.
c.eventRecorder.Eventf(service, v1.EventTypeWarning, "FailedToListPods",
"Error listing Pods for Service %s/%s: %v", service.Namespace, service.Name, err)
return err
}
esLabelSelector := labels.Set(map[string]string{
discovery.LabelServiceName: service.Name,
discovery.LabelManagedBy: c.reconciler.GetControllerName(),
}).AsSelectorPreValidated()
endpointSlices, err := c.endpointSliceLister.EndpointSlices(service.Namespace).List(esLabelSelector)
if err != nil {
// Since we're getting stuff from a local cache, it is basically
// impossible to get this error.
c.eventRecorder.Eventf(service, v1.EventTypeWarning, "FailedToListEndpointSlices",
"Error listing Endpoint Slices for Service %s/%s: %v", service.Namespace, service.Name, err)
return err
}
// Drop EndpointSlices that have been marked for deletion to prevent the controller from getting stuck.
endpointSlices = dropEndpointSlicesPendingDeletion(endpointSlices)
if c.endpointSliceTracker.StaleSlices(service, endpointSlices) {
return endpointslicepkg.NewStaleInformerCache("EndpointSlice informer cache is out of date")
}
// We call ComputeEndpointLastChangeTriggerTime here to make sure that the
// state of the trigger time tracker gets updated even if the sync turns out
// to be no-op and we don't update the EndpointSlice objects.
lastChangeTriggerTime := c.triggerTimeTracker.
ComputeEndpointLastChangeTriggerTime(namespace, service, pods)
err = c.reconciler.Reconcile(logger, service, pods, endpointSlices, lastChangeTriggerTime)
if err != nil {
c.eventRecorder.Eventf(service, v1.EventTypeWarning, "FailedToUpdateEndpointSlices",
"Error updating Endpoint Slices for Service %s/%s: %v", service.Namespace, service.Name, err)
return err
}
return nil
}
```
### reconcile
**c.reconciler.Reconcile()**
声明了一个切片 slicesToDelete 和一个按地址类型分组的 map slicesByAddressType :
1. 检查 endpointSlice 的 AddressType ,不再被 service 支持的地址类型加入 slicesToDelete 等待删除,支持的加入 slicesByAddressType 。
2. 不同地址类型的 endpointSlice 都会调用 r.reconcileByAddressType() 函数去调谐,传的参数里面就包含了地址类型。
```go
// Reconcile takes a set of pods currently matching a service selector and
// compares them with the endpoints already present in any existing endpoint
// slices for the given service. It creates, updates, or deletes endpoint slices
// to ensure the desired set of pods are represented by endpoint slices.
func (r *Reconciler) Reconcile(logger klog.Logger, service *corev1.Service, pods []*corev1.Pod, existingSlices []*discovery.EndpointSlice, triggerTime time.Time) error {
slicesToDelete := []*discovery.EndpointSlice{} // slices that are no longer matching any address the service has
errs := []error{} // all errors generated in the process of reconciling
slicesByAddressType := make(map[discovery.AddressType][]*discovery.EndpointSlice) // slices by address type
// addresses that this service supports [o(1) find]
serviceSupportedAddressesTypes := getAddressTypesForService(logger, service)
// loop through slices identifying their address type.
// slices that no longer match address type supported by services
// go to delete, other slices goes to the Reconciler machinery
// for further adjustment
for _, existingSlice := range existingSlices {
// service no longer supports that address type, add it to deleted slices
if !serviceSupportedAddressesTypes.Has(existingSlice.AddressType) {
if r.topologyCache != nil {
svcKey, err := ServiceControllerKey(existingSlice)
if err != nil {
logger.Info("Couldn't get key to remove EndpointSlice from topology cache", "existingSlice", existingSlice, "err", err)
} else {
r.topologyCache.RemoveHints(svcKey, existingSlice.AddressType)
}
}
slicesToDelete = append(slicesToDelete, existingSlice)
continue
}
// add list if it is not on our map
if _, ok := slicesByAddressType[existingSlice.AddressType]; !ok {
slicesByAddressType[existingSlice.AddressType] = make([]*discovery.EndpointSlice, 0, 1)
}
slicesByAddressType[existingSlice.AddressType] = append(slicesByAddressType[existingSlice.AddressType], existingSlice)
}
// reconcile for existing.
for addressType := range serviceSupportedAddressesTypes {
existingSlices := slicesByAddressType[addressType]
err := r.reconcileByAddressType(logger, service, pods, existingSlices, triggerTime, addressType)
if err != nil {
errs = append(errs, err)
}
}
// delete those which are of addressType that is no longer supported
// by the service
for _, sliceToDelete := range slicesToDelete {
err := r.client.DiscoveryV1().EndpointSlices(service.Namespace).Delete(context.TODO(), sliceToDelete.Name, metav1.DeleteOptions{})
if err != nil {
errs = append(errs, fmt.Errorf("error deleting %s EndpointSlice for Service %s/%s: %w", sliceToDelete.Name, service.Namespace, service.Name, err))
} else {
r.endpointSliceTracker.ExpectDeletion(sliceToDelete)
metrics.EndpointSliceChanges.WithLabelValues("delete").Inc()
}
}
return utilerrors.NewAggregate(errs)
}
```
**r.reconcileByAddressType()**
1. 声明三个数组: slicesToCreate 、 slicesToUpdate 、 slicesToDelete 。
2. 构建一个用于存放 endpointSlice 现有状态的 map existingSlicesByPortMap 。
3. 构建一个用于存放 endpointSlice 期望状态的 map desiredEndpointsByPortMap 。
4. 确定每组 endpointSlice 是否需要更新,调用 r.reconcileByPortMapping() 计算需要更新的 endpointSlice ,并返回 slicesToCreate, slicesToUpdate, slicesToDelete, numAdded, numRemoved 对象(计算过程遍历每个 slice 并填满至设定好的 endpoint 个数,默认 100 个,总长度不满 100 的单独一个 slice )给 r.finalize() 函数处理。
5. 调用 r.finalize() 创建、更新或删除指定的 endpointSlice 对象。
**r.finalize()**
1. 当同时有需要删除和新增的 slice 时,会优先把待删除 slice 的名字复用给待新增的 slice ,把"新增 + 删除"合并成一次"更新"再执行(目的是减少 API 请求开销。比如要新增 A、B、C 三个、删除 D、E 两个,会把其中两个待新增的 slice 分别改名为 D 和 E ,以更新的方式写回,见下方示意)。
2. 之后依次执行新增,更新和删除 slices 。
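下面用一段示意代码(非源码原文,省略了 addressType 与 ownership 等校验)表达这个"改名复用"的过程:

```go
// 示意:把待创建的 slice 复用待删除 slice 的名字,将 create+delete 合并为一次 update
for len(slicesToCreate) > 0 && len(slicesToDelete) > 0 {
	toDelete := slicesToDelete[0]
	toCreate := slicesToCreate[len(slicesToCreate)-1]

	// 复用旧对象的名字,对 API Server 来说这只是一次 Update
	toCreate.Name = toDelete.Name
	slicesToUpdate = append(slicesToUpdate, toCreate)

	slicesToCreate = slicesToCreate[:len(slicesToCreate)-1]
	slicesToDelete = slicesToDelete[1:]
}
// 剩余的 slicesToCreate 执行 Create,slicesToUpdate 执行 Update,slicesToDelete 执行 Delete
```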
### 总结
1. 总的来说,跟其他的控制器的逻辑是差不多的,都是先监听相关资源的事件,然后调谐。
2. 从上面的代码我们也不难看出,endpointslice 有个特点就是,默认情况下,每个 slice 都是满 100 个条目就 new 一个新的切片,把每个切片的容量都控制在 100 个条目以内。
3. 我们看完 endpointslice ,该控制器具有新增,更新和删除 slices 的功能,但是我们还发现源码里头还有 endpointslicemirroring 控制器。
4. endpointslicemirroring:在某些场合,应用会创建定制的 Endpoints 资源。为了保证这些应用不需要并发的更改 Endpoints 和 EndpointSlice 资源,集群的控制面将大多数 Endpoints 映射到对应的 EndpointSlice 之上。
控制面对 Endpoints 资源进行映射的例外情况有:
* Endpoints 资源上标签 endpointslice.kubernetes.io/skip-mirror 值为 true。
* Endpoints 资源包含注解 control-plane.alpha.kubernetes.io/leader。
* 对应的 Service 资源不存在。
* 对应的 Service 的选择算符不为空。
5. endpointslicemirroring 控制器我们等有时间再看看,我们先看看其他组件。