Kubernetes introduced the Topology Manager as an alpha feature in v1.16 (https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0035-20190130-topology-manager.md). There is very little analysis of this module online, almost none. Having previously studied NUMA-aware resource allocation, I use this post, together with the official Kubernetes documentation, to look at how Kubernetes handles the NUMA problem. If the analysis is wrong or falls short anywhere, corrections are welcome. All the code discussed lives under pkg/kubelet/cm/topologymanager.
Background: as the number of processors and hardware accelerators inside a single server node keeps growing, optimizing the performance of latency-sensitive, high-performance parallel applications has become a major challenge. In Kubernetes, the topology manager is the component that optimizes the performance of containerized applications on such servers. It is part of the kubelet and acts as a source of low-level topology information for the other components, so that resource allocations can be aligned with the hardware topology. Note that the topology manager does not act alone: it works together with the CPU manager and the device manager to allocate resources to pods and optimize resource access.
How the Topology Manager works:
The official Kubernetes documentation explains the Topology Manager as follows: the topology manager receives NUMA topology information (topology hints) from hint providers in the form of bitmasks, identifying the currently available NUMA nodes and a preference for the resource allocation. The topology manager's policy performs a set of operations on the provided hints and converges on a single best hint. Depending on the configured policy, the node can then accept or reject the pod.
To understand this explanation clearly we need to answer a few questions: what exactly are hint providers and topology hints? How does the topology manager merge the hints into a best hint? And how is the topology manager invoked? (NUMA is a common system architecture, so some computer-architecture background is assumed and not covered here.)
The first question, what hint providers and topology hints are, is the foundation for understanding the topology manager.
A series of structs and interfaces are involved here. First, the definition of HintProvider: a HintProvider is implemented by components that follow NUMA locality and want to achieve globally optimal resource alignment. It exposes a single method:
type HintProvider interface {
	// GetTopologyHints returns a map of resource names to a list of possible
	// concrete resource allocations in terms of NUMA locality hints. Each hint
	// is optionally marked "preferred" and indicates the set of NUMA nodes
	// involved in the hypothetical allocation. The topology manager calls
	// this function for each hint provider, and merges the hints to produce
	// a consensus "best" hint. The hint providers may subsequently query the
	// topology manager to influence actual resource assignment.
	GetTopologyHints(pod v1.Pod, container v1.Container) map[string][]TopologyHint
}
The method returns a map from resource names to lists of possible resource allocations expressed as NUMA locality hints. The topology manager calls this method on every hint provider for each container of a pod, then merges the returned hints to produce a consensus best hint.
A topology hint is the basic piece of hint information supplied by a hint provider (the CPU manager, the device manager, and so on). Its struct definition is:
type TopologyHint struct {
	NUMANodeAffinity bitmask.BitMask
	// Preferred is set to true when the NUMANodeAffinity encodes a preferred
	// allocation for the Container. It is set to false otherwise.
	Preferred bool
}
TopologyHint has two fields: NUMANodeAffinity, of type bitmask.BitMask, encoding the affinity to NUMA nodes; and Preferred, a boolean marking whether that affinity represents a preferred allocation.
NUMANodeAffinity describes which NUMA nodes the container's threads would rather run on. Bitmask representations appear in many applications; here is a simple example. Suppose the host has two NUMA nodes (node 0 and node 1). If a container is affine to node 0 but not node 1, the bitmask is 00000001; if it is affine to node 1 but not node 0, the bitmask is 00000010; if it is affine to both, the bitmask is 00000011. Preferred qualifies the affinity: when it is true, NUMANodeAffinity encodes a preferred allocation for the container; when it is false, the affinity is one the provider can satisfy but does not prefer.
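The bitmask arithmetic above is easy to reproduce outside the kubelet. The sketch below uses a hypothetical `affinityMask` helper over plain `uint8` bit operations (the real implementation lives in pkg/kubelet/cm/topologymanager/bitmask and uses a 64-bit mask), building the same masks as the two-node example:

```go
package main

import (
	"fmt"
	"math/bits"
)

// affinityMask mimics the idea behind the topologymanager bitmask:
// bit i set means the container is affine to NUMA node i.
// (Hypothetical helper, not the kubelet's bitmask package.)
func affinityMask(nodes ...int) uint8 {
	var m uint8
	for _, n := range nodes {
		m |= 1 << uint(n)
	}
	return m
}

func main() {
	fmt.Printf("%08b\n", affinityMask(0))    // 00000001: affine to node 0 only
	fmt.Printf("%08b\n", affinityMask(1))    // 00000010: affine to node 1 only
	fmt.Printf("%08b\n", affinityMask(0, 1)) // 00000011: affine to both nodes
	// Count() in the real bitmask package is a popcount like this:
	fmt.Println(bits.OnesCount8(affinityMask(0, 1))) // 2
}
```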
The second question concerns the topology manager's core logic.
After obtaining the topology hints, the manager accumulates the hints for each of the pod's containers, computes an affinity from the accumulated hints, and the policy then uses that affinity to decide how resources are allocated.
When the manager is created, there are two main parameters:
// NewManager creates a new TopologyManager based on provided policy
func NewManager(numaNodeInfo cputopology.NUMANodeInfo, topologyPolicyName string) (Manager, error) {
}
One is numaNodeInfo, obtained from the CPU manager; the other is the topology policy to apply, topologyPolicyName. The topology manager currently supports four policies: none, best-effort, restricted, and single-numa-node.
none is the default policy; it performs no NUMA-aware resource alignment at all.
best-effort calls the hint providers of every container in the pod to discover their resource availability. The topology manager computes and stores the NUMA affinity for these containers; even if the affinity cannot be satisfied, the manager still stores the (non-preferred) affinity and still admits the pod to the node.
restricted likewise computes the NUMA affinity for every container in the pod. The difference is that if the affinity cannot be satisfied, the node rejects the pod, which leaves the pod in a Terminated state. If the pod is admitted, the affinity information is used to allocate the containers' resources.
single-numa-node checks whether an affinity to a single NUMA node can be satisfied. If it can, the hint providers use that affinity to allocate resources; if not, the node rejects the pod, again leaving it in a Terminated state.
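The differences between the four policies can be summarized as a small decision function. This is a hedged sketch, not the kubelet's code: it assumes the merged hint has been reduced to its Preferred flag plus the number of NUMA nodes set in its affinity mask:

```go
package main

import "fmt"

// admit sketches (not the real kubelet logic) how each topology policy
// turns the merged hint into an admit decision. preferred says whether
// the merged affinity could be satisfied as a preferred allocation;
// nodeCount is the number of NUMA nodes in the merged affinity mask.
func admit(policy string, preferred bool, nodeCount int) bool {
	switch policy {
	case "none":
		return true // no alignment enforced at all
	case "best-effort":
		return true // affinity is recorded, pod admitted regardless
	case "restricted":
		return preferred // reject if the affinity cannot be satisfied
	case "single-numa-node":
		return preferred && nodeCount == 1 // must fit on one NUMA node
	}
	return false
}

func main() {
	// A non-preferred merged hint spanning 2 NUMA nodes:
	for _, p := range []string{"none", "best-effort", "restricted", "single-numa-node"} {
		fmt.Println(p, admit(p, false, 2))
	}
}
```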
Three of these four policies use the NUMA node affinity as the basis for resource allocation. So how is that affinity computed?
1. First, for each container of the pod, the manager iterates over all hint providers and collects their topology hints via GetTopologyHints, producing a slice of maps, []map[string][]TopologyHint: one map per provider, keyed by resource name:
func (m *manager) accumulateProvidersHints(pod v1.Pod, container v1.Container) (providersHints []map[string][]TopologyHint) {
	// Loop through all hint providers and save an accumulated list of the
	// hints returned by each hint provider.
	for _, provider := range m.hintProviders {
		// Get the TopologyHints from a provider.
		hints := provider.GetTopologyHints(pod, container)
		providersHints = append(providersHints, hints)
		klog.Infof("[topologymanager] TopologyHints for pod '%v', container '%v': %v", pod.Name, container.Name, hints)
	}
	return providersHints
}
2. With all the hints collected, the manager computes the container's NUMA affinity. calculateAffinity first calls accumulateProvidersHints to gather the hints for the container from every provider, then computes the best hint and the admit decision via Merge. The key here is the Merge function: every policy implements its own version, and the heart of each is mergeProvidersHints, which derives the best hint:
func (m *manager) calculateAffinity(pod v1.Pod, container v1.Container) (TopologyHint, lifecycle.PodAdmitResult) {
	providersHints := m.accumulateProvidersHints(pod, container)
	bestHint, admit := m.policy.Merge(providersHints)
	klog.Infof("[topologymanager] ContainerTopologyHint: %v", bestHint)
	return bestHint, admit
}
Merge is implemented separately by each policy:
/topologymanager/policy_none.go
func (p *nonePolicy) Merge(providersHints []map[string][]TopologyHint) (TopologyHint, lifecycle.PodAdmitResult) {
	return TopologyHint{}, p.canAdmitPodResult(nil)
}
/topologymanager/policy_best_effort.go
func (p *bestEffortPolicy) Merge(providersHints []map[string][]TopologyHint) (TopologyHint, lifecycle.PodAdmitResult) {
	hint := mergeProvidersHints(p, p.numaNodes, providersHints)
	admit := p.canAdmitPodResult(&hint)
	return hint, admit
}
/topologymanager/policy_restricted.go
func (p *restrictedPolicy) Merge(providersHints []map[string][]TopologyHint) (TopologyHint, lifecycle.PodAdmitResult) {
	hint := mergeProvidersHints(p, p.numaNodes, providersHints)
	admit := p.canAdmitPodResult(&hint)
	return hint, admit
}
/topologymanager/policy_single_numa_node.go
func (p *singleNumaNodePolicy) Merge(providersHints []map[string][]TopologyHint) (TopologyHint, lifecycle.PodAdmitResult) {
	hint := mergeProvidersHints(p, p.numaNodes, providersHints)
	admit := p.canAdmitPodResult(&hint)
	return hint, admit
}
As you can see, the Merge functions in the last three files all call mergeProvidersHints. Following that function, its purpose is to merge all the hints and find the best one.
It works in three main phases:
func mergeProvidersHints(policy Policy, numaNodes []int, providersHints []map[string][]TopologyHint) TopologyHint {
.....
}
2.1 The first phase sets up the default affinity: a bitmask built from the NUMA nodes present in the system, used later to initialize the affinity computation.
defaultAffinity, _ := bitmask.NewBitMask(numaNodes...)
2.2 The second phase iterates over the hints from every provider, collecting and storing them. If a provider supplied no hints at all, it is treated as having no preference about the allocation. The result is allProviderHints:
var allProviderHints [][]TopologyHint
for _, hints := range providersHints {
	// If hints is nil, insert a single, preferred any-numa hint into allProviderHints.
	if len(hints) == 0 {
		allProviderHints = append(allProviderHints, []TopologyHint{{nil, true}})
		continue
	}
	// Otherwise, accumulate the hints for each resource type into allProviderHints.
	for resource := range hints {
		if hints[resource] == nil {
			allProviderHints = append(allProviderHints, []TopologyHint{{nil, true}})
			continue
		}
		if len(hints[resource]) == 0 {
			allProviderHints = append(allProviderHints, []TopologyHint{{nil, false}})
			continue
		}
		allProviderHints = append(allProviderHints, hints[resource])
	}
}
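The nil-versus-empty distinction in this loop is subtle: a nil hint list means "no preference, any NUMA node is fine" ({nil, true}), while an empty but non-nil list means "the provider has a preference it cannot satisfy" ({nil, false}). A minimal stand-alone sketch of the same normalization, using a simplified hint type rather than the kubelet's TopologyHint:

```go
package main

import "fmt"

// hint is a simplified stand-in for TopologyHint: a nil affinity
// pointer means "any NUMA node".
type hint struct {
	affinity  *uint8
	preferred bool
}

// normalize mirrors the accumulation loop above (a sketch, not the
// kubelet code): no hints at all, or a nil list for a resource, maps
// to {nil, true}; an empty non-nil list maps to {nil, false}.
func normalize(hints map[string][]hint) [][]hint {
	if len(hints) == 0 {
		return [][]hint{{{nil, true}}}
	}
	var all [][]hint
	for resource := range hints {
		switch {
		case hints[resource] == nil:
			all = append(all, []hint{{nil, true}})
		case len(hints[resource]) == 0:
			all = append(all, []hint{{nil, false}})
		default:
			all = append(all, hints[resource])
		}
	}
	return all
}

func main() {
	fmt.Println(normalize(nil)[0][0].preferred)                           // true: no hints at all
	fmt.Println(normalize(map[string][]hint{"cpu": nil})[0][0].preferred) // true: nil list
	fmt.Println(normalize(map[string][]hint{"cpu": {}})[0][0].preferred)  // false: empty list
}
```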
2.3 The third phase walks every combination of topology hints in the two-dimensional allProviderHints, combining the affinity bitmasks with a bitwise AND, and finally returns, as the source comment puts it, the "hint with the narrowest NUMANodeAffinity of all merged permutations that have at least one NUMA ID set". Upstream explains narrowest as: "Narrowest in this case means the least number of NUMA nodes required to satisfy the resource request." The code below shows what the hint with the narrowest NUMANodeAffinity means.
For each hint in a permutation, its NUMANodeAffinity is collected into numaAffinities (falling back to the default affinity when it is nil), with special handling for the single-numa-node policy:
preferred := true
var numaAffinities []bitmask.BitMask
for _, hint := range permutation {
	if hint.NUMANodeAffinity == nil {
		numaAffinities = append(numaAffinities, defaultAffinity)
	} else {
		numaAffinities = append(numaAffinities, hint.NUMANodeAffinity)
	}
	if !hint.Preferred {
		preferred = false
	}
	// Special case PolicySingleNumaNode to only prefer hints where
	// all providers have a single NUMA affinity set.
	if policy != nil && policy.Name() == PolicySingleNumaNode && hint.NUMANodeAffinity != nil && hint.NUMANodeAffinity.Count() > 1 {
		preferred = false
	}
}
Next, the collected numaAffinities are combined with a bitwise AND. This is what makes the result the narrowest NUMANodeAffinity of all merged permutations: after the AND, a bit in the merged mask stays set only if every provider hint in the permutation includes that NUMA node. The result is mergedHint, which then goes through a series of checks; if it survives them, bestHint = mergedHint, meaning the merged affinity is currently the best candidate:
// Merge the affinities using a bitwise-and operation.
mergedAffinity, _ := bitmask.NewBitMask(numaNodes...)
mergedAffinity.And(numaAffinities...)

// Build a mergedHint from the merged affinity mask, indicating if a
// preferred allocation was used to generate the affinity mask or not.
mergedHint := TopologyHint{mergedAffinity, preferred}

// Only consider mergedHints that result in a NUMANodeAffinity > 0 to
// replace the current bestHint.
if mergedHint.NUMANodeAffinity.Count() == 0 {
	return
}

// If the current bestHint is non-preferred and the new mergedHint is
// preferred, always choose the preferred hint over the non-preferred one.
if mergedHint.Preferred && !bestHint.Preferred {
	bestHint = mergedHint
	return
}

// If the current bestHint is preferred and the new mergedHint is
// non-preferred, never update bestHint, regardless of mergedHint's
// narrowness.
if !mergedHint.Preferred && bestHint.Preferred {
	return
}

// If mergedHint and bestHint have the same preference, only consider
// mergedHints that have a narrower NUMANodeAffinity than the
// NUMANodeAffinity in the current bestHint.
if !mergedHint.NUMANodeAffinity.IsNarrowerThan(bestHint.NUMANodeAffinity) {
	return
}

// In all other cases, update bestHint.
bestHint = mergedHint
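Putting the two steps together: merging one permutation is a bitwise AND over the providers' affinity masks plus a logical AND over their Preferred flags, and "narrowest" is simply a popcount comparison (IsNarrowerThan counts set bits). A simplified, self-contained sketch with uint8 masks and hypothetical types, not the kubelet's:

```go
package main

import (
	"fmt"
	"math/bits"
)

// hint is a simplified stand-in for TopologyHint: a uint8 affinity
// mask (bit i = NUMA node i) plus a Preferred flag.
type hint struct {
	affinity  uint8
	preferred bool
}

// mergePermutation AND-combines one hint from each provider, as the
// merge loop in mergeProvidersHints does: a NUMA node survives only if
// every provider's hint includes it, and the result is preferred only
// if every hint in the permutation is preferred.
func mergePermutation(defaultAffinity uint8, permutation []hint) hint {
	merged := defaultAffinity
	preferred := true
	for _, h := range permutation {
		merged &= h.affinity
		preferred = preferred && h.preferred
	}
	return hint{merged, preferred}
}

func main() {
	defaultAffinity := uint8(0b11) // a system with two NUMA nodes
	// One permutation: provider A prefers node 0; provider B accepts either node.
	perm := []hint{
		{0b01, true}, // e.g. CPU manager: node 0 only
		{0b11, true}, // e.g. device manager: node 0 or 1
	}
	m := mergePermutation(defaultAffinity, perm)
	// "Narrowest" compares the popcount of the merged masks.
	fmt.Printf("merged=%02b preferred=%v nodes=%d\n",
		m.affinity, m.preferred, bits.OnesCount8(m.affinity))
}
```

Note that disjoint hints (say 0b01 and 0b10) merge to an empty mask, which is exactly the `Count() == 0` case the real code skips.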
The third question goes hand in hand with the second.
The call graph of the Topology Manager (from https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0035-20190130-topology-manager.md):
The topology manager implements the pod admit handler interface used by the kubelet's pod admission. When Admit is called, the topology manager starts collecting hints from the other components.
Admit is defined in /topologymanager/topology_manager.go:
func (m *manager) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	klog.Infof("[topologymanager] Topology Admit Handler")
	if m.policy.Name() == "none" {
		klog.Infof("[topologymanager] Skipping calculate topology affinity as policy: none")
		return lifecycle.PodAdmitResult{
			Admit: true,
		}
	}
	pod := attrs.Pod
	c := make(map[string]TopologyHint)
	for _, container := range append(pod.Spec.InitContainers, pod.Spec.Containers...) {
		result, admitPod := m.calculateAffinity(*pod, container)
		if !admitPod.Admit {
			return admitPod
		}
		c[container.Name] = result
	}
	m.podTopologyHints[string(pod.UID)] = c
	klog.Infof("[topologymanager] Topology Affinity for Pod: %v are %v", pod.UID, m.podTopologyHints[string(pod.UID)])
	return lifecycle.PodAdmitResult{
		Admit: true,
	}
}
The topology manager relies on the preferred-allocation information supplied by the hint providers to compute the best NUMA nodes for an allocation. This approach has a big precondition: which NUMA node a pod's container prefers must be known in advance, which is hard to achieve in practice. First, the resource demand of the application running in a container varies a lot, so it is difficult to determine the application's resource allocation up front and emit a hint; NUMA-aware resource allocation today is mostly dynamic, deciding a container's allocation from the application's real-time resource demand. Second, the NUMA problem is not only a NUMA locality problem: for most applications local access gives the best performance, but for some large-scale memory-intensive applications local access increases the pressure on the local memory controller and can actually degrade performance. So from a usability standpoint, the topology manager still has plenty of room to improve.