Affinity-Constrained Scheduling
author:XiaoYang
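1 Introduction
Before I had analyzed and really understood the scheduler source code, I ran into several puzzles when configuring affinity, largely because the official and third-party documentation is unclear on a number of points. For example:
What matching operation underlies the affinity operator "In"? Is it regular-expression matching? And how are "Gt"/"Lt" implemented?
Every document I could find lists three topologyKey values for pod affinity:
kubernetes.io/hostname
failure-domain.beta.kubernetes.io/zone
failure-domain.beta.kubernetes.io/region
Why? Are only these three keys really supported? Can a custom key be used? What is the difference between Pod affinity and Node affinity? And what does Pod affinity actually match against, and what is the logic behind it?
Have you had similar questions? Once you clearly understand how the code implements the logic, everything becomes explicit and the "tacit knowledge" problem goes away.
I hope this article helps resolve some of the confusion around Kubernetes affinity.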
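1.1 Constrained scheduling
Before diving into the source, here is some background on Kubernetes scheduling that makes the affinity code easier to follow:
- The purpose of affinity is to let users schedule a pod onto a specific Node on demand. I call this "constrained scheduling".
- Three mechanisms are commonly used for constrained scheduling:
  - NodeSelector / NodeName: node label selector and "nodeName" matching
  - Affinity (Node/Pod/Service)
  - Taint / Toleration
- This article is about affinity, which comes in three types: Node, Pod, and Service. The tables below map the predicate (filtering) and priority (scoring) policies implemented in the code to the corresponding Pod.Spec configuration (detailed analysis follows):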
Predicate policy | Pod.Spec configuration | Category | Order |
---|---|---|---|
MatchNodeSelectorPred | NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution | Node | 6 |
MatchInterPodAffinityPred | PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution, PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution | Pod | 22 |
CheckServiceAffinityPred | | Service | 12 |
Priority policy | Pod.Spec configuration | Default weight |
---|---|---|
InterPodAffinityPriority | PodAffinity.PreferredDuringSchedulingIgnoredDuringExecution | 1 |
NodeAffinityPriority | NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution | 1 |
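1.2 The labels.Selector label selector
The labels selector is the most basic building block underneath the affinity code; both nodeAffinity and podAffinity rely on it. When you define a pod in a YAML deployment and configure its affinity, you must specify match expressions, and these ultimately evaluate conditions against the labels of a Node or pod. That label matching is done by labels.Selector, which is shared code. I therefore cover this lowest-level matching logic first, so that the source analysis that follows is easier to understand.
The labels.Selector interface is defined as follows; the key method is Matches().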
!FILENAME vendor/k8s.io/apimachinery/pkg/labels/selector.go:36
type Selector interface {
Matches(Labels) bool
Empty() bool
String() string
Add(r ...Requirement) Selector
Requirements() (requirements Requirements, selectable bool)
DeepCopySelector() Selector
}
Now look at the callers. Each of the functions below calls labels.NewSelector() to instantiate and return a labels.Selector object.
func LabelSelectorAsSelector(ps *LabelSelector) (labels.Selector, error) {
...
selector := labels.NewSelector()
...
}
func NodeSelectorRequirementsAsSelector(nsm []v1.NodeSelectorRequirement) (labels.Selector, error) {
...
selector := labels.NewSelector()
...
}
func TopologySelectorRequirementsAsSelector(tsm []v1.TopologySelectorLabelRequirement) (labels.Selector, error) {
...
selector := labels.NewSelector()
...
}
NewSelector returns an internalSelector, which is simply a list of Requirement values (a Requirement is a required condition).
!FILENAME vendor/k8s.io/apimachinery/pkg/labels/selector.go:79
func NewSelector() Selector {
return internalSelector(nil)
}
type internalSelector []Requirement
internalSelector.Matches() iterates over the list and calls Requirement.Matches() on each entry; every requirement must match.
!FILENAME vendor/k8s.io/apimachinery/pkg/labels/selector.go:340
func (lsel internalSelector) Matches(l Labels) bool {
for ix := range lsel {
// lsel[ix] is a Requirement
if matches := lsel[ix].Matches(l); !matches {
return false
}
}
return true
}
Next, the Requirement struct (key, operator, values). This is the affinity match expression that you configure.
!FILENAME vendor/k8s.io/apimachinery/pkg/labels/selector.go:114
type Requirement struct {
key string
operator selection.Operator
// In huge majority of cases we have at most one value here.
// It is generally faster to operate on a single-element slice
// than on a single-element map, so we have a slice here.
strValues []string
}
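Requirement.Matches() is where the expression is actually evaluated: based on the operator, it checks the key and value(s) against the labels and returns whether they match. This answers the opening questions: In/NotIn test whether the label value is exactly equal to one of the listed strings (no regular expressions), and Gt/Lt parse both sides as base-10 integers with strconv.ParseInt before comparing.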
!FILENAME vendor/k8s.io/apimachinery/pkg/labels/selector.go:192
func (r *Requirement) Matches(ls Labels) bool {
switch r.operator {
case selection.In, selection.Equals, selection.DoubleEquals:
if !ls.Has(r.key) { //IN
return false
}
return r.hasValue(ls.Get(r.key))
case selection.NotIn, selection.NotEquals: //NotIn
if !ls.Has(r.key) {
return true
}
return !r.hasValue(ls.Get(r.key))
case selection.Exists: //Exists
return ls.Has(r.key)
case selection.DoesNotExist: //NotExists
return !ls.Has(r.key)
case selection.GreaterThan, selection.LessThan: // Gt, Lt
if !ls.Has(r.key) {
return false
}
lsValue, err := strconv.ParseInt(ls.Get(r.key), 10, 64) // the label value must be a string that parses as an integer
if err != nil {
klog.V(10).Infof("ParseInt failed for value %+v in label %+v, %+v", ls.Get(r.key), ls, err)
return false
}
// There should be only one strValue in r.strValues, and can be converted to a integer.
if len(r.strValues) != 1 {
klog.V(10).Infof("Invalid values count %+v of requirement %#v, for 'Gt', 'Lt' operators, exactly one value is required", len(r.strValues), r)
return false
}
var rValue int64
for i := range r.strValues {
rValue, err = strconv.ParseInt(r.strValues[i], 10, 64)
if err != nil {
klog.V(10).Infof("ParseInt failed for value %+v in requirement %#v, for 'Gt', 'Lt' operators, the value must be an integer", r.strValues[i], r)
return false
}
}
return (r.operator == selection.GreaterThan && lsValue > rValue) || (r.operator == selection.LessThan && lsValue < rValue)
default:
return false
}
}
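To make the operator semantics concrete, here is a minimal, self-contained sketch that mirrors the switch above using only the standard library. The names `requirement`, `contains`, and `matchesAll` are hypothetical stand-ins for illustration, not the apimachinery API, and the Equals-style operators are left out. It shows that `In` is a plain string-equality test against the value list and that `Gt`/`Lt` compare parsed integers.

```go
package main

import (
	"fmt"
	"strconv"
)

// requirement is a simplified stand-in for labels.Requirement (hypothetical type).
type requirement struct {
	key      string
	operator string // "In", "NotIn", "Exists", "DoesNotExist", "Gt", "Lt"
	values   []string
}

// matches mirrors the branch structure of Requirement.Matches shown above.
func (r requirement) matches(labels map[string]string) bool {
	val, has := labels[r.key]
	switch r.operator {
	case "In":
		return has && contains(r.values, val)
	case "NotIn":
		return !has || !contains(r.values, val)
	case "Exists":
		return has
	case "DoesNotExist":
		return !has
	case "Gt", "Lt":
		if !has || len(r.values) != 1 {
			return false
		}
		lv, err1 := strconv.ParseInt(val, 10, 64)
		rv, err2 := strconv.ParseInt(r.values[0], 10, 64)
		if err1 != nil || err2 != nil {
			return false
		}
		if r.operator == "Gt" {
			return lv > rv
		}
		return lv < rv
	default:
		return false
	}
}

// contains reports whether s is exactly equal to one of vals.
func contains(vals []string, s string) bool {
	for _, v := range vals {
		if v == s {
			return true
		}
	}
	return false
}

// matchesAll ANDs the requirements together, like internalSelector.Matches.
func matchesAll(reqs []requirement, labels map[string]string) bool {
	for _, r := range reqs {
		if !r.matches(labels) {
			return false
		}
	}
	return true
}

func main() {
	node := map[string]string{"zone": "prd-zone-A", "cpu-count": "16"}
	fmt.Println(requirement{"zone", "In", []string{"prd-zone-A", "prd-zone-B"}}.matches(node)) // true
	fmt.Println(requirement{"zone", "In", []string{"prd-zone-.*"}}.matches(node))              // false: compared literally
	fmt.Println(requirement{"cpu-count", "Gt", []string{"8"}}.matches(node))                   // true
}
```

The second call is the interesting one: a value that looks like a pattern is compared character for character, so it does not match.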
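Note:
Besides the label selector there are also NodeSelector, FieldsSelector, PropertySelector, and others. They are all similar implementations of a Selector interface with essentially the same logic, and are explained where they appear in the source analysis below.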
2 Node affinity
Node affinity basics:
Sample YAML configuration:
---
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: # the pod is deployed in prd-zone-A or prd-zone-B
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/prd-zone-name
            operator: In
            values:
            - prd-zone-A
            - prd-zone-B
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: securityZone
            operator: In
            values:
            - BussinssZone
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0
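2.1 Node affinity predicate policy MatchNodeSelectorPred
Policy description:
Selects matching Nodes (by Node labels) for the pod being scheduled, based on its NodeSelector and NodeAffinity definitions.
Applicable NodeAffinity setting:
NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution
Predicate source analysis:
- Policy registration: defaults.init() registers the predicate MatchNodeSelectorPred, whose policy function is PodMatchNodeSelector()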
!FILENAME: pkg/scheduler/algorithmprovider/defaults/defaults.go:78
func init() {
...
factory.RegisterFitPredicate(predicates.MatchNodeSelectorPred, predicates.PodMatchNodeSelector)
...
}
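- Policy function: PodMatchNodeSelector()
Gets the target Node and calls podMatchesNodeSelectorAndAffinityTerms() to match the pod being scheduled against that node. Returns true on a match; otherwise returns false together with the failure reason.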
!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:853
func PodMatchNodeSelector(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
// Get the node
node := nodeInfo.Node()
if node == nil {
return false, nil, fmt.Errorf("node not found")
}
// Key helper function.
// Arguments: the pod being scheduled and the node obtained above (the node under test).
if podMatchesNodeSelectorAndAffinityTerms(pod, node) {
return true, nil, nil
}
return false, []algorithm.PredicateFailureReason{ErrNodeSelectorNotMatch}, nil
}
podMatchesNodeSelectorAndAffinityTerms()
Checks the "required" conditions defined by NodeSelector and NodeAffinity.
!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:807
func podMatchesNodeSelectorAndAffinityTerms(pod *v1.Pod, node *v1.Node) bool {
// If NodeSelector is set, check that the node labels satisfy every entry it defines.
if len(pod.Spec.NodeSelector) > 0 {
selector := labels.SelectorFromSet(pod.Spec.NodeSelector)
if !selector.Matches(labels.Set(node.Labels)) {
return false
}
}
// If NodeAffinity is set, match it via nodeMatchesNodeSelectorTerms() (analyzed in detail below).
nodeAffinityMatches := true
affinity := pod.Spec.Affinity
if affinity != nil && affinity.NodeAffinity != nil {
nodeAffinity := affinity.NodeAffinity
if nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil {
return true
}
if nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
nodeSelectorTerms := nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
klog.V(10).Infof("Match for RequiredDuringSchedulingIgnoredDuringExecution node selector terms %+v", nodeSelectorTerms)
// Key function: nodeMatchesNodeSelectorTerms()
nodeAffinityMatches = nodeAffinityMatches && nodeMatchesNodeSelectorTerms(node, nodeSelectorTerms)
}
}
return nodeAffinityMatches
}
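Note:
If both NodeSelector and NodeAffinity.Required... are configured, both must evaluate to true. If NodeSelector fails, false is returned immediately and NodeAffinity is not evaluated.
If multiple NodeSelectorTerms are specified, the node only needs to satisfy one of them. If multiple MatchExpressions are specified within a term, the node must satisfy all of them.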
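The combination rules in the note can be sketched in a few lines. This is a simplified model under stated assumptions, not the scheduler's code: the types `expr` and `term` and the function `nodeFits` are hypothetical, only the `In` operator is modeled, and a nil `required` slice stands for "no required node affinity configured".

```go
package main

import "fmt"

// expr is a simplified "key In values" expression (hypothetical type).
type expr struct {
	key    string
	values []string
}

// term models one nodeSelectorTerm: all of its expressions must match (AND).
type term []expr

func (e expr) matches(labels map[string]string) bool {
	v, ok := labels[e.key]
	if !ok {
		return false
	}
	for _, want := range e.values {
		if v == want {
			return true
		}
	}
	return false
}

func (t term) matches(labels map[string]string) bool {
	if len(t) == 0 {
		return false // an empty term selects nothing
	}
	for _, e := range t {
		if !e.matches(labels) {
			return false
		}
	}
	return true
}

// nodeFits applies the rules from the note: every nodeSelector pair must
// match, and if required terms are configured at least one must match (OR).
func nodeFits(nodeSelector map[string]string, required []term, labels map[string]string) bool {
	for k, v := range nodeSelector {
		if got, ok := labels[k]; !ok || got != v {
			return false // nodeSelector failed; affinity is not evaluated
		}
	}
	if required == nil {
		return true // no required node affinity configured
	}
	for _, t := range required {
		if t.matches(labels) {
			return true
		}
	}
	return false
}

func main() {
	labels := map[string]string{"disk": "ssd", "zone": "prd-zone-B"}
	required := []term{
		{{key: "zone", values: []string{"prd-zone-A"}}, {key: "disk", values: []string{"ssd"}}},
		{{key: "zone", values: []string{"prd-zone-B"}}},
	}
	fmt.Println(nodeFits(nil, required, labels))                              // true: the second term matches
	fmt.Println(nodeFits(map[string]string{"disk": "hdd"}, required, labels)) // false: nodeSelector fails first
}
```

In the first call the first term fails because its zone expression does not match, even though its disk expression does; the node still fits because the second term matches on its own.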
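nodeMatchesNodeSelectorTerms()
Calls v1helper.MatchNodeSelectorTerms() to check whether the node satisfies the required conditions defined by the NodeSelectorTerms.
The configuration falls into two kinds (matchExpressions/matchFields):
- "requiredDuringSchedulingIgnoredDuringExecution.matchExpressions" checks (match on label key and value)
- "requiredDuringSchedulingIgnoredDuringExecution.matchFields" checks (compare the field value only, with no key-existence check)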
!FILENAME pkg/scheduler/algorithm/predicates/predicates.go:797
func nodeMatchesNodeSelectorTerms(node *v1.Node, nodeSelectorTerms []v1.NodeSelectorTerm) bool {
nodeFields := map[string]string{}
// Collect the fields of the node under test
for k, f := range algorithm.NodeFieldSelectorKeys {
nodeFields[k] = f(node)
}
// Call v1helper.MatchNodeSelectorTerms()
// Arguments: nodeSelectorTerms  the required terms from the affinity configuration
//            labels             the labels of the node under test
//            fields             the fields of the node under test
return v1helper.MatchNodeSelectorTerms(nodeSelectorTerms, labels.Set(node.Labels), fields.Set(nodeFields))
}
// pkg/apis/core/v1/helper/helpers.go:302
func MatchNodeSelectorTerms( nodeSelectorTerms []v1.NodeSelectorTerm,
nodeLabels labels.Set, nodeFields fields.Set,) bool {
for _, req := range nodeSelectorTerms {
// nil or empty term selects no objects
if len(req.MatchExpressions) == 0 && len(req.MatchFields) == 0 {
continue
}
// Match the MatchExpressions conditions ①
if len(req.MatchExpressions) != 0 {
labelSelector, err := NodeSelectorRequirementsAsSelector(req.MatchExpressions)
if err != nil || !labelSelector.Matches(nodeLabels) {
continue
}
}
// Match the MatchFields conditions ②
if len(req.MatchFields) != 0 {
fieldSelector, err := NodeSelectorRequirementsAsFieldSelector(req.MatchFields)
if err != nil || !fieldSelector.Matches(nodeFields) {
continue
}
}
return true
}
return false
}
① NodeSelectorRequirementsAsSelector()
Converts the expressions configured under "requiredDuringSchedulingIgnoredDuringExecution.matchExpressions" into a labels.Selector instance (see section 1.2 at the start of this article).
!FILENAME: pkg/apis/core/v1/helper/helpers.go:222
func NodeSelectorRequirementsAsSelector(nsm []v1.NodeSelectorRequirement) (labels.Selector, error) {
if len(nsm) == 0 {
return labels.Nothing(), nil
}
selector := labels.NewSelector()
for _, expr := range nsm {
var op selection.Operator
switch expr.Operator {
case v1.NodeSelectorOpIn:
op = selection.In
case v1.NodeSelectorOpNotIn:
op = selection.NotIn
case v1.NodeSelectorOpExists:
op = selection.Exists
case v1.NodeSelectorOpDoesNotExist:
op = selection.DoesNotExist
case v1.NodeSelectorOpGt:
op = selection.GreaterThan
case v1.NodeSelectorOpLt:
op = selection.LessThan
default:
return nil, fmt.Errorf("%q is not a valid node selector operator", expr.Operator)
}
// The three key elements of an expression: expr.Key, op, expr.Values
r, err := labels.NewRequirement(expr.Key, op, expr.Values)
if err != nil {
return nil, err
}
selector = selector.Add(*r)
}
return selector, nil
}
② NodeSelectorRequirementsAsFieldSelector()
Converts the expressions configured under "requiredDuringSchedulingIgnoredDuringExecution.matchFields" into a fields.Selector instance.
!FILENAME pkg/apis/core/v1/helper/helpers.go:256
func NodeSelectorRequirementsAsFieldSelector(nsm []v1.NodeSelectorRequirement) (fields.Selector, error) {
if len(nsm) == 0 {
return fields.Nothing(), nil
}
selectors := []fields.Selector{}
for _, expr := range nsm {
switch expr.Operator {
case v1.NodeSelectorOpIn:
if len(expr.Values) != 1 {
return nil, fmt.Errorf("unexpected number of value (%d) for node field selector operator %q",
len(expr.Values), expr.Operator)
}
selectors = append(selectors, fields.OneTermEqualSelector(expr.Key, expr.Values[0]))
case v1.NodeSelectorOpNotIn:
if len(expr.Values) != 1 {
return nil, fmt.Errorf("unexpected number of value (%d) for node field selector operator %q",
len(expr.Values), expr.Operator)
}
selectors = append(selectors, fields.OneTermNotEqualSelector(expr.Key, expr.Values[0]))
default:
return nil, fmt.Errorf("%q is not a valid node field selector operator", expr.Operator)
}
}
return fields.AndSelectors(selectors...), nil
}
- Key data structures
Definitions of the NodeSelector-related structs:
!FILENAME vendor/k8s.io/api/core/v1/types.go:2436
type NodeSelector struct {
NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms" protobuf:"bytes,1,rep,name=nodeSelectorTerms"`
}
type NodeSelectorTerm struct {
MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty" protobuf:"bytes,1,rep,name=matchExpressions"`
MatchFields []NodeSelectorRequirement `json:"matchFields,omitempty" protobuf:"bytes,2,rep,name=matchFields"`
}
type NodeSelectorRequirement struct {
Key string `json:"key" protobuf:"bytes,1,opt,name=key"`
Operator NodeSelectorOperator `json:"operator" protobuf:"bytes,2,opt,name=operator,casttype=NodeSelectorOperator"`
Values []string `json:"values,omitempty" protobuf:"bytes,3,rep,name=values"`
}
type NodeSelectorOperator string
const (
NodeSelectorOpIn NodeSelectorOperator = "In"
NodeSelectorOpNotIn NodeSelectorOperator = "NotIn"
NodeSelectorOpExists NodeSelectorOperator = "Exists"
NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
NodeSelectorOpGt NodeSelectorOperator = "Gt"
NodeSelectorOpLt NodeSelectorOperator = "Lt"
)
Struct definitions of the FieldsSelector implementations (they match on value):
!FILENAME vendor/k8s.io/apimachinery/pkg/fields/selector.go:78
type hasTerm struct {
field, value string
}
func (t *hasTerm) Matches(ls Fields) bool {
return ls.Get(t.field) == t.value
}
type notHasTerm struct {
field, value string
}
func (t *notHasTerm) Matches(ls Fields) bool {
return ls.Get(t.field) != t.value
}
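The field-selector path differs from the label path in two ways that are easy to miss: only `In` and `NotIn` are accepted, each with exactly one value, and the comparison is a plain equality test on the field value. The sketch below models this with hypothetical types (`fieldReq`, `fieldTerm`) and helpers (`toFieldTerms`, `matchAllTerms`); the `metadata.name` key is used only as an illustrative field name. Unlike the real helper, it does not model the match-nothing selector returned for an empty list.

```go
package main

import (
	"errors"
	"fmt"
)

// fieldReq is a simplified matchFields entry (hypothetical type).
type fieldReq struct {
	key      string
	operator string // only "In" and "NotIn" are accepted for fields
	values   []string
}

// fieldTerm compares one field value for equality (or inequality when negate is set).
type fieldTerm struct {
	field, value string
	negate       bool
}

func (t fieldTerm) matches(fields map[string]string) bool {
	// A missing field reads as "", so only the value is compared.
	return (fields[t.field] == t.value) != t.negate
}

// toFieldTerms mirrors the validation in NodeSelectorRequirementsAsFieldSelector.
func toFieldTerms(reqs []fieldReq) ([]fieldTerm, error) {
	terms := make([]fieldTerm, 0, len(reqs))
	for _, r := range reqs {
		if r.operator != "In" && r.operator != "NotIn" {
			return nil, fmt.Errorf("%q is not a valid node field selector operator", r.operator)
		}
		if len(r.values) != 1 {
			return nil, errors.New("field selectors take exactly one value")
		}
		terms = append(terms, fieldTerm{r.key, r.values[0], r.operator == "NotIn"})
	}
	return terms, nil
}

// matchAllTerms ANDs the terms together, in the spirit of fields.AndSelectors.
func matchAllTerms(terms []fieldTerm, fields map[string]string) bool {
	for _, t := range terms {
		if !t.matches(fields) {
			return false
		}
	}
	return true
}

func main() {
	nodeFields := map[string]string{"metadata.name": "node-1"}

	terms, err := toFieldTerms([]fieldReq{{"metadata.name", "In", []string{"node-1"}}})
	fmt.Println(err, matchAllTerms(terms, nodeFields))

	// Two values are rejected even though the operator is called "In".
	_, err = toFieldTerms([]fieldReq{{"metadata.name", "In", []string{"node-1", "node-2"}}})
	fmt.Println(err)
}
```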
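2.2 Node affinity priority policy NodeAffinityPriority
Policy description:
Matches the candidate Nodes against the conditions in the affinity configuration of the pod being scheduled, and scores them.
Applicable NodeAffinity setting:
NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution
Priority source analysis:
- Policy registration: defaultPriorities() registers a priority named "NodeAffinityPriority" together with its two functions, Map and Reduce:
  - CalculateNodeAffinityPriorityMap(): the map step; matches a candidate Node against the affinity terms and computes its weighted score.
  - CalculateNodeAffinityPriorityReduce(): the reduce step; rescales the scores into the range 0 to 10.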
!FILENAME: pkg/scheduler/algorithmprovider/defaults/defaults.go:266
// k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults/defaults.go
func defaultPriorities() sets.String {
...
factory.RegisterPriorityFunction2("NodeAffinityPriority", priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1),
...
}
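- Policy functions:
Map step
CalculateNodeAffinityPriorityMap()
Iterates over the terms defined in affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution, builds a selector (labels.Selector) from each term, and matches it against the labels of the candidate Node. If it matches, the term's Weight is added to a running total. The function returns the resulting score for that candidate Node.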
!FILENAME pkg/scheduler/algorithm/priorities/node_affinity.go:34
func CalculateNodeAffinityPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
// Get the node under test
node := nodeInfo.Node()
if node == nil {
return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
}
// Default to the Affinity configured in the pod Spec
affinity := pod.Spec.Affinity
if priorityMeta, ok := meta.(*priorityMetadata); ok {
// We were able to parse metadata, use affinity from there.
affinity = priorityMeta.affinity
}
var count int32
if affinity != nil && affinity.NodeAffinity != nil && affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution != nil {
// Iterate over the preferred terms defined in PreferredDuringSchedulingIgnoredDuringExecution
for i := range affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
preferredSchedulingTerm := &affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution[i]
if preferredSchedulingTerm.Weight == 0 { // note: a term configured with weight 0 is skipped entirely
continue
}
// TODO: Avoid computing it for all nodes if this becomes a performance problem.
// Build a labels.Selector from the term's node affinity MatchExpressions.
nodeSelector, err := v1helper.NodeSelectorRequirementsAsSelector(preferredSchedulingTerm.Preference.MatchExpressions)
if err != nil {
return schedulerapi.HostPriority{}, err
}
if nodeSelector.Matches(labels.Set(node.Labels)) {
count += preferredSchedulingTerm.Weight
}
}
}
// Return the node's score
return schedulerapi.HostPriority{
Host: node.Name,
Score: int(count),
}, nil
}
Here we again see NodeSelectorRequirementsAsSelector(), analyzed in the predicate section above. It returns a labels.Selector instance, and selector.Matches() checks whether node.Labels satisfies the condition.
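As a worked illustration of the map step, the following sketch accumulates weights for a single node. It is a simplification under stated assumptions: `preferredTerm`, `rawScore`, and `hasAllPairs` are hypothetical names, and each term is reduced to a set of key=value pairs instead of full match expressions.

```go
package main

import "fmt"

// preferredTerm is a simplified PreferredSchedulingTerm (hypothetical type):
// the node must carry every key=value pair in want for the term to match.
type preferredTerm struct {
	weight int32
	want   map[string]string
}

// rawScore sums the weights of all matching terms and skips zero weights,
// following the loop in CalculateNodeAffinityPriorityMap shown above.
func rawScore(terms []preferredTerm, labels map[string]string) int {
	var count int32
	for _, t := range terms {
		if t.weight == 0 {
			continue
		}
		if hasAllPairs(t.want, labels) {
			count += t.weight
		}
	}
	return int(count)
}

// hasAllPairs reports whether labels contains every pair in want.
func hasAllPairs(want, labels map[string]string) bool {
	if len(want) == 0 {
		return false // an empty preference matches no node
	}
	for k, v := range want {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	terms := []preferredTerm{
		{weight: 1, want: map[string]string{"securityZone": "BussinssZone"}},
		{weight: 5, want: map[string]string{"disk": "ssd"}},
	}
	fmt.Println(rawScore(terms, map[string]string{"securityZone": "BussinssZone", "disk": "ssd"})) // 6
	fmt.Println(rawScore(terms, map[string]string{"disk": "ssd"}))                                 // 5
	fmt.Println(rawScore(terms, map[string]string{}))                                              // 0
}
```

These raw totals are not yet on the 0 to 10 scale; that rescaling happens in the reduce step.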
Reduce step
CalculateNodeAffinityPriorityReduce() rescales each node's score into the range 0 to 10.
It is built with NormalizeReduce(), with MaxPriority set to 10 and reverse set to false.
!FILENAME pkg/scheduler/algorithm/priorities/node_affinity.go:77
const MaxPriority = 10
var CalculateNodeAffinityPriorityReduce = NormalizeReduce(schedulerapi.MaxPriority, false)
NormalizeReduce()
- The resulting score falls in the range 0 to MaxPriority.
- When reverse is true, the final score = MaxPriority - normalized score.
!FILENAME pkg/scheduler/algorithm/priorities/reduce.go:29
func NormalizeReduce(maxPriority int, reverse bool) algorithm.PriorityReduceFunction {
return func(
_ *v1.Pod,
_ interface{},
_ map[string]*schedulercache.NodeInfo,
result schedulerapi.HostPriorityList) error {
var maxCount int
// Find the maximum score
for i := range result {
if result[i].Score > maxCount {
maxCount = result[i].Score
}
}
// If the maximum is 0 and reverse is true, set every score to maxPriority
if maxCount == 0 {
if reverse {
for i := range result {
result[i].Score = maxPriority
}
}
return nil
}
// normalized score = maxPriority * raw score / maximum score
// if reverse is true: maxPriority - normalized score
for i := range result {
score := result[i].Score
score = maxPriority * score / maxCount
if reverse {
score = maxPriority - score
}
result[i].Score = score
}
return nil
}
}
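A small worked example makes the arithmetic visible. The function `normalize` below is a hypothetical, slice-based restatement of the reduce logic above (it returns a new slice instead of mutating a HostPriorityList); note that the division is integer division, so fractional results are truncated.

```go
package main

import "fmt"

// normalize rescales raw scores into 0..maxPriority following the reduce
// function above; integer division truncates toward zero.
func normalize(scores []int, maxPriority int, reverse bool) []int {
	out := make([]int, len(scores))
	maxCount := 0
	for _, s := range scores {
		if s > maxCount {
			maxCount = s
		}
	}
	if maxCount == 0 {
		// Nothing to scale: keep the scores, or give every node the top
		// score when reverse is set.
		for i, s := range scores {
			if reverse {
				out[i] = maxPriority
			} else {
				out[i] = s
			}
		}
		return out
	}
	for i, s := range scores {
		v := maxPriority * s / maxCount
		if reverse {
			v = maxPriority - v
		}
		out[i] = v
	}
	return out
}

func main() {
	fmt.Println(normalize([]int{0, 3, 6}, 10, false)) // [0 5 10]
	fmt.Println(normalize([]int{1, 2, 3}, 10, false)) // [3 6 10]
	fmt.Println(normalize([]int{0, 3, 6}, 10, true))  // [10 5 0]
}
```

Because the node with the highest raw total always ends up at MaxPriority, the absolute size of the configured weights matters only relative to each other.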
...To be continued; see the next part on Pod affinity.
Please credit XiaoYang when reposting this article or its content.