These are study notes from reading 《Kubernetes权威指南》 (The Definitive Guide to Kubernetes).
Study notes:
Both the nodeSelector approach and affinity-based scheduling give the Pod the initiative in choosing a Node at creation time. In many scenarios, however, the Node also needs the initiative to choose which Pods may be deployed on it. The Taints and Tolerations mechanism provides exactly that: by setting Taints, a Node actively selects Pods with a matching property (those carrying Tolerations), and it also gains the initiative to reject and evict Pods.
Once Taints are set on a Node, only Pods that explicitly declare Tolerations for those taints are eligible to be scheduled onto that Node (the NoSchedule effect). In addition, a Node can set a Taint whose effect is NoExecute, which directly evicts any Pod already running on the Node that lacks the corresponding Toleration.
The K8s scheduler handles multiple Taints and Tolerations as follows: it lists all Taints on the node, ignores those matched by one of the Pod's Tolerations, and the remaining unmatched Taints determine the effect on the Pod:
①If any remaining Taint has effect NoSchedule, the Pod is not scheduled onto the node.
②If no remaining Taint has effect NoSchedule but at least one has PreferNoSchedule, the scheduler tries to avoid placing the Pod on the node.
③If any remaining Taint has effect NoExecute, the Pod is not scheduled onto the node, and if it is already running there, it is evicted.
The introduction of this mechanism further improves the flexibility of K8s scheduling.
Before the examples, check whether the current node already carries any Taints.
Use kubectl describe node:
# kubectl describe node miwifi-r4cm-srv
......
Taints: node-role.kubernetes.io/master:NoSchedule......
As shown, the Master node of a K8s cluster natively carries a Taint. This taint marks the node as the Master, and its effect is NoSchedule: no Pod that lacks the corresponding Toleration will be scheduled onto this node after creation. This is exactly how K8s keeps the Master node from serving as a worker node by default.
A node's Taints are removed much like Labels, by appending a trailing minus sign:
# kubectl taint nodes <node-name> <key>:<effect>-
After this taint is removed from the Master node, the Master can likewise be considered by the scheduler as a worker node.
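For example, the Master taint shown earlier can be removed as follows (a sketch using the node name from the kubectl describe output above; the trailing `-` removes the taint):

```shell
# Remove the built-in Master taint so the scheduler may place
# ordinary workloads on this node (node name from the earlier output).
kubectl taint nodes miwifi-r4cm-srv node-role.kubernetes.io/master:NoSchedule-
```

Re-running kubectl describe node afterwards should show the Taints field without this entry.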
Command format: # kubectl taint nodes <node-name> <key>=<value>:<effect>
where effect is what happens to a Pod that does not carry a Toleration matching this taint; it can be NoSchedule, PreferNoSchedule, or NoExecute.
Add 3 Taints to xu.node1.
Note that even when the key and value are identical, producing multiple effects requires defining multiple Taints:
# kubectl taint node xu.node1 key1=value1:NoSchedule
node/xu.node1 tainted
# kubectl taint node xu.node1 key1=value1:NoExecute
node/xu.node1 tainted
# kubectl taint node xu.node1 key2=value2:NoSchedule
node/xu.node1 tainted
The Node's description now includes the three taints just added:
# kubectl describe node xu.node1
......
Taints: key1=value1:NoExecute
key1=value1:NoSchedule
key2=value2:NoSchedule......
Create a new Pod: taint-test1.
This Pod defines two Tolerations, corresponding to the first two taints on xu.node1.
Note that for a Toleration to match a Taint, the key, value, and effect must all correspond exactly.
There are also two special cases: if operator is Exists, no value needs to be specified; and if effect is left empty, the Toleration matches all effects for that key.
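As a sketch of these special cases (toleration fragments only; the keys are illustrative, reusing the taint keys from above):

```yaml
tolerations:
- key: "key1"          # operator Exists: no value field needed; matches
  operator: "Exists"   # key1 taints with any value and any effect
- key: "key2"          # effect omitted: matches key2=value2 taints
  operator: "Equal"    # regardless of their effect
  value: "value2"
```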
apiVersion: v1
kind: Pod
metadata:
  name: taint-test1
spec:
  containers:
  - name: taint-test1
    image: busybox
  tolerations:
  - key: "key1"             # key1=value1:NoSchedule
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  - key: "key1"             # key1=value1:NoExecute
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
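Assuming the manifest above is saved as taint-test1.yaml (filename assumed), the Pod is created with:

```shell
kubectl create -f taint-test1.yaml
```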
After creation, use the kubectl describe command to view the Pod details.
The Pod's Tolerations now include the two Tolerations defined at creation time:
.......
Tolerations: key1=value1:NoSchedule
key1=value1:NoExecute
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s......
Because it does not tolerate the third Taint on xu.node1, the only worker node, this Pod fails to be scheduled:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling default-scheduler 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
A new Pod, taint-test2, is defined as follows (tolerations fragment only):
.....
tolerations:
- key: "key1"             # key1=value1:NoSchedule
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"             # key1=value1:NoExecute
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
- key: "key2"             # key2=value2:NoSchedule
  operator: "Equal"
  value: "value2"
  effect: "NoSchedule"
After creating this Pod, it is successfully scheduled onto xu.node1:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/taint-test2 to xu.node1
A problem appeared here: after a Taint with effect NoExecute was defined on the Node, a Pod that does tolerate the taints and gets scheduled onto the Node fails with the error below. My current guess is that this taint caused the K8s system containers on the worker node to stop running (the connection refused on 127.0.0.1:6784 suggests the CNI network pod was evicted); to be updated after further investigation.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/taint-test2 to xu.node1
Warning FailedCreatePodSandBox 15m kubelet, xu.node1 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "2f7147cb2cd74b4d2219fe3c8484b4ded1a5a361939b518b255df51368899cd6" network for pod "taint-test2": networkPlugin cni failed to set up pod "taint-test2_default" network: unable to allocate IP address: Post http://127.0.0.1:6784/ip/2f7147cb2cd74b4d2219fe3c8484b4ded1a5a361939b518b255df51368899cd6: dial tcp 127.0.0.1:6784: connect: connection refused, failed to clean up sandbox container "2f7147cb2cd74b4d2219fe3c8484b4ded1a5a361939b518b255df51368899cd6" network for pod "taint-test2": networkPlugin cni failed to teardown pod "taint-test2_default" network: Delete http://127.0.0.1:6784/ip/2f7147cb2cd74b4d2219fe3c8484b4ded1a5a361939b518b255df51368899cd6: dial tcp 127.0.0.1:6784: connect: connection refused]
Normal SandboxChanged 40s (x70 over 15m) kubelet, xu.node1 Pod sandbox changed, it will be killed and re-created.
Typical use cases for Taints and Tolerations:
①Reserving certain nodes for specific applications (some critical Pods need dedicated nodes set aside for them)
②Reserving nodes with special hardware for the Pods that genuinely need that hardware
③Node failure handling: K8s automatically adds Taints to a failing Node, gradually and in a rate-limited manner
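For use case ①, a minimal sketch (the node name node1 and the key/value dedicated=groupA are illustrative, not from the source):

```shell
# Reserve node1: only Pods tolerating this taint can be scheduled there.
kubectl taint nodes node1 dedicated=groupA:NoSchedule

# The reserved Pods then carry a matching toleration in their spec, e.g.:
#   tolerations:
#   - key: "dedicated"
#     operator: "Equal"
#     value: "groupA"
#     effect: "NoSchedule"
```

Note the taint only keeps other Pods off node1; to also keep the group A Pods off other nodes, combine it with nodeSelector or node affinity.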
Every Pod can be seen to carry these built-in Tolerations:
......
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s......
When the system automatically adds Taints to a failed node, the automatically set Tolerations above ensure the Pod keeps running for another 300s before being evicted.
To customize these Tolerations, the format is as follows:
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000
As a summary and review of the above, try deploying a Deployment.
The Pod it defines includes a Toleration for the Master node's built-in taint.
The configuration file is as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: taint-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "300m"
            memory: "64Mi"
          limits:
            cpu: "1000m"
            memory: "128Mi"
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: ""
After deploying this Deployment, the two Pod replicas are scheduled onto the Master node and the only worker node, respectively:
# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
taint-test-869c6dc8d5-qz5g9 1/1 Running 0 86s 10.32.0.4 miwifi-r4cm-srv
taint-test-869c6dc8d5-xbkcg 1/1 Running 0 86s 10.44.0.1 xu.node1