上篇文章《CNI插件最简实现之macvlan plugin》我们介绍了macvlan插件,通过使用与分析,我们知道:
flannel插件在实现上解决了上面列出的5个问题。
说了这么多优点,那flannel如何部署使用,具体怎样实现的呢?
这也是本篇文章要介绍的,这里罗列下面会介绍的内容:
flannel网络插件实现依赖的技术包括:
上面DaemonSet、ConfigMap、RBAC相关的内容后续会有相应的章节介绍,感兴趣的读者可以跳转链接阅读(链接待添加)。
这些组成大部分可以从yaml配置文件中看到,下面给出上述链接对应的配置文件,以及简单的注释:
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy #POD节点安全策略相关
metadata:
name: psp.flannel.unprivileged
annotations:
seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default
seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default
apparmor.security.beta.kubernetes.io/allowedProfileNames: runtime/default
apparmor.security.beta.kubernetes.io/defaultProfileName: runtime/default
spec:
privileged: true
volumes:
- configMap
- secret
- emptyDir
- hostPath
allowedHostPaths: #宿主机目录权限设置
- pathPrefix: "/etc/cni/net.d"
- pathPrefix: "/etc/kube-flannel"
- pathPrefix: "/run/flannel"
readOnlyRootFilesystem: false
# Users and groups
runAsUser:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
fsGroup:
rule: RunAsAny
# Privilege Escalation
allowPrivilegeEscalation: false
defaultAllowPrivilegeEscalation: false
# Capabilities
allowedCapabilities: ['NET_ADMIN']
defaultAddCapabilities: []
requiredDropCapabilities: []
# Host namespaces
hostPID: false
hostIPC: false
hostNetwork: true
hostPorts:
- min: 0
max: 65535
# SELinux
seLinux:
# SELinux is unused in CaaSP
rule: 'RunAsAny'
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
name: flannel #ClusterRole角色
rules:
- apiGroups: ['extensions']
resources: ['podsecuritypolicies'] #权限资源类型
verbs: ['use']
resourceNames: ['psp.flannel.unprivileged'] #权限资源名称
- apiGroups:
- ""
resources:
- pods
verbs:
- get
- apiGroups:
- ""
resources:
- nodes
verbs:
- list
- watch
- apiGroups:
- ""
resources:
- nodes/status
verbs:
- patch
---
kind: ClusterRoleBinding #权限绑定,给flannel(ServiceAccount)绑定flannel(ClusterRole)角色的权限
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
name: flannel
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: flannel
subjects:
- kind: ServiceAccount
name: flannel
namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount #创建ServiceAccount 账号
metadata:
name: flannel
namespace: kube-system
---
kind: ConfigMap #用于保存配置信息的键值对,主要用于给容器内应用程序提供配置
apiVersion: v1
metadata:
name: kube-flannel-cfg #这里定义了kube-flannel-cfg这个configmap 后面以存储卷的形式提供给后面的DaemonSet
namespace: kube-system
labels:
tier: node
app: flannel
data:
cni-conf.json: |
{
"name": "cbr0",
"cniVersion": "0.3.1",
"plugins": [
{
"type": "flannel", #cni插件类型
"delegate": { #委托,这里实际调用的是bridge插件
"hairpinMode": true, #支持hairpinMode 用于实现pod访问集群服务后,重新负载均衡到本pod。
"isDefaultGateway": true #设置cni0网关ip,同时设置pod节点默认网关为cni0的ip,同bridge插件说明。
}
},
{
"type": "portmap", #级联插件用于实现类似端口映射,nat的功能。
"capabilities": {
"portMappings": true
}
}
]
}
net-conf.json: |
{
"Network": "192.16.0.0/16", #集群pod节点使用的网络网段
"Backend": {
"Type": "vxlan" #backend的类型,这里使用vxlan,还可以udp/host-gw等
}
}
---
apiVersion: apps/v1
kind: DaemonSet #DaemonSet保障集群各个节点有一个副本
metadata:
name: kube-flannel-ds-amd64
namespace: kube-system
labels:
tier: node
app: flannel
spec:
selector:
matchLabels:
app: flannel
template:
metadata:
labels:
tier: node
app: flannel
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/os
operator: In
values:
- linux
- key: kubernetes.io/arch
operator: In
values:
- amd64
hostNetwork: true
tolerations:
- operator: Exists
effect: NoSchedule
serviceAccountName: flannel
initContainers:
- name: install-cni
image: quay.io/coreos/flannel:v0.12.0-amd64 #使用的flannel镜像版本
command:
- cp
args:
- -f
- /etc/kube-flannel/cni-conf.json
- /etc/cni/net.d/10-flannel.conflist #容器应用输入的cni配置文件
volumeMounts:
- name: cni
mountPath: /etc/cni/net.d
- name: flannel-cfg
mountPath: /etc/kube-flannel/
containers:
- name: kube-flannel
image: quay.io/coreos/flannel:v0.12.0-amd64
command:
- /opt/bin/flanneld #容器应用二进制 flanneld
args:
- --ip-masq #代表出公网(访问外部网络)要走snat
- --kube-subnet-mgr #代表使用kube的subnet-manager,有别于etcd的subnet-manager,该类型基于k8s的节点CIDR
resources:
requests:
cpu: "100m"
memory: "50Mi"
limits:
cpu: "100m"
memory: "50Mi"
securityContext:
privileged: true
capabilities:
add: ["NET_ADMIN"]
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
volumeMounts:
- name: run
mountPath: /run/flannel
- name: flannel-cfg
mountPath: /etc/kube-flannel/
volumes:
- name: run
hostPath:
path: /run/flannel #运行相关目录
- name: cni
hostPath:
path: /etc/cni/net.d #cni插件配置目录
- name: flannel-cfg
configMap:
name: kube-flannel-cfg #使用的configmap配置
由于我们之前已经安装了macvlan,并且部分容器已经加入了macvlan创建的网络,所以在使用flannel插件前,先要重置网络。
重置涉及的命令较多,后面会有一个安装/重置的章节专门说明;如果是按前一章介绍的macvlan方式安装的,可以通过这个方式重置:链接
安装flannel插件时,相对macvlan插件,是将配置直接写在yaml里面。我们这里提供了一个典型的flannel yaml配置,这个配置和上面给出的yaml文件是一致的。
下载下来后,只要执行:
kubectl apply -f kube-flannel.yml
配置文件里面有一个比较关键的配置:
net-conf.json: |
{
"Network": "192.16.0.0/16", //集群pod节点网络
"Backend": {
"Type": "vxlan" //flannel网络类型,可以vxlan/udp/host-gw等
}
}
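顺便一提,部署后如果想确认或调整这份配置,可以直接查看/编辑对应的ConfigMap(示意命令,名称以实际部署为准):
kubectl -n kube-system get configmap kube-flannel-cfg -o yaml
kubectl -n kube-system edit configmap kube-flannel-cfg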
配置文件里面的其它部分我们后面再做一个整体的介绍。
运行过后,集群各节点就会从NotReady变成Ready状态。
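可以用下面的命令确认安装结果(示意命令,DaemonSet名称、镜像以实际部署的yaml为准):
kubectl get nodes    # 各节点应为Ready
kubectl -n kube-system get pods -o wide | grep flannel    # 每个节点应有一个kube-flannel pod处于Running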
集群各节点上会看到cni0网桥,以及连接到cni0网桥的veth设备:
[root@k8s-new-master flannel]# ifconfig cni0
cni0: flags=4163 mtu 1450
inet 192.16.0.1 netmask 255.255.255.0 broadcast 0.0.0.0
inet6 fe80::8c45:9bff:feb9:8700 prefixlen 64 scopeid 0x20
ether 8e:45:9b:b9:87:00 txqueuelen 1000 (Ethernet)
RX packets 2699334 bytes 233169100 (222.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2753084 bytes 650775039 (620.6 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@k8s-new-master flannel]# brctl show
bridge name bridge id STP enabled interfaces
cni0 8000.1a64c8fcc7c5 no veth501950ba
veth9abcf99e
veth设备对:下面输出中 sh-4.2# 表示在容器内执行。容器内eth0名字后面的@if6,表示它与宿主机上编号为6的接口(veth501950ba)是一对直连;同理,另一个容器内eth0后面的@if7与宿主机上编号为7的接口(veth9abcf99e)直连。
[root@k8s-new-master flannel]# ip link
6: veth501950ba@if3: mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
link/ether ba:c0:8d:41:3f:30 brd ff:ff:ff:ff:ff:ff link-netnsid 0
7: veth9abcf99e@if3: mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
link/ether 62:c1:58:cb:5e:14 brd ff:ff:ff:ff:ff:ff link-netnsid 1
sh-4.2# ip addr
3: eth0@if6: mtu 1450 qdisc noqueue state UP group default
link/ether 76:41:a1:96:53:88 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.16.0.72/24 scope global eth0
valid_lft forever preferred_lft forever
sh-4.2# ip addr
3: eth0@if7: mtu 1450 qdisc noqueue state UP group default
link/ether 4a:aa:c6:b8:5a:12 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.16.0.73/24 scope global eth0
valid_lft forever preferred_lft forever
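如果想用命令进一步确认veth的配对关系,可以对比两端的接口索引(示意命令,接口名以实际环境为准):
ethtool -S veth501950ba | grep peer_ifindex    # 宿主机侧查看veth的对端索引,应等于容器内eth0的编号3
cat /sys/class/net/eth0/iflink    # 容器内查看eth0的对端索引,应等于宿主机上veth501950ba的编号6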
flanneld进程
这里指定--ip-masq表示访问外部网络所需的nat规则由flanneld进程创建,因此bridge插件那边要关闭nat规则的创建(对应下面生成的配置中ipMasq为false)。
[root@k8s-new-master flannel]# ps -aux |grep flanneld
root 19979 0.1 0.2 621916 19036 ? Ssl Jul04 2:41 /opt/bin/flanneld --ip-masq --kube-subnet-mgr
root 31718 0.0 0.0 112712 940 pts/1 S+ 22:58 0:00 grep --color=auto flanneld
看下flannel自动生成的完整插件配置:
cat /var/lib/cni/flannel/3153d1047e5ac34b276123db3b80eeed35320933778dfe5308ddfeaa84299c72
{
"cniVersion":"0.3.1",
"hairpinMode":true, #发夹模式,支持单个pod节点请求,最后负载均衡到本pod
"ipMasq":false, #关闭bridge生成访问外网的nat规则
"ipam":
{
"routes":[{"dst":"192.16.0.0/16"}],
"subnet":"192.16.0.0/24",
"type":"host-local" #ip分配管理插件类型:host-local
},
"isDefaultGateway":true,
"isGateway":true, #自动设置网关ip到网桥cni0上,自动在容器内部添加默认网关路由
"mtu":1450,
"name":"cbr0",
"type":"bridge" #cni插件类型bridge
}
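顺带一提,ipam使用host-local时,已分配的ip会以文件形式记录在宿主机上(默认目录按network名称组织,这里是cbr0),可以这样查看(示意命令):
ls /var/lib/cni/networks/cbr0/    # 每个文件名即一个已分配的pod ip,文件内容为对应容器/网络命名空间的标识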
接下来是使用vxlan作为backend(后端)时所创建的信息,大概包括4个核心部分:flannel.1(VTEP)虚拟设备、跨节点路由表、arp表项、fdb转发表项。
首先是flannel.1 (VTEP)虚拟设备,VTEP设备参数如下:
[root@k8s-new-master flannel]# ifconfig flannel.1
flannel.1: flags=4163 mtu 1450
inet 192.16.0.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::2077:b4ff:fee8:3e6f prefixlen 64 scopeid 0x20
ether 22:77:b4:e8:3e:6f txqueuelen 0 (Ethernet)
RX packets 467309 bytes 34387845 (32.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 436038 bytes 64505265 (61.5 MiB)
TX errors 0 dropped 24 overruns 0 carrier 0 collisions 0
[root@k8s-new-master flannel]# ip -d link show
4: flannel.1: mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether fa:8f:c5:04:ab:97 brd ff:ff:ff:ff:ff:ff promiscuity 0
vxlan id 1 local 192.168.122.14 dev ens3 srcport 0 0 dstport 8472 nolearning ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
接下来看下跨节点访问所需的路由表
这里先说下我们集群pod节点分配的ip网段:
master:192.16.0.0/24
node1:192.16.1.0/24
node2:192.16.2.0/24
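这些网段可以直接从各节点的spec.podCIDR中查到,例如(示意命令):
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'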
所以在master查看到,到192.16.1.0/24 及192.16.2.0/24两个网段需要走vxlan设备flannel.1。
这里查询结果里面的onlink标志,表示强制认为该网关是"在链路上"可达的(虽然本机并没有到它的链路层路由),否则linux无法添加网关不在本机网段内的路由。这样,跨节点容器间访问的数据包就会被交给flannel.1设备处理。
这样跨节点容器间访问时(192.16.0.1->192.16.1.1),数据首先在容器内走默认网关到cni0网桥,然后走路由到flannel.1设备,接着封装目的二层信息,这里目的mac应该选谁?
[root@k8s-new-master flannel]# route -n |grep flannel.1
192.16.1.0 192.16.1.0 255.255.255.0 UG 0 0 0 flannel.1
192.16.2.0 192.16.2.0 255.255.255.0 UG 0 0 0 flannel.1
[root@k8s-new-master flannel]# ip route show dev flannel.1
192.16.1.0/24 via 192.16.1.0 onlink
192.16.2.0/24 via 192.16.2.0 onlink
答案是填对端VTEP设备的mac地址,而由IP查询mac地址依赖的是arp表。所以flanneld进程会为每个加入集群的节点(的VTEP)添加一条arp表项,并且是permanent(永久)的。
[root@k8s-new-master flannel]# arp -an |grep flannel.1
? (192.16.1.0) at 2e:2a:a5:7c:e8:f2 [ether] PERM on flannel.1
? (192.16.2.0) at 7a:50:9c:c8:99:d7 [ether] PERM on flannel.1
[root@k8s-new-master flannel]# ip neig show dev flannel.1
192.16.1.0 lladdr 2e:2a:a5:7c:e8:f2 PERMANENT
192.16.2.0 lladdr 7a:50:9c:c8:99:d7 PERMANENT
我们知道,vxlan是将二层帧封装在udp里面的数据包。填充完二层信息后,如何知道这个udp包要发给谁?创建VTEP时我们指定了发送的源IP(宿主机ip)和端口信息,那么目的IP/端口显然就是对端宿主机的ip和端口。这个信息由flanneld进程静态地添加进转发表:bridge fdb里面存着到目的mac地址(这里是目的VTEP mac)所对应的目的IP(目的宿主机IP),注意这也是permanent(永久)的。
[root@k8s-new-master flannel]# bridge fdb |grep flan
2e:2a:a5:7c:e8:f2 dev flannel.1 dst 192.168.122.15 self permanent
7a:50:9c:c8:99:d7 dev flannel.1 dst 192.168.122.16 self permanent
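flannel的vxlan默认使用udp 8472端口(见上面ip -d link show输出中的dstport),跨节点容器互访时,可以在宿主机公网口上抓到vxlan封装后的报文(示意命令):
tcpdump -i ens3 -nnev udp port 8472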
接下来我们看下flannel生成的节点网段等信息(/var/run/flannel/subnet.env),以master为例,它与节点的PodCIDR一致:
[root@k8s-new-master ns_tools]# cat /var/run/flannel/subnet.env
FLANNEL_NETWORK=192.16.0.0/16
FLANNEL_SUBNET=192.16.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
[root@k8s-new-master ns_tools]# kubectl describe node k8s-new-master |grep CIDR
PodCIDR: 192.16.0.0/24
自动生成pod内容器访问外网所需的iptables规则:
[root@k8s-new-node2 ~]# iptables -S -t nat
-A POSTROUTING -s 192.16.0.0/16 -d 192.16.0.0/16 -j RETURN
-A POSTROUTING -s 192.16.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
-A POSTROUTING ! -s 192.16.0.0/16 -d 192.16.2.0/24 -j RETURN
-A POSTROUTING ! -s 192.16.0.0/16 -d 192.16.0.0/16 -j MASQUERADE
上述四条规则的作用分别是:集群内pod互访(源、目的都在192.16.0.0/16内)直接RETURN,不做SNAT;pod访问集群外地址(排除组播段224.0.0.0/4)做MASQUERADE;集群外地址访问本节点pod网段192.16.2.0/24时直接RETURN,不做SNAT;集群外地址访问集群其它pod网段时做MASQUERADE。
通过这么一个yaml文件,我们已经安装完了flannel,也熟悉了安装完成后会生成的各类规则。接下来我们通过三个实验,手工模拟flannel实现单节点容器间通信、容器访问外部网络、跨节点容器间通信的具体机制。
[root@k8s-new-master ~]# ip netns add net1
[root@k8s-new-master ~]# ip netns add net2
[root@k8s-new-master cni]# ip link add veth_test_1 type veth peer name veth_test_2
[root@k8s-new-master cni]# ifconfig veth_test_1
veth_test_1: flags=4098 mtu 1500
ether 92:0b:3c:57:44:91 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@k8s-new-master cni]# ifconfig veth_test_2
veth_test_2: flags=4098 mtu 1500
ether b2:9e:68:8a:5a:7a txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@k8s-new-master cni]# ip link set veth_test_1 netns net1
[root@k8s-new-master cni]# ip netns exec net1 ip link
1: lo: mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11: veth_test_1@if10: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 92:0b:3c:57:44:91 brd ff:ff:ff:ff:ff:ff link-netnsid 0
[root@k8s-new-master cni]# ip netns exec net1 ip link set veth_test_1 name eth0
[root@k8s-new-master cni]# ip netns exec net1 ip link
1: lo: mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11: eth0@if10: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 92:0b:3c:57:44:91 brd ff:ff:ff:ff:ff:ff link-netnsid 0
[root@k8s-new-master cni]# ip netns exec net1 ip addr add 192.168.88.1/24 dev eth0
[root@k8s-new-master cni]# ip netns exec net1 ip link set eth0 up
[root@k8s-new-master cni]# ip netns exec net1 ifconfig
eth0: flags=4099 mtu 1500
inet 192.168.88.1 netmask 255.255.255.0 broadcast 0.0.0.0
ether 92:0b:3c:57:44:91 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@k8s-new-master cni]# brctl addbr my_cni
[root@k8s-new-master cni]# brctl addif my_cni veth_test_2
[root@k8s-new-master cni]# brctl show my_cni
bridge name bridge id STP enabled interfaces
my_cni 8000.b29e688a5a7a no veth_test_2
[root@k8s-new-master cni]# ip link add veth_test_3 type veth peer name veth_test_4
[root@k8s-new-master cni]# ip link set veth_test_3 netns net2
[root@k8s-new-master cni]# ip netns exec net2 ip link set veth_test_3 name eth0
[root@k8s-new-master cni]# ip netns exec net2 ip addr add 192.168.88.2/24 dev eth0
[root@k8s-new-master cni]# ip netns exec net2 ip link set eth0 up
[root@k8s-new-master cni]# brctl show my_cni
bridge name bridge id STP enabled interfaces
my_cni 8000.563c09befbc8 no veth_test_2
veth_test_4
[root@k8s-new-master cni]# iptables -P FORWARD ACCEPT
[root@k8s-new-master cni]# ifconfig veth_test_2 up
[root@k8s-new-master cni]# ifconfig veth_test_4 up
[root@k8s-new-master cni]# ip netns exec net2 ping 192.168.88.1
PING 192.168.88.1 (192.168.88.1) 56(84) bytes of data.
64 bytes from 192.168.88.1: icmp_seq=1 ttl=64 time=0.282 ms
64 bytes from 192.168.88.1: icmp_seq=2 ttl=64 time=0.095 ms
[root@k8s-new-master cni]# ifconfig my_cni 192.168.88.10/24 up
[root@k8s-new-master cni]# ip netns exec net2 ping 192.168.88.10
PING 192.168.88.10 (192.168.88.10) 56(84) bytes of data.
64 bytes from 192.168.88.10: icmp_seq=1 ttl=64 time=0.161 ms
[root@k8s-new-master cni]# ip netns exec net1 ping 192.168.88.10
PING 192.168.88.10 (192.168.88.10) 56(84) bytes of data.
64 bytes from 192.168.88.10: icmp_seq=1 ttl=64 time=0.190 ms
[root@k8s-new-master cni]# bridge link
10: veth_test_2 state UP @(null): mtu 1500 master my_cni state forwarding priority 32 cost 2
13: veth_test_4 state UP @(null): mtu 1500 master my_cni state forwarding priority 32 cost 2
[root@k8s-new-master cni]# bridge fdb |grep my_cni
b2:9e:68:8a:5a:7a dev veth_test_2 vlan 1 master my_cni permanent # veth_test_2 mac地址
b2:9e:68:8a:5a:7a dev veth_test_2 master my_cni permanent
92:0b:3c:57:44:91 dev veth_test_2 master my_cni # veth_test_1 现在在net1隔离空间里面的eth0的mac地址, 非permanent,无数据时,会老化
56:3c:09:be:fb:c8 dev veth_test_4 vlan 1 master my_cni permanent # veth_test_4 mac地址
56:3c:09:be:fb:c8 dev veth_test_4 master my_cni permanent
e6:51:5f:c2:ad:4a dev veth_test_4 master my_cni # veth_test_3 现在在net2隔离空间里面的eth0的mac地址, 非permanent,无数据时,会老化
33:33:00:00:00:01 dev my_cni self permanent
01:00:5e:00:00:01 dev my_cni self permanent
33:33:ff:be:fb:c8 dev my_cni self permanent
[root@k8s-new-master cni]# arp -i my_cni
Address HWtype HWaddress Flags Mask Iface
192.168.88.2 ether e6:51:5f:c2:ad:4a C my_cni
192.168.88.1 ether 92:0b:3c:57:44:91 C my_cni
[root@k8s-new-master ~]# bridge monitor # 在老化之后,如果我们执行宿主节点访问隔离空间ip(192.168.88.2/192.168.88.1)就会触发网桥学习mac地址
e6:51:5f:c2:ad:4a dev veth_test_4 master my_cni # 学习到对应的mac地址与dev:veth_test_4(类似交换机的port)的关系
92:0b:3c:57:44:91 dev veth_test_2 master my_cni
[root@k8s-new-master ~]# bridge -s fdb |grep my_cni #查看收发包情况
92:0b:3c:57:44:91 dev veth_test_2 used 19/14 master my_cni
b2:9e:68:8a:5a:7a dev veth_test_2 vlan 1 used 5838/5838 master my_cni permanent
b2:9e:68:8a:5a:7a dev veth_test_2 used 5838/5838 master my_cni permanent
56:3c:09:be:fb:c8 dev veth_test_4 vlan 1 used 5484/5484 master my_cni permanent
56:3c:09:be:fb:c8 dev veth_test_4 used 5484/5484 master my_cni permanent
e6:51:5f:c2:ad:4a dev veth_test_4 used 22/17 master my_cni
[root@k8s-new-master cni]# ip netns exec net1 ping baidu.com
ping: baidu.com: Name or service not known
此时ping外网失败:一方面隔离空间内还没有默认路由,另一方面宿主机上还没有为192.168.88.0/24配置SNAT。如果仿照docker/bridge插件的做法,可以添加一条类似下面形式的MASQUERADE规则:
iptables -t nat -A POSTROUTING -s 192.168.88.0/24 ! -o my_cni -j MASQUERADE
这里我们按flannel的风格来配置:先给两个隔离空间设置默认网关,再添加RETURN+MASQUERADE两条规则:
[root@k8s-new-master cni]# ip netns exec net1 route add default gw 192.168.88.10
[root@k8s-new-master cni]# ip netns exec net2 route add default gw 192.168.88.10
[root@k8s-new-master cni]# iptables -t nat -A POSTROUTING -s 192.168.88.0/24 -d 192.168.88.0/24 -j RETURN
[root@k8s-new-master cni]# iptables -t nat -A POSTROUTING -s 192.168.88.0/24 ! -d 224.0.0.0/4 -j MASQUERADE
[root@k8s-new-master cni]# ip netns exec net1 ping baidu.com
PING baidu.com (220.181.38.148) 56(84) bytes of data.
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=1 ttl=45 time=39.0 ms
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=2 ttl=45 time=38.5 ms
[root@k8s-new-master cni]# ip netns exec net2 ping baidu.com
PING baidu.com (39.156.69.79) 56(84) bytes of data.
64 bytes from 39.156.69.79 (39.156.69.79): icmp_seq=1 ttl=44 time=41.7 ms
64 bytes from 39.156.69.79 (39.156.69.79): icmp_seq=2 ttl=44 time=41.7 ms
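如果想确认SNAT确实生效,可以看MASQUERADE规则的命中计数,或者直接看连接跟踪表(示意命令,conntrack工具需单独安装):
iptables -t nat -L POSTROUTING -n -v | grep 192.168.88    # 查看RETURN/MASQUERADE规则的包计数
conntrack -L -s 192.168.88.1    # 查看net1访问外网的连接,能看到SNAT后的源地址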
通过上面的实验,我们已经在k8s-new-master(192.168.122.14)上创建了一个网桥my_cni(192.168.88.10),以及两个隔离空间net1(内部eth0:192.168.88.1)、net2(内部eth0:192.168.88.2)。
为了测试跨节点通信,我们先在k8s-new-node1(192.168.122.15)上创建隔离空间net3(内部eth0:192.168.89.1),并创建相应的网桥my_cni1(192.168.89.10)作为网关。步骤同上,这里不再逐条注释。
[root@k8s-new-node1 ~]# ip netns add net3
[root@k8s-new-node1 ~]# ip link add veth_test_5 type veth peer name veth_test_6
[root@k8s-new-node1 ~]# ip link set veth_test_5 netns net3
[root@k8s-new-node1 ~]# ip netns exec net3 ip link set veth_test_5 name eth0
[root@k8s-new-node1 ~]# ip netns exec net3 ip addr add 192.168.89.1/24 dev eth0
[root@k8s-new-node1 ~]# ip netns exec net3 ip link set eth0 up
[root@k8s-new-node1 ~]# brctl addbr my_cni1
[root@k8s-new-node1 ~]# brctl addif my_cni1 veth_test_6
[root@k8s-new-node1 ~]# ifconfig veth_test_6 up
[root@k8s-new-node1 ~]# brctl show my_cni1
bridge name bridge id STP enabled interfaces
my_cni1 8000.0290ebc217bb no veth_test_6
[root@k8s-new-node1 ~]# ifconfig my_cni1 192.168.89.10/24 up
[root@k8s-new-node1 ~]# ip netns exec net3 ping 192.168.89.10
PING 192.168.89.10 (192.168.89.10) 56(84) bytes of data.
64 bytes from 192.168.89.10: icmp_seq=1 ttl=64 time=0.297 ms
[root@k8s-new-node1 ~]# ip netns exec net3 route add default gw 192.168.89.10
[root@k8s-new-node1 ~]# iptables -P FORWARD ACCEPT
[root@k8s-new-master ~]# ip link add my_vtep0 type vxlan id 200 dstport 4789 local 192.168.122.14 dev ens3 nolearning # 自动创建的mac地址a6:d3:23:dd:03:6f
[root@k8s-new-master ~]# ip link set my_vtep0 up
[root@k8s-new-master ~]# ip addr add 192.168.88.0/32 dev my_vtep0
[root@k8s-new-master ~]# ip route add 192.168.89.0/24 via 192.168.89.0 dev my_vtep0 onlink
[root@k8s-new-master ~]# ip neigh add 192.168.89.0 lladdr b2:07:fc:b6:82:a7 dev my_vtep0 # 这一步需要等node1创建完my_vtep1后再填入mac地址
[root@k8s-new-master ~]# bridge fdb append b2:07:fc:b6:82:a7 dev my_vtep0 dst 192.168.122.15 # 这一步需要等node1创建完my_vtep1后再填入mac地址
[root@k8s-new-master cni]# ip -d link show
18: my_vtep0: mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether a6:d3:23:dd:03:6f brd ff:ff:ff:ff:ff:ff promiscuity 0
vxlan id 200 local 192.168.122.14 dev ens3 srcport 0 0 dstport 4789 nolearning ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
部分参数说明:
第1条命令:
id: VNI标识是200
local: vxlan隧道使用的源ip
dstport: 指定目的端口为4789。
dev: 指定VTEP通过哪个物理device来通信,这里使用的是ens3。
之后主要是up设备, 设置ip地址,添加跨节点、跨网段路由,添加bridge转发表
第2/3条命令分别用于up 设备及设置设备ip
第4条命令指定到对端(node1)节点容器网络192.168.89.0/24使用的网关192.168.89.0,以及使用的设备my_vtep0
第5条命令添加一条arp表项,包含对端vtep的ip及mac
第6条命令添加一条到对端vtep mac地址的fdb转发表项,指明通过对方的宿主机ip 192.168.122.15转发
在k8s-new-node1创建vtep设备my_vtep1,步骤同上
[root@k8s-new-node1 ~]# ip link add my_vtep1 type vxlan id 200 dstport 4789 local 192.168.122.15 dev ens3 nolearning # 自动创建的mac地址b2:07:fc:b6:82:a7
[root@k8s-new-node1 ~]# ip link set my_vtep1 up
[root@k8s-new-node1 ~]# ip addr add 192.168.89.0/32 dev my_vtep1
[root@k8s-new-node1 ~]# ip route add 192.168.88.0/24 via 192.168.88.0 dev my_vtep1 onlink
[root@k8s-new-node1 ~]# ip neigh add 192.168.88.0 lladdr a6:d3:23:dd:03:6f dev my_vtep1 # 这一步需要等master创建完my_vtep0后再填入mac地址
[root@k8s-new-node1 ~]# bridge fdb append a6:d3:23:dd:03:6f dev my_vtep1 dst 192.168.122.14 # 这一步需要等master创建完my_vtep0后再填入mac地址
[root@k8s-new-node1 ~]# ip -d link show
16: my_vtep1: mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether b2:07:fc:b6:82:a7 brd ff:ff:ff:ff:ff:ff promiscuity 0
vxlan id 200 local 192.168.122.15 dev ens3 srcport 0 0 dstport 4789 nolearning ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
[root@k8s-new-master flannel]# ip netns exec net1 ping 192.168.89.1 # master上net1到node1的net3
PING 192.168.89.1 (192.168.89.1) 56(84) bytes of data.
64 bytes from 192.168.89.1: icmp_seq=1 ttl=62 time=1.18 ms
[root@k8s-new-node1 ~]# tcpdump -i ens3 -nnev port 4789 # master上net1到node1的net3,在node1上抓包
tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
18:24:03.775317 88:4f:d5:25:80:12 > 88:4f:d5:25:80:13, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 53734, offset 0, flags [none], proto UDP (17), length 134)
192.168.122.14.37379 > 192.168.122.15.4789: VXLAN, flags [I] (0x08), vni 200
a6:d3:23:dd:03:6f > b2:07:fc:b6:82:a7, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 752, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.88.0 > 192.168.89.1: ICMP echo request, id 29566, seq 3, length 64
18:24:03.775542 88:4f:d5:25:80:13 > 88:4f:d5:25:80:12, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 40978, offset 0, flags [none], proto UDP (17), length 134)
192.168.122.15.55264 > 192.168.122.14.4789: VXLAN, flags [I] (0x08), vni 200
b2:07:fc:b6:82:a7 > a6:d3:23:dd:03:6f, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 51787, offset 0, flags [none], proto ICMP (1), length 84)
192.168.89.1 > 192.168.88.0: ICMP echo reply, id 29566, seq 3, length 64
[root@k8s-new-node1 ~]# ip netns exec net3 ping 192.168.88.1 # node1的net3到master的net1
PING 192.168.88.1 (192.168.88.1) 56(84) bytes of data.
64 bytes from 192.168.88.1: icmp_seq=1 ttl=62 time=1.27 ms
[root@k8s-new-node1 ~]# ip netns exec net3 ping 192.168.88.2 # node1的net3到master的net2
PING 192.168.88.2 (192.168.88.2) 56(84) bytes of data.
64 bytes from 192.168.88.2: icmp_seq=1 ttl=62 time=1.08 ms
[root@k8s-new-master ~]# ping 192.168.89.1 # master到node1的net3
PING 192.168.89.1 (192.168.89.1) 56(84) bytes of data.
64 bytes from 192.168.89.1: icmp_seq=1 ttl=63 time=4.45 ms
[root@k8s-new-node1 ~]# ping 192.168.88.1 # node1到master的net1
PING 192.168.88.1 (192.168.88.1) 56(84) bytes of data.
64 bytes from 192.168.88.1: icmp_seq=1 ttl=63 time=0.969 ms
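实验结束后,可以按下面的方式清理本节手工创建的设备与规则(示意命令:删除netns会一并删除其中的veth端;网桥需先down再删):
# master节点
ip netns del net1
ip netns del net2
ip link del my_vtep0
ip link set my_cni down
brctl delbr my_cni
iptables -t nat -D POSTROUTING -s 192.168.88.0/24 -d 192.168.88.0/24 -j RETURN
iptables -t nat -D POSTROUTING -s 192.168.88.0/24 ! -d 224.0.0.0/4 -j MASQUERADE
# node1节点
ip netns del net3
ip link del my_vtep1
ip link set my_cni1 down
brctl delbr my_cni1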
flannel常用后端包括udp、vxlan、host-gw等。要对比这些后端的优缺点,需要简单了解一下它们的实现:它们都作用于跨节点容器间通信。
首先是udp封装:简单理解就是,将三层的ip报文封装在一个udp报文中。其中三层的两个ip分别在不同节点的容器上。数据流程:
节点1的pod A(容器)->cni0->flannel0(tun设备)->flanneld:8285->eth0(节点1的公网ip所在网卡)->internet
-> eth1(节点2的公网ip所在网卡)->flanneld:8285->flannel0(tun设备)->cni0->pod B(容器)
接下来是vxlan,这个方案上面已经介绍了,vxlan将二层报文封装在udp里面。二层报文的ip层的两个ip也是在不同节点的容器上。数据流程:
节点1的pod A(容器)->cni0->flannel.1(VTEP Virtual Tunnel End Point设备)->eth0(节点1的公网ip所在网卡)->internet
-> eth1(节点2的公网ip所在网卡)->flannel.1->cni0->pod B(容器)
最后是host-gw, 这个方案通过增加路由来将报文转发到对应的节点上。数据流程:
节点1的pod A(容器)->cni0->eth0(节点1的公网ip所在网卡)->internet
-> eth1(节点2的公网ip所在网卡)->cni0->pod B(容器)
综上,目前vxlan实际上已经完全可以替代udp模式,所以本文后面将只介绍vxlan及host-gw。实际使用中,我们既希望性能高,又希望在二层不可达时依然能通,所以vxlan模式提供了一个DirectRouting选项:开启后,如果节点间二层可达,就自动按host-gw方式直接路由,否则仍使用vxlan封装进行跨节点容器间通信。
所以介绍vxlan的实现,外加DirectRouting选项,就覆盖了flannel的常用后端。下面先从flanneld的主函数入手,看一下整体流程:
func main() {
//第一步是确认网卡
if opts.version { //输出版本信息
fmt.Fprintln(os.Stderr, version.Version)
os.Exit(0)
}
flagutil.SetFlagsFromEnv(flannelFlags, "FLANNELD")
// Validate flags
if opts.subnetLeaseRenewMargin >= 24*60 || opts.subnetLeaseRenewMargin <= 0 { //参数检查:子网续约提前量(margin)不能超过1天,单位是分钟
log.Error("Invalid subnet-lease-renew-margin option, out of acceptable range")
os.Exit(1)
}
// Work out which interface to use
var extIface *backend.ExternalInterface
var err error
// Check the default interface only if no interfaces are specified
if len(opts.iface) == 0 && len(opts.ifaceRegex) == 0 { //没有指定网卡,则自己查找
extIface, err = LookupExtIface(opts.publicIP, "") //查找网卡,想了解细节的可以查阅相关源码
if err != nil {
log.Error("Failed to find any valid interface to use: ", err)
os.Exit(1)
}
} else {
// Check explicitly specified interfaces //有指定网卡,使用对应名称的网卡
for _, iface := range opts.iface {
extIface, err = LookupExtIface(iface, "")
if err != nil {
log.Infof("Could not find valid interface matching %s: %s", iface, err)
}
if extIface != nil {
break
}
}
// Check interfaces that match any specified regexes
if extIface == nil { //用户通过正则表达式指定网卡名称
for _, ifaceRegex := range opts.ifaceRegex {
extIface, err = LookupExtIface("", ifaceRegex)
if err != nil {
log.Infof("Could not find valid interface matching %s: %s", ifaceRegex, err)
}
if extIface != nil {
break
}
}
}
if extIface == nil { //没有找到合适的网卡 直接退出
// Exit if any of the specified interfaces do not match
log.Error("Failed to find interface to use that matches the interfaces and/or regexes provided")
os.Exit(1)
}
}
//第二步创建子网管理对象, 用于持久化存储功能,存储管理方式目前支持: kubernetes api-server或者etcd
sm, err := newSubnetManager() //创建子网管理对象
if err != nil {
log.Error("Failed to create SubnetManager: ", err)
os.Exit(1)
}
log.Infof("Created subnet manager: %s", sm.Name())
// Register for SIGINT and SIGTERM
log.Info("Installing signal handlers")
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)
// This is the main context that everything should run in.
// All spawned goroutines should exit when cancel is called on this context.
// Go routines spawned from main.go coordinate using a WaitGroup. This provides a mechanism to allow the shutdownHandler goroutine
// to block until all the goroutines return . If those goroutines spawn other goroutines then they are responsible for
// blocking and returning only when cancel() is called.
ctx, cancel := context.WithCancel(context.Background()) //创建一个可以被cancel的ctx
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
shutdownHandler(ctx, sigs, cancel)
wg.Done()
}()
if opts.healthzPort > 0 {
// It's not super easy to shutdown the HTTP server so don't attempt to stop it cleanly
go mustRunHealthz()
}
//第三步: 创建网卡并且激活
// Fetch the network config (i.e. what backend to use etc..).
config, err := getConfig(ctx, sm)
if err == errCanceled {
wg.Wait()
os.Exit(0)
}
// Create a backend manager then use it to create the backend and register the network with it.
bm := backend.NewManager(ctx, sm, extIface) //创建后端的manager对象
be, err := bm.GetBackend(config.BackendType)
if err != nil {
log.Errorf("Error fetching backend: %s", err)
cancel()
wg.Wait()
os.Exit(1)
}
bn, err := be.RegisterNetwork(ctx, wg, config) //执行后端的注册网络函数,对应vxlan就是vxlan.go文件中RegisterNetwork,后面会详细介绍vxlan及host-gw后端
if err != nil {
log.Errorf("Error registering network: %s", err)
cancel()
wg.Wait()
os.Exit(1)
}
// Set up ipMasq if needed
if opts.ipMasq { //根据配置,判断是否需要开启ip-masquerade
if err = recycleIPTables(config.Network, bn.Lease()); err != nil {
log.Errorf("Failed to recycle IPTables rules, %v", err)
cancel()
wg.Wait()
os.Exit(1)
}
log.Infof("Setting up masking rules")
go network.SetupAndEnsureIPTables(network.MasqRules(config.Network, bn.Lease()), opts.iptablesResyncSeconds) //创建iptables策略(ip-masquerade)
}
// Always enables forwarding rules. This is needed for Docker versions >1.13 (https://docs.docker.com/engine/userguide/networking/default_network/container-communication/#container-communication-between-hosts)
// In Docker 1.12 and earlier, the default FORWARD chain policy was ACCEPT.
// In Docker 1.13 and later, Docker sets the default policy of the FORWARD chain to DROP.
if opts.iptablesForwardRules { //设置转发策略
log.Infof("Changing default FORWARD chain policy to ACCEPT")
go network.SetupAndEnsureIPTables(network.ForwardRules(config.Network.String()), opts.iptablesResyncSeconds)
}
if err := WriteSubnetFile(opts.subnetFile, config.Network, opts.ipMasq, bn); err != nil { //保存到配置文件中
// Continue, even though it failed.
log.Warningf("Failed to write subnet file: %s", err)
} else {
log.Infof("Wrote subnet file to %s", opts.subnetFile)
}
// Start "Running" the backend network. This will block until the context is done so run in another goroutine.
log.Info("Running backend.")
wg.Add(1)
go func() {
bn.Run(ctx) //如果是vxlan网络 执行的是vxlan_network.go中Run
wg.Done()
}()
daemon.SdNotify(false, "READY=1")
//第四步: 启动监控
// Kube subnet mgr doesn't lease the subnet for this node - it just uses the podCidr that's already assigned.
if !opts.kubeSubnetMgr {
//通过etcd管理网络 会进入此函数 此函数是一个死循环
err = MonitorLease(ctx, sm, bn, &wg) //监控该节点 主要用于节点租约过期后 能够快速获取新的租约
if err == errInterrupted {
// The lease was "revoked" - shut everything down
cancel()
}
}
log.Info("Waiting for all goroutines to exit")
// Block waiting for all the goroutines to finish.
wg.Wait()
log.Info("Exiting cleanly...")
os.Exit(0)
}
上面的流程可以简单理解为:第一步确认网卡,第二步创建子网管理对象,第三步创建网卡并激活,第四步启动监控。
其中第三步创建网卡并激活、第四步启动监控,会根据配置的后端不同而有不同的实现,下面会介绍。
这里我们主要介绍VTEP虚拟网卡的创建及监听子网添加删除(集群节点添加删除)事件
vxlan backend go语言实现参考链接
我们这里顺着主函数注册网络,简单介绍vxlan后端的实现。
注册网络:
输入参数: 上下文ctx, 子网信息config
输出参数: backend.network后端(vxlan)网络
func (be *VXLANBackend) RegisterNetwork(ctx context.Context, wg sync.WaitGroup, config *subnet.Config) (backend.Network, error) {
// Parse our configuration
cfg := struct {
VNI int
Port int
GBP bool
Learning bool
DirectRouting bool
}{
VNI: defaultVNI,
}
if len(config.Backend) > 0 { //解析配置
if err := json.Unmarshal(config.Backend, &cfg); err != nil {
return nil, fmt.Errorf("error decoding VXLAN backend config: %v", err)
}
}
log.Infof("VXLAN config: VNI=%d Port=%d GBP=%v Learning=%v DirectRouting=%v", cfg.VNI, cfg.Port, cfg.GBP, cfg.Learning, cfg.DirectRouting)
devAttrs := vxlanDeviceAttrs{ //VXLAN设备属性
vni: uint32(cfg.VNI),
name: fmt.Sprintf("flannel.%v", cfg.VNI),
vtepIndex: be.extIface.Iface.Index,
vtepAddr: be.extIface.IfaceAddr,
vtepPort: cfg.Port,
gbp: cfg.GBP,
learning: cfg.Learning,
}
dev, err := newVXLANDevice(&devAttrs) //创建VXLAN设备
if err != nil {
return nil, err
}
dev.directRouting = cfg.DirectRouting
subnetAttrs, err := newSubnetAttrs(be.extIface.ExtAddr, dev.MACAddr()) //创建子网属性
if err != nil {
return nil, err
}
lease, err := be.subnetMgr.AcquireLease(ctx, subnetAttrs) //获取租约获取租约
switch err {
case nil:
case context.Canceled, context.DeadlineExceeded:
return nil, err
default:
return nil, fmt.Errorf("failed to acquire lease: %v", err)
}
// Ensure that the device has a /32 address so that no broadcast routes are created.
// This IP is just used as a source address for host to workload traffic (so
// the return path for the traffic has an address on the flannel network to use as the destination)
if err := dev.Configure(ip.IP4Net{IP: lease.Subnet.IP, PrefixLen: 32}); err != nil { //设置ip并up起来
return nil, fmt.Errorf("failed to configure interface %s: %s", dev.link.Attrs().Name, err)
}
return newNetwork(be.subnetMgr, be.extIface, dev, ip.IP4Net{}, lease) //new Network结构体
}
简单来说就是创建VXLAN设备、获取租约信息、vxlan配置ip、返回对象。
接下来简单说下创建VXLAN设备、获取租约信息的流程。
首先是创建VXLAN设备:
输入参数: devAttrs 设备属性
输出参数:返回vxlanDevice对象
func newVXLANDevice(devAttrs *vxlanDeviceAttrs) (*vxlanDevice, error) {
link := &netlink.Vxlan{
LinkAttrs: netlink.LinkAttrs{
Name: devAttrs.name,
},
VxlanId: int(devAttrs.vni),
VtepDevIndex: devAttrs.vtepIndex,
SrcAddr: devAttrs.vtepAddr,
Port: devAttrs.vtepPort,
Learning: devAttrs.learning,
GBP: devAttrs.gbp,
}
link, err := ensureLink(link) //创建VXLAN设备
if err != nil {
return nil, err
}
_, _ = sysctl.Sysctl(fmt.Sprintf("net/ipv6/conf/%s/accept_ra", devAttrs.name), "0")
return &vxlanDevice{
link: link,
}, nil
}
func ensureLink(vxlan *netlink.Vxlan) (*netlink.Vxlan, error) {
err := netlink.LinkAdd(vxlan)
if err == syscall.EEXIST {
// it's ok if the device already exists as long as config is similar
log.V(1).Infof("VXLAN device already exists")
existing, err := netlink.LinkByName(vxlan.Name) //获取已有vxlan设备信息
if err != nil {
return nil, err
}
incompat := vxlanLinksIncompat(vxlan, existing) //比较新旧网卡信息
if incompat == "" {
log.V(1).Infof("Returning existing device")
return existing.(*netlink.Vxlan), nil
}
// delete existing
log.Warningf("%q already exists with incompatable configuration: %v; recreating device", vxlan.Name, incompat) //不相同则删除
if err = netlink.LinkDel(existing); err != nil {
return nil, fmt.Errorf("failed to delete interface: %v", err)
}
// create new
if err = netlink.LinkAdd(vxlan); err != nil { //创建新的vxlan设备
return nil, fmt.Errorf("failed to create vxlan interface: %v", err)
}
} else if err != nil {
return nil, err
}
ifindex := vxlan.Index
link, err := netlink.LinkByIndex(vxlan.Index) //根据索引进行查找设备
if err != nil {
return nil, fmt.Errorf("can't locate created vxlan device with index %v", ifindex)
}
var ok bool
if vxlan, ok = link.(*netlink.Vxlan); !ok {
return nil, fmt.Errorf("created vxlan device with index %v is not vxlan", ifindex)
}
return vxlan, nil
}
上面通过第三方库netlink.LinkAdd函数创建vxlan设备,感兴趣请查阅相关代码
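netlink.LinkAdd创建设备的效果,大致等价于手工执行下面这条命令(示意,参数对应上文ip -d link show flannel.1的输出):
ip link add flannel.1 type vxlan id 1 local 192.168.122.14 dev ens3 dstport 8472 nolearning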
然后是获取租约信息:
输入参数: 上下文ctx, 属性信息
输出参数: 租约信息
有两个地方有该代码:etcdv2/local_manager.go 与 kube/kube.go,这里贴etcdv2的代码
func (m *LocalManager) AcquireLease(ctx context.Context, attrs *LeaseAttrs) (*Lease, error) {
config, err := m.GetNetworkConfig(ctx) //获取配置信息, 向etcd查询相关信息
if err != nil {
return nil, err
}
for i := 0; i < raceRetries; i++ {
l, err := m.tryAcquireLease(ctx, config, attrs.PublicIP, attrs)
switch err {
case nil:
return l, nil
case errTryAgain:
continue
default:
return nil, err
}
}
return nil, errors.New("Max retries reached trying to acquire a subnet")
}
//输入参数:上下文ctx, config配置信息, 外部ip, 租约信息属性
//输出参数:返回租约对象
func (m *LocalManager) tryAcquireLease(ctx context.Context, config *Config, extIaddr ip.IP4, attrs *LeaseAttrs) (*Lease, error) {
leases, _, err := m.registry.getSubnets(ctx)
if err != nil {
return nil, err
}
// Try to reuse a subnet if there's one that matches our IP
if l := findLeaseByIP(leases, extIaddr); l != nil { //向etcd查找是否已经存在租约信息
// Make sure the existing subnet is still within the configured network
if isSubnetConfigCompat(config, l.Subnet) {
log.Infof("Found lease (%v) for current IP (%v), reusing", l.Subnet, extIaddr)
ttl := time.Duration(0)
if !l.Expiration.IsZero() {
// Not a reservation
ttl = subnetTTL
}
exp, err := m.registry.updateSubnet(ctx, l.Subnet, attrs, ttl, 0) //更新子网
if err != nil {
return nil, err
}
l.Attrs = *attrs
l.Expiration = exp
return l, nil
} else {
log.Infof("Found lease (%v) for current IP (%v) but not compatible with current config, deleting", l.Subnet, extIaddr)
if err := m.registry.deleteSubnet(ctx, l.Subnet); err != nil { //删除已有子网
return nil, err
}
}
}
// no existing match, check if there was a previous subnet to use
var sn ip.IP4Net
if !m.previousSubnet.Empty() {
// use previous subnet //逻辑同上,使用/run/flannel/subnet.env
if l := findLeaseBySubnet(leases, m.previousSubnet); l != nil {
// Make sure the existing subnet is still within the configured network
if isSubnetConfigCompat(config, l.Subnet) {
log.Infof("Found lease (%v) matching previously leased subnet, reusing", l.Subnet)
ttl := time.Duration(0)
if !l.Expiration.IsZero() {
// Not a reservation
ttl = subnetTTL
}
exp, err := m.registry.updateSubnet(ctx, l.Subnet, attrs, ttl, 0)
if err != nil {
return nil, err
}
l.Attrs = *attrs
l.Expiration = exp
return l, nil
} else {
log.Infof("Found lease (%v) matching previously leased subnet but not compatible with current config, deleting", l.Subnet)
if err := m.registry.deleteSubnet(ctx, l.Subnet); err != nil {
return nil, err
}
}
} else {
// Check if the previous subnet is a part of the network and of the right subnet length
if isSubnetConfigCompat(config, m.previousSubnet) {
log.Infof("Found previously leased subnet (%v), reusing", m.previousSubnet)
sn = m.previousSubnet
} else {
log.Errorf("Found previously leased subnet (%v) that is not compatible with the Etcd network config, ignoring", m.previousSubnet)
}
}
}
if sn.Empty() { //以上两种查询都没有满足
// no existing match, grab a new one
sn, err = m.allocateSubnet(config, leases) //创建一个新的子网
if err != nil {
return nil, err
}
}
exp, err := m.registry.createSubnet(ctx, sn, attrs, subnetTTL) //向etcd存储信息 存活时间是24h 这样etcd中就有subnets信息
switch {
case err == nil:
log.Infof("Allocated lease (%v) to current node (%v) ", sn, extIaddr)
return &Lease{
Subnet: sn,
Attrs: *attrs,
Expiration: exp,
}, nil
case isErrEtcdNodeExist(err):
return nil, errTryAgain
default:
return nil, err
}
}
简单来说,就是根据子网信息,向etcd或者/run/flannel/subnet.env查询是否存在,不存在则注册相应的子网信息到etcd。
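如果使用etcd作为subnet manager(即没有--kube-subnet-mgr),默认的存储前缀是/coreos.com/network,可以用下面的命令查看已注册的网络配置和子网租约(示意命令,etcd v2 API):
etcdctl get /coreos.com/network/config    # 整体网络配置,对应net-conf.json
etcdctl ls /coreos.com/network/subnets    # 各节点已注册的子网
etcdctl get /coreos.com/network/subnets/192.16.1.0-24    # 某个子网的租约内容,包含PublicIP与VtepMAC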
func (nw *network) Run(ctx context.Context) {
wg := sync.WaitGroup{}
log.V(0).Info("watching for new subnet leases")
events := make(chan []subnet.Event)
wg.Add(1)
go func() {
subnet.WatchLeases(ctx, nw.subnetMgr, nw.SubnetLease, events) //对所有租约进行监控 调用watch.go 中WatchLeases函数
// WatchLeases->watchSubnets
log.V(1).Info("WatchLeases exited") //获取etcd数据,阻塞方式
wg.Done()
}()
defer wg.Wait()
for { //死循环处理所有事件
select {
case evtBatch := <-events:
nw.handleSubnetEvents(evtBatch) //有事件发生执行处理函数
case <-ctx.Done():
return
}
}
}
func (nw *network) handleSubnetEvents(batch []subnet.Event) {
for _, event := range batch {
sn := event.Lease.Subnet
attrs := event.Lease.Attrs
if attrs.BackendType != "vxlan" {
log.Warningf("ignoring non-vxlan subnet(%s): type=%v", sn, attrs.BackendType)
continue
}
var vxlanAttrs vxlanLeaseAttrs
if err := json.Unmarshal(attrs.BackendData, &vxlanAttrs); err != nil { //解析json格式数据,从etcd获取的返回值
log.Error("error decoding subnet lease JSON: ", err)
continue
}
// This route is used when traffic should be vxlan encapsulated
vxlanRoute := netlink.Route{ //跨节点容器间通信:vxlan封装的路由表
LinkIndex: nw.dev.link.Attrs().Index,
Scope: netlink.SCOPE_UNIVERSE,
Dst: sn.ToIPNet(),
Gw: sn.IP.ToIP(),
}
vxlanRoute.SetFlag(syscall.RTNH_F_ONLINK)
// directRouting is where the remote host is on the same subnet so vxlan isn't required. //跨节点容器间通信(vxlan+directrouting=host-gw):如果有开启directrouting并且节点间同子网,则直接路由
directRoute := netlink.Route{
Dst: sn.ToIPNet(),
Gw: attrs.PublicIP.ToIP(),
}
var directRoutingOK = false
if nw.dev.directRouting {
if dr, err := ip.DirectRouting(attrs.PublicIP.ToIP()); err != nil {
log.Error(err)
} else {
directRoutingOK = dr
}
}
switch event.Type {
case subnet.EventAdded:
if directRoutingOK { //直接路由方式(vxlan+directrouting = host-gw): 只要添加路由表
log.V(2).Infof("Adding direct route to subnet: %s PublicIP: %s", sn, attrs.PublicIP)
if err := netlink.RouteReplace(&directRoute); err != nil {
log.Errorf("Error adding route to %v via %v: %v", sn, attrs.PublicIP, err)
continue
}
} else { //vxlan方式: 添加arp表,添加fdb转发表,更新路由表
log.V(2).Infof("adding subnet: %s PublicIP: %s VtepMAC: %s", sn, attrs.PublicIP, net.HardwareAddr(vxlanAttrs.VtepMAC))
if err := nw.dev.AddARP(neighbor{IP: sn.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil { //添加arp表
log.Error("AddARP failed: ", err)
continue
}
if err := nw.dev.AddFDB(neighbor{IP: attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil { //添加fdb转发表
log.Error("AddFDB failed: ", err)
// Try to clean up the ARP entry then continue
if err := nw.dev.DelARP(neighbor{IP: event.Lease.Subnet.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
log.Error("DelARP failed: ", err)
}
continue
}
// Set the route - the kernel would ARP for the Gw IP address if it hadn't already been set above so make sure
// this is done last.
if err := netlink.RouteReplace(&vxlanRoute); err != nil { //更新路由表
log.Errorf("failed to add vxlanRoute (%s -> %s): %v", vxlanRoute.Dst, vxlanRoute.Gw, err)
// Try to clean up both the ARP and FDB entries then continue
if err := nw.dev.DelARP(neighbor{IP: event.Lease.Subnet.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
log.Error("DelARP failed: ", err)
}
if err := nw.dev.DelFDB(neighbor{IP: event.Lease.Attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
log.Error("DelFDB failed: ", err)
}
continue
}
}
case subnet.EventRemoved:
if directRoutingOK {
log.V(2).Infof("Removing direct route to subnet: %s PublicIP: %s", sn, attrs.PublicIP)
if err := netlink.RouteDel(&directRoute); err != nil { //直接路由方式(vxlan+directrouting = host-gw): 只要删除路由表
log.Errorf("Error deleting route to %v via %v: %v", sn, attrs.PublicIP, err)
}
} else {
log.V(2).Infof("removing subnet: %s PublicIP: %s VtepMAC: %s", sn, attrs.PublicIP, net.HardwareAddr(vxlanAttrs.VtepMAC))
// Try to remove all entries - don't bail out if one of them fails. //vxlan方式: 删除arp表,删除fdb转发表,删除路由表
if err := nw.dev.DelARP(neighbor{IP: sn.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
log.Error("DelARP failed: ", err)
}
if err := nw.dev.DelFDB(neighbor{IP: attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
log.Error("DelFDB failed: ", err)
}
if err := netlink.RouteDel(&vxlanRoute); err != nil {
log.Errorf("failed to delete vxlanRoute (%s -> %s): %v", vxlanRoute.Dst, vxlanRoute.Gw, err)
}
}
default:
log.Error("internal error: unknown event type: ", int(event.Type))
}
}
}
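vxlan模式下,handleSubnetEvents处理EventAdded时做的三件事(AddARP/AddFDB/RouteReplace),大致等价于手工执行下面这组命令(示意,以新增子网192.16.1.0/24、对端VTEP mac 2e:2a:a5:7c:e8:f2、对端宿主机ip 192.168.122.15为例):
ip neigh replace 192.16.1.0 lladdr 2e:2a:a5:7c:e8:f2 dev flannel.1    # AddARP: 添加arp表项
bridge fdb append 2e:2a:a5:7c:e8:f2 dev flannel.1 dst 192.168.122.15    # AddFDB: 添加fdb转发表项
ip route replace 192.16.1.0/24 via 192.16.1.0 dev flannel.1 onlink    # RouteReplace: 添加onlink路由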
host-gw后端的原理与vxlan+DirectRouting(直接路由)一致,这里不再单独介绍。开启DirectRouting,只需要在net-conf.json的Backend中增加DirectRouting配置:
net-conf.json: |
{
"Network": "192.16.0.0/16",
"Backend": {
"Type": "vxlan",
"DirectRouting": true
}
}
如果集群节点间属于同网段网络,那么它们是二层可达,此时这两个节点上的跨节点容器网络间通信会自动采用host-gw的方式,也就是直接路由。
以下是master节点生成的路由,其它节点同理:
[root@k8s-new-master flannel]# ip route |grep "192.16\."
192.16.0.0/24 dev cni0 proto kernel scope link src 192.16.0.1
192.16.1.0/24 via 192.168.122.15 dev ens3
192.16.2.0/24 via 192.168.122.16 dev ens3
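可以用ip route get确认到对端pod网段的流量确实直接从物理网卡走,而没有经过flannel.1(示意命令):
ip route get 192.16.1.10    # 输出应显示 via 192.168.122.15 dev ens3,而不是dev flannel.1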
我们这里通过抓包看下报文的格式,以下是node1上的pod容器ip 192.16.1.58到node2上的pod容器192.16.2.19的抓包,抓包在node2的公网口上:
[root@k8s-new-node2 ~]# tcpdump -i ens3 host 192.16.1.58 -nnev
19:54:33.590095 88:4f:d5:25:80:13 > 88:4f:d5:25:80:14, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 38285, offset 0, flags [DF], proto ICMP (1), length 84)
192.16.1.58 > 192.16.2.19: ICMP echo request, id 28140, seq 1, length 64
19:54:33.590364 88:4f:d5:25:80:14 > 88:4f:d5:25:80:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 12163, offset 0, flags [none], proto ICMP (1), length 84)
192.16.2.19 > 192.16.1.58: ICMP echo reply, id 28140, seq 1, length 64
[root@k8s-new-node2 ~]# ifconfig ens3
ens3: flags=4163 mtu 1500
inet 192.168.122.16 netmask 255.255.255.0 broadcast 192.168.122.255
ether 88:4f:d5:25:80:14 txqueuelen 1000 (Ethernet)
[root@k8s-new-node1 ns_tools]# ifconfig ens3
ens3: flags=4163 mtu 1500
inet 192.168.122.15 netmask 255.255.255.0 broadcast 192.168.122.255
ether 88:4f:d5:25:80:13 txqueuelen 1000 (Ethernet)
显然报文只是替换了二层的源/目的mac地址,ip层的地址没有变动,也没有vxlan封装,说明没有走vxlan通道,而是走了host-gw式的直接路由。
后续会有完整的network policy介绍,这里链接待添加
后续如果介绍calico等插件,会再新增对比;flannel相对macvlan的优劣已经在本文开头对比说明。