CNI Network Plugin: flannel

  • CNI Network Plugin: flannel
    • flannel plugin components
    • flannel plugin installation walkthrough (vxlan)
    • Experiment: same-node container communication (bridge)
    • Experiment: container access to external networks
    • Experiment: cross-node container communication (vxlan)
    • Common flannel backends
      • udp
      • vxlan
      • host-gw
      • Pros and cons
    • flannel plugin code walkthrough
      • Main flow
      • vxlan backend
        • Creating the device
        • Starting the watch
      • host-gw implementation
    • vxlan DirectRouting configuration (same principle as host-gw)
      • Configuration
    • Using flannel with other plugins for Network Policy
    • flannel plugin summary and discussion

CNI Network Plugin: flannel

In the previous article, "CNI Plugins: A Minimal CNI Implementation with the macvlan Plugin", we introduced the macvlan plugin. From using and analyzing it, we learned:

  • In a multi-node cluster, macvlan needs a config file on every node, e.g. /etc/cni/net.d/10-maclannet.conf, and the nodes' subnets must not overlap.
  • When setting the default gateway, macvlan also requires checking whether the IP is already in use; the plugin does not set it for us, it has to be configured manually.
  • In a multi-node cluster, cross-node container communication with macvlan requires manually configured subnet routes.
  • For access to external networks, macvlan needs a manually configured gateway, and in some cases the traffic must additionally pass SNAT rules to reach the public network.
  • The container-side macvlan interface is built on a designated host master interface, and the container interface cannot communicate directly with that master interface.
  • macvlan is an underlay networking technology; its network stack is fairly independent, which brings certain limitations on the security side.

The flannel plugin solves the problems listed above:

  • Only the master needs to be configured; the subnets and gateways of all cluster nodes are set up automatically.
  • It automatically creates the cni0 bridge for connecting containers on the same node, assigns cni0 an IP, and uses that IP as the gateway for the node's containers.
  • Depending on the backend in use (udp/vxlan/host-gw), it automatically configures the corresponding cross-node routes and encapsulation rules. In udp mode it creates the flannel0 (tun) device and the flanneld process performs the outer UDP encapsulation (using the hosts' public IPs); in vxlan mode it creates the flannel.1 (VTEP) device, installs the corresponding fdb forwarding entries, and lets the kernel vxlan module do the outer encapsulation; in host-gw mode it automatically installs per-subnet routes and forwards purely by routing.
  • It automatically creates the NAT rules that containers need to reach external networks.
  • It creates veth pairs with one end inside the container and the other end attached to the cni0 bridge, so containers can communicate directly with the cni0 bridge.
  • flannel's udp/vxlan modes are overlay networking technologies with better guarantees on the security side, and flannel also offers the higher-performance host-gw option.

With all these advantages, how is flannel deployed and used, and how is it actually implemented?
That is what this article covers; here are the topics:

  • flannel plugin components
  • flannel plugin installation walkthrough (vxlan)
  • Experiment: same-node container communication (bridge)
  • Experiment: container access to external networks
  • Experiment: cross-node container communication (vxlan)
  • Common flannel backends
  • flannel plugin code walkthrough
    • Main flow, vxlan backend, host-gw backend, etc.
  • vxlan DirectRouting configuration (same principle as host-gw)
  • Using flannel with other plugins for Network Policy
  • flannel plugin summary and discussion

flannel plugin components

The flannel network plugin relies on the following technologies and resources:

  • the bridge CNI plugin for bridging, with the actual bridging done by the kernel bridge module
  • the kernel implementation of VXLAN (Virtual Extensible LAN) for vxlan mode, partly driven by flanneld
  • the flanneld process, which performs the outer UDP encapsulation that udp mode depends on
  • the kernel routing table, used by host-gw mode
  • kernel iptables rules, which implement the NAT needed to reach the public network
  • a DaemonSet, kube-flannel-ds-amd64, guaranteeing one replica on every node
  • a ConfigMap, kube-flannel-cfg, whose configuration is synced to every node
  • a ServiceAccount named flannel //RBAC
  • a ClusterRole named flannel //RBAC
  • a ClusterRoleBinding named flannel, granting the ClusterRole's permissions to the ServiceAccount //RBAC
  • a PodSecurityPolicy, psp.flannel.unprivileged, later assigned to the flannel ClusterRole //RBAC

The DaemonSet, ConfigMap, and RBAC topics above will each get their own chapter later; follow the links if interested (links to be added).

Most of these components can be read straight from the YAML manifest. Here is the manifest for the link above, with brief comments:

---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy				#pod security policy settings
metadata:
  name: psp.flannel.unprivileged
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default
    seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default
    apparmor.security.beta.kubernetes.io/allowedProfileNames: runtime/default
    apparmor.security.beta.kubernetes.io/defaultProfileName: runtime/default
spec:
  privileged: true
  volumes:
    - configMap
    - secret
    - emptyDir
    - hostPath
  allowedHostPaths:			#host directories allowed to be mounted
    - pathPrefix: "/etc/cni/net.d"
    - pathPrefix: "/etc/kube-flannel"
    - pathPrefix: "/run/flannel"
  readOnlyRootFilesystem: false
  # Users and groups
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  # Privilege Escalation
  allowPrivilegeEscalation: false
  defaultAllowPrivilegeEscalation: false
  # Capabilities
  allowedCapabilities: ['NET_ADMIN']
  defaultAddCapabilities: []
  requiredDropCapabilities: []
  # Host namespaces
  hostPID: false
  hostIPC: false
  hostNetwork: true
  hostPorts:
  - min: 0
    max: 65535
  # SELinux
  seLinux:
    # SELinux is unused in CaaSP
    rule: 'RunAsAny'
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: flannel						#the ClusterRole
rules:
  - apiGroups: ['extensions']
    resources: ['podsecuritypolicies']			#resource type this rule covers
    verbs: ['use']
    resourceNames: ['psp.flannel.unprivileged']		#resource name this rule covers
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes/status
    verbs:
      - patch
---
kind: ClusterRoleBinding				#binds the flannel ClusterRole's permissions to the flannel ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
subjects:
- kind: ServiceAccount
  name: flannel
  namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount			#creates the ServiceAccount
metadata:
  name: flannel
  namespace: kube-system
---
kind: ConfigMap		#key-value configuration data, mainly used to provide configuration to applications in containers
apiVersion: v1
metadata:
  name: kube-flannel-cfg			#this ConfigMap is mounted as a volume into the DaemonSet below
  namespace: kube-system
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",			#CNI plugin type
          "delegate": {					#delegation; the bridge plugin is actually invoked
            "hairpinMode": true,		#hairpin mode, so a pod reaching a cluster service can be load-balanced back to itself
            "isDefaultGateway": true	#give cni0 a gateway IP and use it as the pods' default gateway, as described for the bridge plugin
          }
        },
        {
          "type": "portmap",		#chained plugin providing port-mapping / NAT-like functionality
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "192.16.0.0/16",		#the cluster-wide pod network
      "Backend": {
        "Type": "vxlan"			#backend type: vxlan here; udp/host-gw etc. are also possible
      }
    }
---
apiVersion: apps/v1
kind: DaemonSet					#the DaemonSet guarantees one replica on every node
metadata:
  name: kube-flannel-ds-amd64
  namespace: kube-system
  labels:
    tier: node
    app: flannel
spec:
  selector:
    matchLabels:
      app: flannel
  template:
    metadata:
      labels:
        tier: node
        app: flannel
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - linux
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - amd64
      hostNetwork: true
      tolerations:
      - operator: Exists
        effect: NoSchedule
      serviceAccountName: flannel
      initContainers:
      - name: install-cni
        image: quay.io/coreos/flannel:v0.12.0-amd64			#flannel image version in use
        command:
        - cp
        args:
        - -f
        - /etc/kube-flannel/cni-conf.json
        - /etc/cni/net.d/10-flannel.conflist		#CNI config file installed for the node
        volumeMounts:
        - name: cni
          mountPath: /etc/cni/net.d
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      containers:
      - name: kube-flannel
        image: quay.io/coreos/flannel:v0.12.0-amd64
        command:
        - /opt/bin/flanneld		#the flanneld binary
        args:
        - --ip-masq			#traffic leaving the pod network is SNATed
        - --kube-subnet-mgr		#use the kube subnet manager (based on the node PodCIDR) instead of the etcd subnet manager
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
          limits:
            cpu: "100m"
            memory: "50Mi"
        securityContext:
          privileged: true
          capabilities:
            add: ["NET_ADMIN"]
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        volumeMounts:
        - name: run
          mountPath: /run/flannel
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      volumes:
        - name: run
          hostPath:
            path: /run/flannel				#runtime directory
        - name: cni
          hostPath:
            path: /etc/cni/net.d			#CNI plugin config directory
        - name: flannel-cfg
          configMap:
            name: kube-flannel-cfg			#the ConfigMap defined above

flannel plugin installation walkthrough (vxlan)

Since we previously installed macvlan and some containers were already attached to the macvlan-created network, the network has to be reset before using the flannel plugin.
The reset involves quite a few commands; a dedicated install/reset chapter will cover it later. If you installed macvlan as described in the previous chapter, you can reset it this way: link

Compared with the macvlan plugin, installing flannel means the configuration is written directly into the YAML manifest. A typical flannel YAML is provided here; it is the same manifest shown in the previous section.

After downloading it, just run:

kubectl apply -f kube-flannel.yml

The key piece of configuration in that file is:

  net-conf.json: |
    {
      "Network": "192.16.0.0/16",	//the cluster pod network
      "Backend": {
        "Type": "vxlan"				//flannel backend type: vxlan/udp/host-gw etc.
      }
    }

The remaining parts of the manifest will be discussed as a whole later on.

Once it has been applied, the cluster nodes go from NotReady to Ready.

On every node you can now see the cni0 bridge and the veth devices attached to it:

[root@k8s-new-master flannel]# ifconfig cni0
cni0: flags=4163  mtu 1450
        inet 192.16.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::8c45:9bff:feb9:8700  prefixlen 64  scopeid 0x20
        ether 8e:45:9b:b9:87:00  txqueuelen 1000  (Ethernet)
        RX packets 2699334  bytes 233169100 (222.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2753084  bytes 650775039 (620.6 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        
[root@k8s-new-master flannel]# brctl  show
bridge name	bridge id		STP enabled	interfaces
cni0		8000.1a64c8fcc7c5	no		veth501950ba
										veth9abcf99e

The veth pairs: the sh-4.2# prompt means we are inside a container. The @6 after eth0 in the first container indicates it is directly paired with host interface index 6; likewise the @7 after eth0 in the other container pairs with host interface index 7.


[root@k8s-new-master flannel]# ip link
6: veth501950ba@if3:  mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default 
    link/ether ba:c0:8d:41:3f:30 brd ff:ff:ff:ff:ff:ff link-netnsid 0
7: veth9abcf99e@if3:  mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default 
    link/ether 62:c1:58:cb:5e:14 brd ff:ff:ff:ff:ff:ff link-netnsid 1

sh-4.2# ip addr
3: eth0@if6:  mtu 1450 qdisc noqueue state UP group default 
    link/ether 76:41:a1:96:53:88 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.16.0.72/24 scope global eth0
       valid_lft forever preferred_lft forever

sh-4.2# ip addr
3: eth0@if7:  mtu 1450 qdisc noqueue state UP group default 
    link/ether 4a:aa:c6:b8:5a:12 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.16.0.73/24 scope global eth0
       valid_lft forever preferred_lft forever

The flanneld process.
The --ip-masq flag here means the NAT rules needed to reach external networks are created by the flanneld process, and NAT rule creation on the bridge plugin side is turned off.

[root@k8s-new-master flannel]# ps -aux |grep flanneld
root     19979  0.1  0.2 621916 19036 ?        Ssl  Jul04   2:41 /opt/bin/flanneld --ip-masq --kube-subnet-mgr
root     31718  0.0  0.0 112712   940 pts/1    S+   22:58   0:00 grep --color=auto flanneld

Here is the complete delegated plugin configuration that flannel generates automatically:

cat /var/lib/cni/flannel/3153d1047e5ac34b276123db3b80eeed35320933778dfe5308ddfeaa84299c72
{
	"cniVersion":"0.3.1",
	"hairpinMode":true,				#hairpin mode: a pod's request to a cluster service can be load-balanced back to the same pod
	"ipMasq":false,					#the bridge plugin does not create the external-access NAT rules
	"ipam":
	{
		"routes":[{"dst":"192.16.0.0/16"}],
		"subnet":"192.16.0.0/24",
		"type":"host-local"			#IPAM plugin type: host-local
	},
	"isDefaultGateway":true,
	"isGateway":true,			#put the gateway IP on the cni0 bridge and add a default route inside the container
	"mtu":1450,
	"name":"cbr0",
	"type":"bridge"				#CNI plugin type: bridge
}

Next comes what is created when vxlan is used as the backend. There are roughly four core pieces:

  • the flannel.1 Virtual Tunnel End Point (VTEP) device
  • routes for cross-node traffic
  • ARP (neighbor) entries for cross-node traffic
  • fdb (forwarding database) entries for cross-node traffic

First, the flannel.1 (VTEP) virtual device; its parameters are:

  • VNI: 1
  • learning disabled, it does not auto-learn other VTEPs' MACs: nolearning
  • local IP used for traffic sent out through the vxlan tunnel: local 192.168.122.14
  • tunnel port: 8472
[root@k8s-new-master flannel]# ifconfig flannel.1
flannel.1: flags=4163  mtu 1450
        inet 192.16.0.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::2077:b4ff:fee8:3e6f  prefixlen 64  scopeid 0x20
        ether 22:77:b4:e8:3e:6f  txqueuelen 0  (Ethernet)
        RX packets 467309  bytes 34387845 (32.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 436038  bytes 64505265 (61.5 MiB)
        TX errors 0  dropped 24 overruns 0  carrier 0  collisions 0
        
[root@k8s-new-master flannel]# ip -d link show
4: flannel.1:  mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/ether fa:8f:c5:04:ab:97 brd ff:ff:ff:ff:ff:ff promiscuity 0 
    vxlan id 1 local 192.168.122.14 dev ens3 srcport 0 0 dstport 8472 nolearning ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

Next, the routes needed for cross-node traffic.
First, the pod subnets assigned to our cluster nodes:
master: 192.16.0.0/24
node1: 192.16.1.0/24
node2: 192.16.2.0/24
So on the master, traffic to the 192.16.1.0/24 and 192.16.2.0/24 subnets must go through the vxlan device flannel.1.

Note the onlink flag in the output: onlink forces the kernel to treat the gateway as "on link" even though there is no link-layer route to it; without it Linux would refuse to add a route whose gateway is in a different subnet. This way packets for containers on other nodes are handed to the flannel.1 device.

So for cross-node container traffic (192.16.0.1 -> 192.16.1.1), the packet first leaves the container via its default gateway to the cni0 bridge, is then routed to the flannel.1 device, and the inner layer-2 header has to be filled in. Which destination MAC should be used?

[root@k8s-new-master flannel]# route -n |grep flannel.1
192.16.1.0      192.16.1.0      255.255.255.0   UG    0      0        0 flannel.1
192.16.2.0      192.16.2.0      255.255.255.0   UG    0      0        0 flannel.1

[root@k8s-new-master flannel]# ip route show dev flannel.1
192.16.1.0/24 via 192.16.1.0 onlink 
192.16.2.0/24 via 192.16.2.0 onlink

The answer is the MAC address of the remote VTEP device, and resolving a MAC from an IP relies on the ARP table. So the flanneld process adds a permanent ARP entry for the VTEP of every node that joins the cluster.

[root@k8s-new-master flannel]# arp -an |grep flannel.1
? (192.16.1.0) at 2e:2a:a5:7c:e8:f2 [ether] PERM on flannel.1
? (192.16.2.0) at 7a:50:9c:c8:99:d7 [ether] PERM on flannel.1

[root@k8s-new-master flannel]# ip neig show dev flannel.1
192.16.1.0 lladdr 2e:2a:a5:7c:e8:f2 PERMANENT
192.16.2.0 lladdr 7a:50:9c:c8:99:d7 PERMANENT

We know vxlan encapsulates a layer-2 frame inside a UDP packet. Once the inner layer-2 header is filled in, how do we know which host the packet should be sent to? When the VTEP was created we specified the source IP (the host IP) and port; the destination IP and port are obviously those of the remote host. This mapping is added statically to the forwarding database by the flanneld process: bridge fdb stores, for each destination MAC (here the remote VTEP MAC), the destination (remote host) IP to use. Note that these entries are also permanent.

[root@k8s-new-master flannel]# bridge fdb |grep flan
2e:2a:a5:7c:e8:f2 dev flannel.1 dst 192.168.122.15 self permanent
7a:50:9c:c8:99:d7 dev flannel.1 dst 192.168.122.16 self permanent

Next, the per-node subnet information that flannel generates. Taking the master as an example, it matches the node's PodCIDR:

[root@k8s-new-master ns_tools]# cat /var/run/flannel/subnet.env 
FLANNEL_NETWORK=192.16.0.0/16
FLANNEL_SUBNET=192.16.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

[root@k8s-new-master ns_tools]# kubectl describe node k8s-new-master |grep CIDR
PodCIDR:                     192.16.0.0/24

The iptables rules that pods need to reach external networks are generated automatically:

[root@k8s-new-node2 ~]# iptables -S -t nat
-A POSTROUTING -s 192.16.0.0/16 -d 192.16.0.0/16 -j RETURN
-A POSTROUTING -s 192.16.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
-A POSTROUTING ! -s 192.16.0.0/16 -d 192.16.2.0/24 -j RETURN
-A POSTROUTING ! -s 192.16.0.0/16 -d 192.16.0.0/16 -j MASQUERADE

The four rules above do the following:

  • pod-to-pod traffic inside the cluster is not NATed
  • traffic from cluster pods to the outside (non-multicast, not to cluster pods) is SNATed on the way out
  • traffic not from cluster pods, destined to this node's pod subnet, is not NATed
  • traffic not from cluster pods, destined to cluster pods on other nodes, is SNATed
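If you want to confirm which of these rules a given flow actually matches, the per-rule packet counters can be inspected (a quick sanity check; counter values will of course differ per environment):

[root@k8s-new-node2 ~]# iptables -t nat -L POSTROUTING -n -v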

With this single YAML file we have installed flannel and become familiar with the specific rules it generates. Next, three experiments look in detail at how flannel implements same-node container communication, container access to external networks, and cross-node container communication.

Experiment: same-node container communication (bridge)

  • Pick the parent interface; in our environment it is ens3
  • Create two network namespaces
[root@k8s-new-master ~]# ip netns add net1
[root@k8s-new-master ~]# ip netns add net2
  • Create a veth pair, i.e. a pair of virtual interfaces
[root@k8s-new-master cni]# ip link add veth_test_1 type veth peer name veth_test_2
[root@k8s-new-master cni]# ifconfig veth_test_1
veth_test_1: flags=4098  mtu 1500
        ether 92:0b:3c:57:44:91  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@k8s-new-master cni]# ifconfig veth_test_2
veth_test_2: flags=4098  mtu 1500
        ether b2:9e:68:8a:5a:7a  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
  • Move veth_test_1 into namespace net1, rename it to eth0, and assign it 192.168.88.1/24
[root@k8s-new-master cni]# ip link set veth_test_1 netns net1

[root@k8s-new-master cni]# ip netns exec net1 ip link
1: lo:  mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11: veth_test_1@if10:  mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 92:0b:3c:57:44:91 brd ff:ff:ff:ff:ff:ff link-netnsid 0

[root@k8s-new-master cni]# ip netns exec net1 ip link set veth_test_1 name eth0

[root@k8s-new-master cni]# ip netns exec net1 ip link
1: lo:  mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11: eth0@if10:  mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 92:0b:3c:57:44:91 brd ff:ff:ff:ff:ff:ff link-netnsid 0

[root@k8s-new-master cni]# ip netns exec net1 ip addr add 192.168.88.1/24 dev eth0
[root@k8s-new-master cni]# ip netns exec net1 ip link set eth0 up

[root@k8s-new-master cni]# ip netns exec net1 ifconfig
eth0: flags=4099  mtu 1500
        inet 192.168.88.1  netmask 255.255.255.0  broadcast 0.0.0.0
        ether 92:0b:3c:57:44:91  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
  • Create the my_cni bridge and attach the other end, veth_test_2, to it
[root@k8s-new-master cni]# brctl addbr my_cni
[root@k8s-new-master cni]# brctl addif my_cni veth_test_2

[root@k8s-new-master cni]# brctl  show my_cni
bridge name	bridge id		STP enabled	interfaces
my_cni		8000.b29e688a5a7a	no		veth_test_2
  • Create another veth pair in the same way
[root@k8s-new-master cni]# ip link add veth_test_3 type veth peer name veth_test_4
[root@k8s-new-master cni]# ip link set veth_test_3 netns net2
[root@k8s-new-master cni]# ip netns exec net2 ip link set veth_test_3 name eth0
[root@k8s-new-master cni]# ip netns exec net2 ip addr add 192.168.88.2/24 dev eth0
[root@k8s-new-master cni]# ip netns exec net2 ip link set eth0 up

[root@k8s-new-master cni]# brctl  show my_cni
bridge name	bridge id		STP enabled	interfaces
my_cni		8000.563c09befbc8	no		veth_test_2
										veth_test_4
  • Test connectivity
[root@k8s-new-master cni]# iptables -P FORWARD ACCEPT
[root@k8s-new-master cni]# ifconfig veth_test_2 up
[root@k8s-new-master cni]# ifconfig veth_test_4 up

[root@k8s-new-master cni]# ip netns exec net2 ping 192.168.88.1
PING 192.168.88.1 (192.168.88.1) 56(84) bytes of data.
64 bytes from 192.168.88.1: icmp_seq=1 ttl=64 time=0.282 ms
64 bytes from 192.168.88.1: icmp_seq=2 ttl=64 time=0.095 ms
  • Give the bridge an IP and test connectivity between the namespaces and the bridge
[root@k8s-new-master cni]# ifconfig my_cni 192.168.88.10/24 up
[root@k8s-new-master cni]# ip netns exec net2 ping 192.168.88.10
PING 192.168.88.10 (192.168.88.10) 56(84) bytes of data.
64 bytes from 192.168.88.10: icmp_seq=1 ttl=64 time=0.161 ms
	
[root@k8s-new-master cni]# ip netns exec net1 ping 192.168.88.10
PING 192.168.88.10 (192.168.88.10) 56(84) bytes of data.
64 bytes from 192.168.88.10: icmp_seq=1 ttl=64 time=0.190 ms
  • Bridge ports, forwarding table, and ARP table
[root@k8s-new-master cni]# bridge link
10: veth_test_2 state UP @(null):  mtu 1500 master my_cni state forwarding priority 32 cost 2 
13: veth_test_4 state UP @(null):  mtu 1500 master my_cni state forwarding priority 32 cost 2 
	
[root@k8s-new-master cni]# bridge fdb |grep my_cni
b2:9e:68:8a:5a:7a dev veth_test_2 vlan 1 master my_cni permanent   # MAC of veth_test_2
b2:9e:68:8a:5a:7a dev veth_test_2 master my_cni permanent
92:0b:3c:57:44:91 dev veth_test_2 master my_cni 		   # MAC of veth_test_1, now eth0 inside net1; not permanent, ages out when idle
56:3c:09:be:fb:c8 dev veth_test_4 vlan 1 master my_cni permanent   # MAC of veth_test_4
56:3c:09:be:fb:c8 dev veth_test_4 master my_cni permanent
e6:51:5f:c2:ad:4a dev veth_test_4 master my_cni 		   # MAC of veth_test_3, now eth0 inside net2; not permanent, ages out when idle
33:33:00:00:00:01 dev my_cni self permanent
01:00:5e:00:00:01 dev my_cni self permanent
33:33:ff:be:fb:c8 dev my_cni self permanent

[root@k8s-new-master cni]# arp -i my_cni
Address                  HWtype  HWaddress           Flags Mask            Iface
192.168.88.2             ether   e6:51:5f:c2:ad:4a   C                     my_cni
192.168.88.1             ether   92:0b:3c:57:44:91   C                     my_cni

[root@k8s-new-master ~]# bridge monitor				  # after the entries age out, accessing the namespace IPs (192.168.88.2/192.168.88.1) from the host triggers the bridge to learn the MACs again
e6:51:5f:c2:ad:4a dev veth_test_4 master my_cni 		  # learns the mapping between the MAC and dev veth_test_4 (like a switch port)
92:0b:3c:57:44:91 dev veth_test_2 master my_cni

[root@k8s-new-master ~]# bridge -s fdb |grep my_cni			# check usage/aging statistics
92:0b:3c:57:44:91 dev veth_test_2 used 19/14 master my_cni 
b2:9e:68:8a:5a:7a dev veth_test_2 vlan 1 used 5838/5838 master my_cni permanent
b2:9e:68:8a:5a:7a dev veth_test_2 used 5838/5838 master my_cni permanent
56:3c:09:be:fb:c8 dev veth_test_4 vlan 1 used 5484/5484 master my_cni permanent
56:3c:09:be:fb:c8 dev veth_test_4 used 5484/5484 master my_cni permanent
e6:51:5f:c2:ad:4a dev veth_test_4 used 22/17 master my_cni 

Experiment: container access to external networks

  • Building on the previous experiment, traffic to external networks needs SNAT; otherwise replies to the private IPs can never get back
[root@k8s-new-master cni]# ip netns exec net1 ping baidu.com
ping: baidu.com: Name or service not known
  • External access is enabled by adding a NAT rule for the 192.168.88.0/24 subnet; with macvlan earlier, a command like the following could be used:
iptables -t nat -A POSTROUTING -s 192.168.8.0/24 ! -o cni0 -j MASQUERADE
  • In this experiment we instead mirror the iptables rules flannel generates in vxlan mode:
[root@k8s-new-master cni]# ip netns exec net1 route add default gw 192.168.88.10
[root@k8s-new-master cni]# ip netns exec net2 route add default gw 192.168.88.10
[root@k8s-new-master cni]# iptables -t nat -A POSTROUTING -s 192.168.88.0/24 -d 192.168.88.0/24 -j RETURN
[root@k8s-new-master cni]# iptables -t nat -A POSTROUTING -s 192.168.88.0/24 ! -d 224.0.0.0/4 -j MASQUERADE
  • Test external connectivity
[root@k8s-new-master cni]# ip netns exec net1 ping baidu.com
PING baidu.com (220.181.38.148) 56(84) bytes of data.
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=1 ttl=45 time=39.0 ms
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=2 ttl=45 time=38.5 ms
	
[root@k8s-new-master cni]# ip netns exec net2 ping baidu.com
PING baidu.com (39.156.69.79) 56(84) bytes of data.
64 bytes from 39.156.69.79 (39.156.69.79): icmp_seq=1 ttl=44 time=41.7 ms
64 bytes from 39.156.69.79 (39.156.69.79): icmp_seq=2 ttl=44 time=41.7 ms

Experiment: cross-node container communication (vxlan)

In the experiments above we already created, on k8s-new-master (192.168.122.14), a bridge my_cni (192.168.88.10) and two namespaces: net1 (eth0: 192.168.88.1) and net2 (eth0: 192.168.88.2).

To test cross-node communication, we first create on k8s-new-node1 (192.168.122.15) the namespace net3 (eth0: 192.168.89.1) and the corresponding gateway bridge my_cni1 (192.168.89.10). The steps are the same as above and are not annotated here.

[root@k8s-new-node1 ~]# ip netns add net3
[root@k8s-new-node1 ~]# ip link add veth_test_5 type veth peer name veth_test_6
[root@k8s-new-node1 ~]# ip link set veth_test_5 netns net3
[root@k8s-new-node1 ~]# ip netns exec net3 ip link set veth_test_5 name eth0
[root@k8s-new-node1 ~]# ip netns exec net3 ip addr add 192.168.89.1/24 dev eth0
[root@k8s-new-node1 ~]# ip netns exec net3 ip link set eth0 up
[root@k8s-new-node1 ~]# brctl addbr my_cni1
[root@k8s-new-node1 ~]# brctl addif my_cni1 veth_test_6
[root@k8s-new-node1 ~]# ifconfig veth_test_6 up

[root@k8s-new-node1 ~]# brctl  show my_cni1
bridge name	bridge id		STP enabled	interfaces
my_cni1		8000.0290ebc217bb	no		veth_test_6

[root@k8s-new-node1 ~]# ifconfig my_cni1 192.168.89.10/24 up

[root@k8s-new-node1 ~]# ip netns exec net3 ping 192.168.89.10
PING 192.168.89.10 (192.168.89.10) 56(84) bytes of data.
64 bytes from 192.168.89.10: icmp_seq=1 ttl=64 time=0.297 ms

[root@k8s-new-node1 ~]# ip netns exec net3 route add default gw 192.168.89.10
[root@k8s-new-node1 ~]# iptables -P FORWARD ACCEPT
  • Next, modeled on the flannel.1 device flannel creates, we create the VTEP device my_vtep0 on k8s-new-master
[root@k8s-new-master ~]# ip link add my_vtep0 type vxlan id 200 dstport 4789 local 192.168.122.14 dev ens3 nolearning # auto-assigned MAC a6:d3:23:dd:03:6f
[root@k8s-new-master ~]# ip link set my_vtep0 up
[root@k8s-new-master ~]# ip addr add 192.168.88.0/32 dev my_vtep0
[root@k8s-new-master ~]# ip route add 192.168.89.0/24 via 192.168.89.0 dev my_vtep0 onlink
[root@k8s-new-master ~]# ip neigh add 192.168.89.0 lladdr b2:07:fc:b6:82:a7 dev my_vtep0			     # do this after node1 has created my_vtep1, then fill in its MAC
[root@k8s-new-master ~]# bridge fdb append b2:07:fc:b6:82:a7 dev my_vtep0 dst 192.168.122.15			     # do this after node1 has created my_vtep1, then fill in its MAC

[root@k8s-new-master cni]# ip -d link show
18: my_vtep0:  mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether a6:d3:23:dd:03:6f brd ff:ff:ff:ff:ff:ff promiscuity 0 
    vxlan id 200 local 192.168.122.14 dev ens3 srcport 0 0 dstport 4789 nolearning ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

Parameter notes:

  • Command 1:
    id: the VNI is 200
    local: the source IP used for the vxlan tunnel
    dstport: the destination port, 4789
    dev: the physical device the VTEP communicates through, here ens3
    The remaining commands bring the device up, assign its IP, add the cross-node (cross-subnet) route, and add the bridge forwarding entry

  • Commands 2 and 3 bring the device up and assign its IP

  • Command 4 routes the remote (node1) container network 192.168.89.0/24 via gateway 192.168.89.0 through device my_vtep0

  • Command 5 adds an ARP entry containing the remote VTEP's IP and MAC

  • Command 6 adds a forwarding (fdb) entry for the remote VTEP's MAC, reachable via the remote public IP 192.168.122.15

  • Create the VTEP device my_vtep1 on k8s-new-node1; the steps are the same as above

[root@k8s-new-node1 ~]# ip link add my_vtep1 type vxlan id 200 dstport 4789 local 192.168.122.15 dev ens3 nolearning  # auto-assigned MAC b2:07:fc:b6:82:a7
[root@k8s-new-node1 ~]# ip link set my_vtep1 up
[root@k8s-new-node1 ~]# ip addr add 192.168.89.0/32 dev my_vtep1
[root@k8s-new-node1 ~]# ip route add 192.168.88.0/24 via 192.168.88.0 dev my_vtep1 onlink
[root@k8s-new-node1 ~]# ip neigh add 192.168.88.0 lladdr a6:d3:23:dd:03:6f dev my_vtep1				      # do this after the master has created my_vtep0, then fill in its MAC
[root@k8s-new-node1 ~]# bridge fdb append a6:d3:23:dd:03:6f dev my_vtep1 dst 192.168.122.14			      # do this after the master has created my_vtep0, then fill in its MAC

[root@k8s-new-node1 ~]# ip -d link show
16: my_vtep1:  mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether b2:07:fc:b6:82:a7 brd ff:ff:ff:ff:ff:ff promiscuity 0 
    vxlan id 200 local 192.168.122.15 dev ens3 srcport 0 0 dstport 4789 nolearning ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
  • Test connectivity
[root@k8s-new-master flannel]# ip netns exec net1 ping 192.168.89.1		# net1 on master to net3 on node1
PING 192.168.89.1 (192.168.89.1) 56(84) bytes of data.
64 bytes from 192.168.89.1: icmp_seq=1 ttl=62 time=1.18 ms

[root@k8s-new-node1 ~]# tcpdump -i ens3 -nnev port 4789				# net1 on master to net3 on node1, captured on node1
tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
18:24:03.775317 88:4f:d5:25:80:12 > 88:4f:d5:25:80:13, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 53734, offset 0, flags [none], proto UDP (17), length 134)
    192.168.122.14.37379 > 192.168.122.15.4789: VXLAN, flags [I] (0x08), vni 200
a6:d3:23:dd:03:6f > b2:07:fc:b6:82:a7, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 752, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.88.0 > 192.168.89.1: ICMP echo request, id 29566, seq 3, length 64
18:24:03.775542 88:4f:d5:25:80:13 > 88:4f:d5:25:80:12, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 40978, offset 0, flags [none], proto UDP (17), length 134)
    192.168.122.15.55264 > 192.168.122.14.4789: VXLAN, flags [I] (0x08), vni 200
b2:07:fc:b6:82:a7 > a6:d3:23:dd:03:6f, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 51787, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.89.1 > 192.168.88.0: ICMP echo reply, id 29566, seq 3, length 64

	
[root@k8s-new-node1 ~]# ip netns exec net3 ping 192.168.88.1			# net3 on node1 to net1 on master
PING 192.168.88.1 (192.168.88.1) 56(84) bytes of data.
64 bytes from 192.168.88.1: icmp_seq=1 ttl=62 time=1.27 ms
	
[root@k8s-new-node1 ~]# ip netns exec net3 ping 192.168.88.2			# net3 on node1 to net2 on master
PING 192.168.88.2 (192.168.88.2) 56(84) bytes of data.
64 bytes from 192.168.88.2: icmp_seq=1 ttl=62 time=1.08 ms

[root@k8s-new-master ~]# ping 192.168.89.1					# master host to net3 on node1
PING 192.168.89.1 (192.168.89.1) 56(84) bytes of data.
64 bytes from 192.168.89.1: icmp_seq=1 ttl=63 time=4.45 ms

[root@k8s-new-node1 ~]# ping 192.168.88.1					# node1 host to net1 on master
PING 192.168.88.1 (192.168.88.1) 56(84) bytes of data.
64 bytes from 192.168.88.1: icmp_seq=1 ttl=63 time=0.969 ms

Common flannel backends

flannel's common backends are udp, vxlan, and host-gw. To compare their pros and cons we need a rough idea of how each is implemented; all of them deal with cross-node container communication.

udp

First, udp encapsulation: simply put, the layer-3 IP packet is wrapped inside a UDP packet, with the two inner IPs belonging to containers on different nodes. Data path:

pod A (container) on node 1 -> cni0 -> flannel0 (tun device) -> flanneld:8285 -> eth0 (node 1's public-IP NIC) -> internet
	-> eth1 (node 2's public-IP NIC) -> flanneld:8285 -> flannel0 (tun device) -> cni0 -> pod B (container)

vxlan

Next is vxlan, which was already covered above: vxlan wraps a layer-2 frame inside UDP, and the two IPs inside that frame again belong to containers on different nodes. Data path:

pod A (container) on node 1 -> cni0 -> flannel.1 (VTEP, Virtual Tunnel End Point device) -> eth0 (node 1's public-IP NIC) -> internet
	-> eth1 (node 2's public-IP NIC) -> flannel.1 -> cni0 -> pod B (container)

host-gw

Finally host-gw, which forwards packets to the right node simply by adding routes. Data path:

pod A (container) on node 1 -> cni0 -> eth0 (node 1's public-IP NIC) -> internet
	-> eth1 (node 2's public-IP NIC) -> cni0 -> pod B (container)

Pros and cons

  • Performance: host-gw > vxlan > udp. host-gw adds no extra overhead when forwarding; vxlan and udp both add an extra layer of encapsulation, and because udp encapsulation involves repeated switches between user space and kernel space, its overhead is larger than vxlan's.
  • Applicability: vxlan = udp > host-gw. Like udp, vxlan only needs layer-3 reachability between nodes; once layer 3 is reachable, the encapsulating UDP packets can cross nodes. host-gw uses the remote node as the next hop (destination MAC) for cross-node forwarding, so the nodes must be layer-2 reachable.

In short, vxlan can by now completely replace udp mode, so the rest of this article only covers vxlan and host-gw. In practice we want both high performance and a fallback when layer 2 is not reachable, and vxlan mode has a DirectRouting option for exactly that: when enabled, host-gw is used automatically if the nodes are layer-2 reachable, otherwise vxlan is used for cross-node container communication.

So describing the vxlan implementation plus the DirectRouting option covers flannel's common backends.
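For reference, a pure host-gw setup only requires swapping the backend type in net-conf.json (a minimal sketch, reusing the pod network from this article):

  net-conf.json: |
    {
      "Network": "192.16.0.0/16",
      "Backend": {
        "Type": "host-gw"
      }
    }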

flannel plugin code walkthrough

Main flow

func main() {
	//step 1: determine which interface to use
	if opts.version {				//print version info and exit
		fmt.Fprintln(os.Stderr, version.Version)
		os.Exit(0)
	}

	flagutil.SetFlagsFromEnv(flannelFlags, "FLANNELD")

	// Validate flags
	if opts.subnetLeaseRenewMargin >= 24*60 || opts.subnetLeaseRenewMargin <= 0 {		//flag check: the subnet lease renew margin (in minutes) must be within one day
		log.Error("Invalid subnet-lease-renew-margin option, out of acceptable range")
		os.Exit(1)
	}

	// Work out which interface to use
	var extIface *backend.ExternalInterface
	var err error
	// Check the default interface only if no interfaces are specified
	if len(opts.iface) == 0 && len(opts.ifaceRegex) == 0 {			//no interface specified, find one automatically
		extIface, err = LookupExtIface(opts.publicIP, "")		//look up the interface; see the source for the details
		if err != nil {
			log.Error("Failed to find any valid interface to use: ", err)
			os.Exit(1)
		}
	} else {
		// Check explicitly specified interfaces			//interfaces were specified, use the one matching by name
		for _, iface := range opts.iface {
			extIface, err = LookupExtIface(iface, "")
			if err != nil {
				log.Infof("Could not find valid interface matching %s: %s", iface, err)
			}

			if extIface != nil {
				break
			}
		}

		// Check interfaces that match any specified regexes
		if extIface == nil {						//interface name was specified via a regular expression
			for _, ifaceRegex := range opts.ifaceRegex {
				extIface, err = LookupExtIface("", ifaceRegex)
				if err != nil {
					log.Infof("Could not find valid interface matching %s: %s", ifaceRegex, err)
				}

				if extIface != nil {
					break
				}
			}
		}

		if extIface == nil {						//no suitable interface found, exit
			// Exit if any of the specified interfaces do not match
			log.Error("Failed to find interface to use that matches the interfaces and/or regexes provided")
			os.Exit(1)
		}
	}

	//step 2: create the subnet manager, which persists subnet data; kubernetes api-server and etcd are supported as stores
	sm, err := newSubnetManager()			//create the subnet manager
	if err != nil {
		log.Error("Failed to create SubnetManager: ", err)
		os.Exit(1)
	}
	log.Infof("Created subnet manager: %s", sm.Name())

	// Register for SIGINT and SIGTERM
	log.Info("Installing signal handlers")
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)

	// This is the main context that everything should run in.
	// All spawned goroutines should exit when cancel is called on this context.
	// Go routines spawned from main.go coordinate using a WaitGroup. This provides a mechanism to allow the shutdownHandler goroutine
	// to block until all the goroutines return . If those goroutines spawn other goroutines then they are responsible for
	// blocking and returning only when cancel() is called.
	ctx, cancel := context.WithCancel(context.Background())		//create a cancellable context
	wg := sync.WaitGroup{}

	wg.Add(1)
	go func() {
		shutdownHandler(ctx, sigs, cancel)
		wg.Done()
	}()

	if opts.healthzPort > 0 {
		// It's not super easy to shutdown the HTTP server so don't attempt to stop it cleanly
		go mustRunHealthz()
	}

	//step 3: create the device and bring it up
	// Fetch the network config (i.e. what backend to use etc..).
	config, err := getConfig(ctx, sm)
	if err == errCanceled {
		wg.Wait()
		os.Exit(0)
	}

	// Create a backend manager then use it to create the backend and register the network with it.
	bm := backend.NewManager(ctx, sm, extIface)		//create the backend manager
	be, err := bm.GetBackend(config.BackendType)
	if err != nil {
		log.Errorf("Error fetching backend: %s", err)
		cancel()
		wg.Wait()
		os.Exit(1)
	}

	bn, err := be.RegisterNetwork(ctx, wg, config)		//call the backend's RegisterNetwork; for vxlan this is RegisterNetwork in vxlan.go, detailed below together with host-gw
	if err != nil {
		log.Errorf("Error registering network: %s", err)
		cancel()
		wg.Wait()
		os.Exit(1)
	}

	// Set up ipMasq if needed
	if opts.ipMasq {				//enable ip-masquerade if configured
		if err = recycleIPTables(config.Network, bn.Lease()); err != nil {
			log.Errorf("Failed to recycle IPTables rules, %v", err)
			cancel()
			wg.Wait()
			os.Exit(1)
		}
		log.Infof("Setting up masking rules")
		go network.SetupAndEnsureIPTables(network.MasqRules(config.Network, bn.Lease()), opts.iptablesResyncSeconds)	//install the iptables rules (ip-masquerade)
	}

	// Always enables forwarding rules. This is needed for Docker versions >1.13 (https://docs.docker.com/engine/userguide/networking/default_network/container-communication/#container-communication-between-hosts)
	// In Docker 1.12 and earlier, the default FORWARD chain policy was ACCEPT.
	// In Docker 1.13 and later, Docker sets the default policy of the FORWARD chain to DROP.
	if opts.iptablesForwardRules {				//install the FORWARD chain rules
		log.Infof("Changing default FORWARD chain policy to ACCEPT")
		go network.SetupAndEnsureIPTables(network.ForwardRules(config.Network.String()), opts.iptablesResyncSeconds)
	}

	if err := WriteSubnetFile(opts.subnetFile, config.Network, opts.ipMasq, bn); err != nil {		//write the subnet file
		// Continue, even though it failed.
		log.Warningf("Failed to write subnet file: %s", err)
	} else {
		log.Infof("Wrote subnet file to %s", opts.subnetFile)
	}

	// Start "Running" the backend network. This will block until the context is done so run in another goroutine.
	log.Info("Running backend.")
	wg.Add(1)
	go func() {
		bn.Run(ctx)			//for a vxlan network this is Run in vxlan_network.go
		wg.Done()
	}()

	daemon.SdNotify(false, "READY=1")

	//step 4: start watching
	// Kube subnet mgr doesn't lease the subnet for this node - it just uses the podCidr that's already assigned.
	if !opts.kubeSubnetMgr {
		//entered when the network is managed via etcd; this function loops forever
		err = MonitorLease(ctx, sm, bn, &wg)			//watch this node's lease so a new one can be acquired quickly when it expires
		if err == errInterrupted {
			// The lease was "revoked" - shut everything down
			cancel()
		}
	}

	log.Info("Waiting for all goroutines to exit")
	// Block waiting for all the goroutines to finish.
	wg.Wait()
	log.Info("Exiting cleanly...")
	os.Exit(0)
}

The flow above boils down to: step 1, determine the interface; step 2, create the subnet manager; step 3, create the device and bring it up; step 4, start watching.
Steps 3 (create and activate the device) and 4 (start watching) differ depending on the configured backend, as described below.
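The interface-selection step maps directly to flanneld's command-line flags (opts.iface, opts.ifaceRegex, and opts.publicIP above). For example, the following hypothetical invocations pin the interface explicitly or by regex, on top of the two flags our DaemonSet already passes:

/opt/bin/flanneld --kube-subnet-mgr --ip-masq --iface=ens3
/opt/bin/flanneld --kube-subnet-mgr --ip-masq --iface-regex='ens.*'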

vxlan backend

Here we mainly cover creating the VTEP virtual device and watching subnet add/remove events (i.e. cluster nodes joining and leaving).

Creating the device

The Go implementation of the vxlan backend: link

Following RegisterNetwork as called from the main flow, here is a brief look at the vxlan backend implementation.

Registering the network:
Inputs: the context ctx and the subnet config
Output: a backend.Network (the vxlan backend network)

func (be *VXLANBackend) RegisterNetwork(ctx context.Context, wg sync.WaitGroup, config *subnet.Config) (backend.Network, error) {
	// Parse our configuration
	cfg := struct {
		VNI           int
		Port          int
		GBP           bool
		Learning      bool
		DirectRouting bool
	}{
		VNI: defaultVNI,
	}

	if len(config.Backend) > 0 {  			//parse the backend config
		if err := json.Unmarshal(config.Backend, &cfg); err != nil {
			return nil, fmt.Errorf("error decoding VXLAN backend config: %v", err)
		}
	}
	log.Infof("VXLAN config: VNI=%d Port=%d GBP=%v Learning=%v DirectRouting=%v", cfg.VNI, cfg.Port, cfg.GBP, cfg.Learning, cfg.DirectRouting)

	devAttrs := vxlanDeviceAttrs{					//VXLAN device attributes
		vni:       uint32(cfg.VNI),
		name:      fmt.Sprintf("flannel.%v", cfg.VNI),
		vtepIndex: be.extIface.Iface.Index,
		vtepAddr:  be.extIface.IfaceAddr,
		vtepPort:  cfg.Port,
		gbp:       cfg.GBP,
		learning:  cfg.Learning,
	}

	dev, err := newVXLANDevice(&devAttrs)			//create the VXLAN device
	if err != nil {
		return nil, err
	}
	dev.directRouting = cfg.DirectRouting

	subnetAttrs, err := newSubnetAttrs(be.extIface.ExtAddr, dev.MACAddr())			//build the subnet (lease) attributes
	if err != nil {
		return nil, err
	}

	lease, err := be.subnetMgr.AcquireLease(ctx, subnetAttrs)			//acquire the subnet lease
	switch err {
	case nil:
	case context.Canceled, context.DeadlineExceeded:
		return nil, err
	default:
		return nil, fmt.Errorf("failed to acquire lease: %v", err)
	}

	// Ensure that the device has a /32 address so that no broadcast routes are created.
	// This IP is just used as a source address for host to workload traffic (so
	// the return path for the traffic has an address on the flannel network to use as the destination)
	if err := dev.Configure(ip.IP4Net{IP: lease.Subnet.IP, PrefixLen: 32}); err != nil {		//assign the IP and bring the device up
		return nil, fmt.Errorf("failed to configure interface %s: %s", dev.link.Attrs().Name, err)
	}

	return newNetwork(be.subnetMgr, be.extIface, dev, ip.IP4Net{}, lease)			//construct the network object
}

In short: create the VXLAN device, acquire the lease, configure the device's IP, and return the network object.

Next, a quick look at creating the VXLAN device and acquiring the lease.

First, creating the VXLAN device:
Input: devAttrs, the device attributes
Output: a vxlanDevice object

func newVXLANDevice(devAttrs *vxlanDeviceAttrs) (*vxlanDevice, error) {
	link := &netlink.Vxlan{
		LinkAttrs: netlink.LinkAttrs{
			Name: devAttrs.name,
		},
		VxlanId:      int(devAttrs.vni),
		VtepDevIndex: devAttrs.vtepIndex,
		SrcAddr:      devAttrs.vtepAddr,
		Port:         devAttrs.vtepPort,
		Learning:     devAttrs.learning,
		GBP:          devAttrs.gbp,
	}

	link, err := ensureLink(link)			//create the VXLAN device (or reuse an existing one)
	if err != nil {
		return nil, err
	}

	_, _ = sysctl.Sysctl(fmt.Sprintf("net/ipv6/conf/%s/accept_ra", devAttrs.name), "0")

	return &vxlanDevice{
		link: link,
	}, nil
}

func ensureLink(vxlan *netlink.Vxlan) (*netlink.Vxlan, error) {
	err := netlink.LinkAdd(vxlan)
	if err == syscall.EEXIST {
		// it's ok if the device already exists as long as config is similar
		log.V(1).Infof("VXLAN device already exists")
		existing, err := netlink.LinkByName(vxlan.Name) 		//fetch the existing vxlan device
		if err != nil {
			return nil, err
		}

		incompat := vxlanLinksIncompat(vxlan, existing)			//compare the desired and existing device configs
		if incompat == "" {
			log.V(1).Infof("Returning existing device")
			return existing.(*netlink.Vxlan), nil
		}

		// delete existing
		log.Warningf("%q already exists with incompatable configuration: %v; recreating device", vxlan.Name, incompat)		//incompatible, delete the existing device
		if err = netlink.LinkDel(existing); err != nil {
			return nil, fmt.Errorf("failed to delete interface: %v", err)
		}

		// create new
		if err = netlink.LinkAdd(vxlan); err != nil {			//create the new vxlan device
			return nil, fmt.Errorf("failed to create vxlan interface: %v", err)
		}
	} else if err != nil {
		return nil, err
	}

	ifindex := vxlan.Index
	link, err := netlink.LinkByIndex(vxlan.Index)		//look the device up by index
	if err != nil {
		return nil, fmt.Errorf("can't locate created vxlan device with index %v", ifindex)
	}

	var ok bool
	if vxlan, ok = link.(*netlink.Vxlan); !ok {
		return nil, fmt.Errorf("created vxlan device with index %v is not vxlan", ifindex)
	}

	return vxlan, nil
}

The vxlan device above is created through the third-party netlink library's LinkAdd function; see that library's code if interested.
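As a standalone illustration (this is not flannel's code, just a minimal sketch using the same vishvananda/netlink library; the device name, VNI, and addresses are made up to mirror the earlier examples), creating and configuring a VTEP similar to flannel.1 looks roughly like this:

package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	// Underlay interface the VTEP sends through, e.g. ens3.
	parent, err := netlink.LinkByName("ens3")
	if err != nil {
		log.Fatal(err)
	}

	// Describe a VXLAN device roughly like flannel.1: VNI 1, UDP port 8472,
	// learning disabled because flanneld programs the ARP/FDB entries itself.
	vtep := &netlink.Vxlan{
		LinkAttrs:    netlink.LinkAttrs{Name: "demo.1"},
		VxlanId:      1,
		VtepDevIndex: parent.Attrs().Index,
		SrcAddr:      net.ParseIP("192.168.122.14"),
		Port:         8472,
		Learning:     false,
	}
	if err := netlink.LinkAdd(vtep); err != nil {
		log.Fatal(err)
	}

	// Give it a /32 address (the node subnet's .0) and bring it up,
	// mirroring dev.Configure in the flannel code above.
	addr, err := netlink.ParseAddr("192.16.0.0/32")
	if err != nil {
		log.Fatal(err)
	}
	if err := netlink.AddrAdd(vtep, addr); err != nil {
		log.Fatal(err)
	}
	if err := netlink.LinkSetUp(vtep); err != nil {
		log.Fatal(err)
	}
}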

Then, acquiring the lease:
Inputs: the context ctx and the lease attributes
Output: the lease
This code exists in two places, etcdv2/local_manager.go and kube/kube.go; the etcdv2 version is shown here.

func (m *LocalManager) AcquireLease(ctx context.Context, attrs *LeaseAttrs) (*Lease, error) {
	config, err := m.GetNetworkConfig(ctx)			//fetch the network config from etcd
	if err != nil {
		return nil, err
	}

	for i := 0; i < raceRetries; i++ {
		l, err := m.tryAcquireLease(ctx, config, attrs.PublicIP, attrs)
		switch err {
		case nil:
			return l, nil
		case errTryAgain:
			continue
		default:
			return nil, err
		}
	}

	return nil, errors.New("Max retries reached trying to acquire a subnet")
}

//inputs: the context ctx, the network config, the external IP, and the lease attributes
//output: the lease object
func (m *LocalManager) tryAcquireLease(ctx context.Context, config *Config, extIaddr ip.IP4, attrs *LeaseAttrs) (*Lease, error) {
	leases, _, err := m.registry.getSubnets(ctx)
	if err != nil {
		return nil, err
	}

	// Try to reuse a subnet if there's one that matches our IP
	if l := findLeaseByIP(leases, extIaddr); l != nil {			//check etcd for an existing lease matching our IP
		// Make sure the existing subnet is still within the configured network
		if isSubnetConfigCompat(config, l.Subnet) {
			log.Infof("Found lease (%v) for current IP (%v), reusing", l.Subnet, extIaddr)

			ttl := time.Duration(0)
			if !l.Expiration.IsZero() {
				// Not a reservation
				ttl = subnetTTL
			}
			exp, err := m.registry.updateSubnet(ctx, l.Subnet, attrs, ttl, 0)	//refresh the subnet lease
			if err != nil {
				return nil, err
			}

			l.Attrs = *attrs
			l.Expiration = exp
			return l, nil
		} else {
			log.Infof("Found lease (%v) for current IP (%v) but not compatible with current config, deleting", l.Subnet, extIaddr)
			if err := m.registry.deleteSubnet(ctx, l.Subnet); err != nil {		//delete the existing, incompatible subnet
				return nil, err
			}
		}
	}

	// no existing match, check if there was a previous subnet to use
	var sn ip.IP4Net
	if !m.previousSubnet.Empty() {
		// use previous subnet					//same logic as above, using the subnet from /run/flannel/subnet.env
		if l := findLeaseBySubnet(leases, m.previousSubnet); l != nil {
			// Make sure the existing subnet is still within the configured network
			if isSubnetConfigCompat(config, l.Subnet) {
				log.Infof("Found lease (%v) matching previously leased subnet, reusing", l.Subnet)

				ttl := time.Duration(0)
				if !l.Expiration.IsZero() {
					// Not a reservation
					ttl = subnetTTL
				}
				exp, err := m.registry.updateSubnet(ctx, l.Subnet, attrs, ttl, 0)
				if err != nil {
					return nil, err
				}

				l.Attrs = *attrs
				l.Expiration = exp
				return l, nil
			} else {
				log.Infof("Found lease (%v) matching previously leased subnet but not compatible with current config, deleting", l.Subnet)
				if err := m.registry.deleteSubnet(ctx, l.Subnet); err != nil {
					return nil, err
				}
			}
		} else {
			// Check if the previous subnet is a part of the network and of the right subnet length
			if isSubnetConfigCompat(config, m.previousSubnet) {
				log.Infof("Found previously leased subnet (%v), reusing", m.previousSubnet)
				sn = m.previousSubnet
			} else {
				log.Errorf("Found previously leased subnet (%v) that is not compatible with the Etcd network config, ignoring", m.previousSubnet)
			}
		}
	}

	if sn.Empty() {				//neither lookup produced a usable subnet
		// no existing match, grab a new one
		sn, err = m.allocateSubnet(config, leases)		//allocate a new subnet
		if err != nil {
			return nil, err
		}
	}

	exp, err := m.registry.createSubnet(ctx, sn, attrs, subnetTTL)		//store it in etcd with a 24h TTL, creating the subnets entry
	switch {
	case err == nil:
		log.Infof("Allocated lease (%v) to current node (%v) ", sn, extIaddr)
		return &Lease{
			Subnet:     sn,
			Attrs:      *attrs,
			Expiration: exp,
		}, nil
	case isErrEtcdNodeExist(err):
		return nil, errTryAgain
	default:
		return nil, err
	}
}

In short: based on the subnet info, check etcd or /run/flannel/subnet.env for an existing lease, and if there is none, register a new subnet in etcd.
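When the etcd subnet manager is used (with its default /coreos.com/network prefix), the registered leases can be inspected with etcdctl; the layout looks roughly like this (illustrative values that mirror this article's cluster):

# etcdctl ls /coreos.com/network/subnets
/coreos.com/network/subnets/192.16.0.0-24
/coreos.com/network/subnets/192.16.1.0-24
/coreos.com/network/subnets/192.16.2.0-24

# etcdctl get /coreos.com/network/subnets/192.16.1.0-24
{"PublicIP":"192.168.122.15","BackendType":"vxlan","BackendData":{"VtepMAC":"2e:2a:a5:7c:e8:f2"}}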

Starting the watch

func (nw *network) Run(ctx context.Context) {
	wg := sync.WaitGroup{}

	log.V(0).Info("watching for new subnet leases")
	events := make(chan []subnet.Event)
	wg.Add(1)
	go func() {
		subnet.WatchLeases(ctx, nw.subnetMgr, nw.SubnetLease, events)	//watch all leases, via the WatchLeases function in watch.go
										// WatchLeases->watchSubnets
		log.V(1).Info("WatchLeases exited")				//fetches the data (e.g. from etcd) in a blocking way
		wg.Done()
	}()

	defer wg.Wait()

	for {		//loop forever, handling all events
		select {
		case evtBatch := <-events:
			nw.handleSubnetEvents(evtBatch)		 //an event batch arrived, handle it

		case <-ctx.Done():
			return
		}
	}
}

func (nw *network) handleSubnetEvents(batch []subnet.Event) {
	for _, event := range batch {
		sn := event.Lease.Subnet
		attrs := event.Lease.Attrs
		if attrs.BackendType != "vxlan" {
			log.Warningf("ignoring non-vxlan subnet(%s): type=%v", sn, attrs.BackendType)
			continue
		}

		var vxlanAttrs vxlanLeaseAttrs
		if err := json.Unmarshal(attrs.BackendData, &vxlanAttrs); err != nil {		//parse the JSON backend data returned by the store
			log.Error("error decoding subnet lease JSON: ", err)
			continue
		}

		// This route is used when traffic should be vxlan encapsulated
		vxlanRoute := netlink.Route{							//cross-node container traffic: the route used when vxlan encapsulation is needed
			LinkIndex: nw.dev.link.Attrs().Index,
			Scope:     netlink.SCOPE_UNIVERSE,
			Dst:       sn.ToIPNet(),
			Gw:        sn.IP.ToIP(),
		}
		vxlanRoute.SetFlag(syscall.RTNH_F_ONLINK)

		// directRouting is where the remote host is on the same subnet so vxlan isn't required.	//vxlan+DirectRouting (= host-gw): route directly when DirectRouting is on and the nodes share a subnet
		directRoute := netlink.Route{
			Dst: sn.ToIPNet(),
			Gw:  attrs.PublicIP.ToIP(),
		}
		var directRoutingOK = false
		if nw.dev.directRouting {
			if dr, err := ip.DirectRouting(attrs.PublicIP.ToIP()); err != nil {
				log.Error(err)
			} else {
				directRoutingOK = dr
			}
		}

		switch event.Type {
		case subnet.EventAdded:
			if directRoutingOK {			//direct-routing case (vxlan+DirectRouting = host-gw): only a route is added
				log.V(2).Infof("Adding direct route to subnet: %s PublicIP: %s", sn, attrs.PublicIP)

				if err := netlink.RouteReplace(&directRoute); err != nil {
					log.Errorf("Error adding route to %v via %v: %v", sn, attrs.PublicIP, err)
					continue
				}
			} else {				//vxlan case: add the ARP entry, add the FDB entry, update the route
				log.V(2).Infof("adding subnet: %s PublicIP: %s VtepMAC: %s", sn, attrs.PublicIP, net.HardwareAddr(vxlanAttrs.VtepMAC))
				if err := nw.dev.AddARP(neighbor{IP: sn.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {	//add the ARP entry
					log.Error("AddARP failed: ", err)
					continue
				}

				if err := nw.dev.AddFDB(neighbor{IP: attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {	//add the FDB entry
					log.Error("AddFDB failed: ", err)

					// Try to clean up the ARP entry then continue
					if err := nw.dev.DelARP(neighbor{IP: event.Lease.Subnet.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
						log.Error("DelARP failed: ", err)
					}

					continue
				}

				// Set the route - the kernel would ARP for the Gw IP address if it hadn't already been set above so make sure
				// this is done last.
				if err := netlink.RouteReplace(&vxlanRoute); err != nil {			//update the route
					log.Errorf("failed to add vxlanRoute (%s -> %s): %v", vxlanRoute.Dst, vxlanRoute.Gw, err)

					// Try to clean up both the ARP and FDB entries then continue
					if err := nw.dev.DelARP(neighbor{IP: event.Lease.Subnet.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
						log.Error("DelARP failed: ", err)
					}

					if err := nw.dev.DelFDB(neighbor{IP: event.Lease.Attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
						log.Error("DelFDB failed: ", err)
					}

					continue
				}
			}
		case subnet.EventRemoved:
			if directRoutingOK {
				log.V(2).Infof("Removing direct route to subnet: %s PublicIP: %s", sn, attrs.PublicIP)
				if err := netlink.RouteDel(&directRoute); err != nil {		//direct-routing case: only the route is deleted
					log.Errorf("Error deleting route to %v via %v: %v", sn, attrs.PublicIP, err)
				}
			} else {
				log.V(2).Infof("removing subnet: %s PublicIP: %s VtepMAC: %s", sn, attrs.PublicIP, net.HardwareAddr(vxlanAttrs.VtepMAC))

				// Try to remove all entries - don't bail out if one of them fails.	//vxlan case: delete the ARP entry, FDB entry, and route
				if err := nw.dev.DelARP(neighbor{IP: sn.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
					log.Error("DelARP failed: ", err)
				}

				if err := nw.dev.DelFDB(neighbor{IP: attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
					log.Error("DelFDB failed: ", err)
				}

				if err := netlink.RouteDel(&vxlanRoute); err != nil {
					log.Errorf("failed to delete vxlanRoute (%s -> %s): %v", vxlanRoute.Dst, vxlanRoute.Gw, err)
				}
			}
		default:
			log.Error("internal error: unknown event type: ", int(event.Type))
		}
	}
}

host-gw implementation

The principle is the same as vxlan with DirectRouting (a plain route per remote pod subnet via the remote host), so it is not described separately here; the routes shown in the next section are exactly what it programs.

vxlan DirectRouting configuration (same principle as host-gw)

Configuration

net-conf.json: |
{
  "Network": "192.16.0.0/16",
  "Backend": {
    "Type": "vxlan",
    "DirectRouting": true
  }
}

If the cluster nodes are in the same subnet they are layer-2 reachable, and cross-node container traffic between those nodes automatically uses the host-gw approach, i.e. direct routing.
Below are the routes generated on the master node; the other nodes are analogous:

[root@k8s-new-master flannel]# ip route |grep "192.16\."
192.16.0.0/24 dev cni0 proto kernel scope link src 192.16.0.1
192.16.1.0/24 via 192.168.122.15 dev ens3
192.16.2.0/24 via 192.168.122.16 dev ens3

Let's look at the packet format with a capture. Below is traffic from pod 192.16.1.58 on node1 to pod 192.16.2.19 on node2, captured on node2's public-facing interface:

[root@k8s-new-node2 ~]# tcpdump -i ens3 host 192.16.1.58 -nnev
19:54:33.590095 88:4f:d5:25:80:13 > 88:4f:d5:25:80:14, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 38285, offset 0, flags [DF], proto ICMP (1), length 84)
    192.16.1.58 > 192.16.2.19: ICMP echo request, id 28140, seq 1, length 64


19:54:33.590364 88:4f:d5:25:80:14 > 88:4f:d5:25:80:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 12163, offset 0, flags [none], proto ICMP (1), length 84)
    192.16.2.19 > 192.16.1.58: ICMP echo reply, id 28140, seq 1, length 64

[root@k8s-new-node2 ~]# ifconfig ens3
ens3: flags=4163  mtu 1500
        inet 192.168.122.16  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 88:4f:d5:25:80:14  txqueuelen 1000  (Ethernet)

[root@k8s-new-node1 ns_tools]# ifconfig ens3
ens3: flags=4163  mtu 1500
        inet 192.168.122.15  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 88:4f:d5:25:80:13  txqueuelen 1000  (Ethernet)

Clearly only the destination MAC address is rewritten and the IPs at the IP layer are unchanged, so the traffic did not go through the vxlan tunnel but took the host-gw direct route.

Using flannel with other plugins for Network Policy

A complete Network Policy write-up will follow; link to be added.

flannel plugin summary and discussion

More will be added later once plugins such as calico are covered; the comparison with macvlan was already given at the beginning of this article.
