Calico BGP Communication Analysis


The BGP network model

  • Compared with the IPIP network, the biggest difference in a BGP network is that there is no tunl0 tunnel device: pod-to-pod traffic goes straight from the host to the destination host as a next hop resolved via ARP, cutting out the tunl0 step

The two BGP modes:

  • Full-mesh mode (node-to-node mesh): every BGP speaker establishes a BGP session with every other speaker, so the number of sessions grows on the order of N^2, which consumes a large number of connections as the cluster grows. The official recommendation is to avoid this mode in clusters of more than about 100 nodes
  • Route reflector mode, Router Reflection (RR): one or more BGP speakers are designated as route reflectors; they peer with the other speakers in the network, and each speaker only needs a BGP session with a route reflector to learn the routes for the whole network. In Calico, RR mode can be implemented via a Global Peer
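As a hedged sketch of what switching to RR mode can look like (field names follow the projectcalico.org/v3 API; the peer IP and AS number below are hypothetical and must be adapted to your environment, and the exact resources should be checked against the Calico docs for your version):

```yaml
# Disable the node-to-node full mesh, then point all nodes at a global peer.
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false   # switch off the N^2 full mesh
  asNumber: 64512                # hypothetical private AS number
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: global-rr
spec:
  peerIP: 192.168.0.100          # hypothetical route reflector address
  asNumber: 64512
```

A BGPPeer without a node selector applies globally, which is the "Global Peer" mechanism the article mentions.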

Enabling BGP

# IP-in-IP mode is controlled by the environment variable CALICO_IPV4POOL_IPIP: set it to "Always" to enable IPIP, or to "Never" to disable it (plain BGP routing)
- name: CALICO_IPV4POOL_IPIP
  value: "Never"
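The same choice can also be expressed on the IPPool resource itself (applied with calicoctl). A sketch, assuming this article's pod addressing; verify the field names against your Calico version:

```yaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 172.16.0.0/16     # assumed pool covering the /26 blocks seen below
  blockSize: 26
  ipipMode: Never          # plain BGP routing, no tunnel
  vxlanMode: Never
  natOutgoing: true
```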

Test container YAML

| Host         | IP              |
| ------------ | --------------- |
| k8s-master-1 | 192.168.0.11/24 |
| k8s-node-1   | 192.168.0.12/24 |

apiVersion: v1
kind: Service
metadata:
  name: busybox
  namespace: devops
spec:
  selector:
    app: busybox
  type: NodePort
  ports:
  - name: http
    port: 8888
    protocol: TCP
    targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  namespace: devops
spec:
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: busybox
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      name: busybox
      labels:
        app: busybox
    spec:
      affinity:	# keep the two busybox pods on different nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - busybox
      restartPolicy: Always
      containers:
      - command: ["/bin/sh","-c","mkdir -p /var/lib/www && httpd -f -v -p 80 -h /var/lib/www"]
        name: busybox
        image: docker.io/library/busybox:latest
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 80

BGP cross-host analysis

  1. View the Pod information
╰─ kubectl get pods -n devops -o custom-columns=NAME:.metadata.name,IP:.status.podIP,HOST:.spec.nodeName
NAME                       IP              HOST
busybox-77649b9c55-fv298   172.16.196.1    k8s-master-1
busybox-77649b9c55-s7zfv   172.16.109.65   k8s-node-1
jenkins-56b6774bb6-d8v8b   172.16.109.66   k8s-node-1
  2. Enter container busybox-77649b9c55-fv298 on k8s-master-1 and inspect its routes
/ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
    link/ether 0e:1c:2c:f6:2a:f9 brd ff:ff:ff:ff:ff:ff
    inet 172.16.196.1/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c1c:2cff:fef6:2af9/64 scope link
       valid_lft forever preferred_lft forever

/ # route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         169.254.1.1     0.0.0.0         UG    0      0        0 eth0
169.254.1.1     0.0.0.0         255.255.255.255 UH    0      0        0 eth0

From the output above we can see that container busybox-77649b9c55-fv298 on k8s-master-1 has a default gateway of 169.254.1.1, yet no NIC anywhere on the network carries that address

  • The routing table shows 169.254.1.1 as the container's default gateway, but no NIC owns that IP. When a packet's destination is not local, the kernel consults the routing table, finds the gateway, obtains the gateway's MAC address via an ARP broadcast, and rewrites the destination MAC of the outgoing frame to the gateway's MAC; the gateway's IP itself never appears in any packet header. In other words, nobody cares what that IP actually is: all that matters is that something answers ARP for it so a MAC address can be found

  • In a Kubernetes Calico network, when a packet is destined for another network, the container first sends an ARP broadcast; the "gateway" 169.254.1.1 answers it by returning its own MAC address to the sender.
    Subsequent traffic is carried by the veth pair: proxy ARP is used as a deliberate form of ARP spoofing. This suppresses ARP broadcast storms and also enables cross-network access through the proxy

  • Looking at the MAC address, it must have been injected by Calico, and it even answers ARP. Normally the kernel would broadcast an ARP request asking who on the L2 segment owns 169.254.1.1, and the device owning that IP would reply with its MAC address. Here the situation is odd: neither the container nor the host owns that IP, and even the host-side interface calixxxxx carries the dummy MAC ee:ee:ee:ee:ee:ee

  3. Neighbor and NIC information on k8s-master-1
# neighbor table inside the busybox container
/ # ip neigh
169.254.1.1 dev eth0 lladdr ee:ee:ee:ee:ee:ee used 0/0/0 probes 1 STALE

# NIC information on the k8s-master-1 host itself
[root@k8s-master-1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:b1:02:f0 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 192.168.0.11/24 brd 192.168.0.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:feb1:2f0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether c2:25:66:ef:00:10 brd ff:ff:ff:ff:ff:ff
    inet 10.96.0.1/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.6.244/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.0.10/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.35.201/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
5: cali1dadcdd5b31@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default			# host-side peer of the busybox container
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-989b68fd-d10b-b11b-781e-18feb8b85b12
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
8: cali42cd276b2be@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default			# another container (coredns)
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-1328bbde-1a3c-a60c-9c3d-f4e1c2fbb3cd
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
       
# Check the proxy-ARP setting of the cali1dadcdd5b31 interface
[root@k8s-master-1 ~]# cat /proc/sys/net/ipv4/conf/cali1dadcdd5b31/proxy_arp
1
  • Through the veth pair, the container's frames land on the host-side calixxx interface. Because calixxx has proxy ARP enabled, it answers every ARP request, so all of the container's traffic is pulled onto calixxx, that is, into the host network stack, and the host routing table then forwards it to the next hop. This can be verified with cat /proc/sys/net/ipv4/conf/calixxx/proxy_arp, which prints 1
  • Calico thus uses a neat trick to steer all workload traffic toward the special gateway 169.254.1.1 and onto the host's calixxx device, ultimately converting all L2/L3 traffic into pure L3 forwarding
  • ARP is answered on the host via proxy ARP, so ARP broadcasts are confined to the host: no broadcast storms, and no ARP-table bloat
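Why the container consults the gateway for every destination can be seen from its /32 address. A short illustrative Python sketch using the addresses from this article:

```python
import ipaddress

# A Calico pod gets its eth0 address with a /32 mask, so the "connected
# network" contains only the pod itself. Every other destination is
# off-link and must go through the default gateway (169.254.1.1).
pod_if = ipaddress.ip_interface("172.16.196.1/32")
peer = ipaddress.ip_address("172.16.109.65")

print(pod_if.network.num_addresses)   # 1: nothing else is on-link
print(peer in pod_if.network)         # False: the peer must be routed
```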

busybox-77649b9c55-fv298 on k8s-master-1 pings busybox-77649b9c55-s7zfv on k8s-node-1

# Neighbor/MAC table of the busybox container on k8s-master-1 (empty)
/ # ip neigh show

# busybox on k8s-master-1 pings busybox on k8s-node-1
/ # ping -c 1 172.16.109.65
PING 172.16.109.65 (172.16.109.65): 56 data bytes
64 bytes from 172.16.109.65: seq=0 ttl=62 time=1.603 ms

--- 172.16.109.65 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.603/1.603/1.603 ms

# Check the ARP cache
/ # arp -n
? (169.254.1.1) at ee:ee:ee:ee:ee:ee [ether]  on eth0

# Current interfaces/IPs of the busybox container on k8s-master-1
/ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1480 qdisc noqueue
    link/ether 86:9c:03:9e:db:9f brd ff:ff:ff:ff:ff:ff
    inet 172.16.196.1/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::849c:3ff:fe9e:db9f/64 scope link
       valid_lft forever preferred_lft forever

# Routing table on k8s-master-1
[root@k8s-master-1 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.2     0.0.0.0         UG    100    0        0 ens160
172.16.109.64   192.168.0.12    255.255.255.192 UG    0      0        0 ens160	# route to the other node's pod CIDR
172.16.196.0    0.0.0.0         255.255.255.192 U     0      0        0 *			  # the local pod CIDR is suppressed via a blackhole route
172.16.196.1    0.0.0.0         255.255.255.255 UH    0      0        0 cali1dadcdd5b31		# host-side peer of the busybox container
172.16.196.2    0.0.0.0         255.255.255.255 UH    0      0        0 cali42cd276b2be		# another container (coredns); one host route per local pod
192.168.0.0     0.0.0.0         255.255.255.0   U     100    0        0 ens160

# Detailed route listing
[root@k8s-master-1 ~]# ip route show
default via 192.168.0.2 dev ens160 proto static metric 100
172.16.109.64/26 via 192.168.0.12 dev ens160 proto bird
blackhole 172.16.196.0/26 proto bird
172.16.196.1 dev cali1dadcdd5b31 scope link
172.16.196.2 dev cali42cd276b2be scope link
192.168.0.0/24 dev ens160 proto kernel scope link src 192.168.0.11 metric 100
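The kernel chooses among the entries above by longest-prefix match. A minimal, illustrative Python sketch mirroring k8s-master-1's table:

```python
import ipaddress

# Simplified copy of the k8s-master-1 routing table shown above:
# destination prefix -> (next hop or None, interface).
routes = {
    "0.0.0.0/0":        ("192.168.0.2", "ens160"),
    "172.16.109.64/26": ("192.168.0.12", "ens160"),   # k8s-node-1 pod CIDR, learned via bird
    "172.16.196.0/26":  (None, "blackhole"),
    "172.16.196.1/32":  (None, "cali1dadcdd5b31"),
    "172.16.196.2/32":  (None, "cali42cd276b2be"),
    "192.168.0.0/24":   (None, "ens160"),
}

def lookup(dst):
    """Longest-prefix match, as the kernel does."""
    dst = ipaddress.ip_address(dst)
    best = max((ipaddress.ip_network(p) for p in routes
                if dst in ipaddress.ip_network(p)),
               key=lambda n: n.prefixlen)
    return routes[str(best)]

print(lookup("172.16.109.65"))  # ('192.168.0.12', 'ens160'): next hop is the peer node
print(lookup("172.16.196.2"))   # (None, 'cali42cd276b2be'): local pod, delivered directly
```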

When busybox-77649b9c55-fv298 on k8s-master-1 pings busybox-77649b9c55-s7zfv on k8s-node-1, the overall packet flow is as follows:

  1. 172.16.109.65 and 172.16.196.1 are on different networks, so the destination MAC must be that of the gateway 169.254.1.1. When the container ARPs for the gateway, the veth pair carries the request from eth0 (container) to cali1dadcdd5b31 (host); since that host interface has proxy ARP enabled (deliberate ARP spoofing), it answers with the MAC ee:ee:ee:ee:ee:ee
  2. With the MAC resolved, the container builds the frame: src: 172.16.196.1, dst: 172.16.109.65, src_mac: 86:9c:03:9e:db:9f, dst_mac: ee:ee:ee:ee:ee:ee. The container's routing table matches the default-gateway route, so the frame is handed to eth0 and, via the veth pair, arrives on the host's cali1dadcdd5b31 interface
[root@k8s-master-1 ~]# tcpdump -i cali1dadcdd5b31 icmp -e -Nnnvl
dropped privs to tcpdump
tcpdump: listening on cali1dadcdd5b31, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:55:45.042609 0e:1c:2c:f6:2a:f9 > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 5428, offset 0, flags [DF], proto ICMP (1), length 84)
    172.16.196.1 > 172.16.109.65: ICMP echo request, id 17, seq 10, length 64
16:55:45.043076 ee:ee:ee:ee:ee:ee > 0e:1c:2c:f6:2a:f9, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 60905, offset 0, flags [none], proto ICMP (1), length 84)
    172.16.109.65 > 172.16.196.1: ICMP echo reply, id 17, seq 10, length 64
  3. After the packet reaches cali1dadcdd5b31 on k8s-master-1, routing matches the CIDR rule 172.16.109.64 192.168.0.12 255.255.255.192 UG 0 0 0 ens160 (every pod on k8s-node-1 hits this rule). In BGP mode tunl0 is not used, so the frame is rebuilt directly: src: 172.16.196.1 (pod IP), dst: 172.16.109.65 (pod IP), src_mac: 00:0c:29:b1:02:f0 (ens160 on k8s-master-1), dst_mac: 00:0c:29:90:fa:e2 (ens160 on k8s-node-1). The IPs stay pod IPs while the MACs belong to the two nodes' ens160 NICs: when rebuilding the frame, k8s-master-1 ARPs for k8s-node-1's MAC and writes it into the link-layer header. This requires k8s-master-1 and k8s-node-1 to be on the same L2 network; otherwise k8s-master-1 cannot obtain k8s-node-1's MAC and the frame cannot be built. The finished frame then leaves k8s-master-1 through ens160
[root@k8s-master-1 ~]# tcpdump -i ens160 icmp -Nnnvle
dropped privs to tcpdump
tcpdump: listening on ens160, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:11:07.511766 00:0c:29:b1:02:f0 > 00:0c:29:90:fa:e2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 51277, offset 0, flags [DF], proto ICMP (1), length 84)
    172.16.196.1 > 172.16.109.65: ICMP echo request, id 17, seq 932, length 64
17:11:07.512366 00:0c:29:90:fa:e2 > 00:0c:29:b1:02:f0, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 42499, offset 0, flags [none], proto ICMP (1), length 84)
    172.16.109.65 > 172.16.196.1: ICMP echo reply, id 17, seq 932, length 64
17:11:08.512052 00:0c:29:b1:02:f0 > 00:0c:29:90:fa:e2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 51369, offset 0, flags [DF], proto ICMP (1), length 84)
  4. After leaving ens160 on k8s-master-1, the frame reaches the L2 switch (in a VM setup, a virtual switch); since the destination MAC is k8s-node-1's, the switch delivers it to k8s-node-1

  5. A capture on k8s-node-1's physical ens160 shows exactly the packets that left k8s-master-1's ens160

[root@k8s-node-1 ~]# tcpdump -i ens160 icmp -Nnnvle
dropped privs to tcpdump
tcpdump: listening on ens160, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:40:05.566173 00:0c:29:b1:02:f0 > 00:0c:29:90:fa:e2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 3808, offset 0, flags [DF], proto ICMP (1), length 84)
    172.16.196.1 > 172.16.109.65: ICMP echo request, id 19, seq 1314, length 64
17:40:05.566306 00:0c:29:90:fa:e2 > 00:0c:29:b1:02:f0, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 62570, offset 0, flags [none], proto ICMP (1), length 84)
    172.16.109.65 > 172.16.196.1: ICMP echo reply, id 19, seq 1314, length 64
  6. On receipt, k8s-node-1 matches the host route 172.16.109.65 0.0.0.0 255.255.255.255 UH 0 0 0 cali4e329df4a89 and rebuilds the frame: src: 172.16.196.1 (pod IP), dst: 172.16.109.65 (pod IP), src_mac: ee:ee:ee:ee:ee:ee (cali4e329df4a89, the host-side peer of busybox's eth0 on k8s-node-1), dst_mac: 72:19:6b:9b:bf:e2 (busybox's eth0 on k8s-node-1)
# Physical NIC information on k8s-node-1
[root@k8s-node-1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:90:fa:e2 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 192.168.0.12/24 brd 192.168.0.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe90:fae2/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether 42:2d:65:bc:7a:b1 brd ff:ff:ff:ff:ff:ff
    inet 10.96.0.1/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.6.244/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.0.10/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.35.201/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
5: cali4e329df4a89@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default		# host-side veth of the busybox container
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-d7eaf0a1-1638-338d-5309-dbd5ec632608
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
8: cali523840de229@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default	# host-side veth of another container
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-bd969f87-9e94-599c-119a-f8b8f3efc9c6
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
       
# Check the routing table
[root@k8s-node-1 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.2     0.0.0.0         UG    100    0        0 ens160
172.16.109.64   0.0.0.0         255.255.255.192 U     0      0        0 *					# the local pod CIDR is suppressed via a blackhole route
172.16.109.65   0.0.0.0         255.255.255.255 UH    0      0        0 cali4e329df4a89	# host-side peer of the busybox container
172.16.109.66   0.0.0.0         255.255.255.255 UH    0      0        0 cali523840de229	# another container; one host route per local pod
172.16.196.0    192.168.0.11    255.255.255.192 UG    0      0        0 ens160		# route to the other node's pod CIDR
192.168.0.0     0.0.0.0         255.255.255.0   U     100    0        0 ens160


# Check the MAC addresses on the wire
[root@k8s-node-1 ~]# tcpdump -i cali4e329df4a89 -Nnnvle icmp
dropped privs to tcpdump
tcpdump: listening on cali4e329df4a89, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:47:31.196897 ee:ee:ee:ee:ee:ee > 72:19:6b:9b:bf:e2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 37592, offset 0, flags [DF], proto ICMP (1), length 84)
    172.16.196.1 > 172.16.109.65: ICMP echo request, id 29, seq 0, length 64
17:47:31.196950 72:19:6b:9b:bf:e2 > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 29435, offset 0, flags [none], proto ICMP (1), length 84)
    172.16.109.65 > 172.16.196.1: ICMP echo reply, id 29, seq 0, length 64
    
# Check the ARP table
[root@k8s-node-1 ~]# arp -an
? (192.168.0.1) at 5e:e9:1e:fb:06:64 [ether] on ens160
? (172.16.109.66) at aa:2d:89:c6:49:dd [ether] PERM on cali523840de229
? (192.168.0.11) at 00:0c:29:b1:02:f0 [ether] on ens160
? (172.16.109.65) at 72:19:6b:9b:bf:e2 [ether] PERM on cali4e329df4a89
? (192.168.0.2) at 00:50:56:e0:1b:11 [ether] on ens160

Summarizing the analysis above, the communication flow is:

  1. busybox(k8s-master-1)-> calixxx -> ens160(k8s-master-1) <----> ens160(k8s-node-1) -> calixxx -> busybox(k8s-node-1)
  2. Using the next hop from k8s-master-1's routing table, the packet is forwarded to the k8s-node-1 host
  3. BGP mode requires all nodes to be on the same L2 network: cross-node pod traffic leaves a node's physical NIC framed as src_ip: pod_ip1, dst_ip: pod_ip2, src_mac: node1_ensxx_mac, dst_mac: node2_ensxx_mac. If the two nodes are not on the same L2 segment, they cannot resolve each other's MAC addresses and the frame cannot be built
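The same-L2 requirement from point 3 boils down to the BGP next hop being on-link, i.e. resolvable by ARP on the sending node's NIC. A small illustrative Python check using this article's addresses:

```python
import ipaddress

# k8s-master-1's ens160 address and the BGP next hop for k8s-node-1's pod CIDR
ens160 = ipaddress.ip_interface("192.168.0.11/24")
next_hop = ipaddress.ip_address("192.168.0.12")

# The next hop is only reachable by ARP if it falls inside ens160's
# connected network -- this is the "same L2 segment" requirement.
print(next_hop in ens160.network)   # True: the frame can be built

# A hypothetical next hop behind a router would not be on-link:
print(ipaddress.ip_address("10.0.0.12") in ens160.network)  # False
```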

BGP same-host analysis

  1. View the Pod information and the host's routes
# Pod placement
╰─ kubectl get pods -n devops -o custom-columns=NAME:.metadata.name,IP:.status.podIP,HOST:.spec.nodeName
NAME                       IP              HOST
busybox-77649b9c55-fv298   172.16.196.1    k8s-master-1
busybox-77649b9c55-s7zfv   172.16.109.65   k8s-node-1
jenkins-56b6774bb6-d8v8b   172.16.109.66   k8s-node-1

# Routing table on k8s-node-1
[root@k8s-node-1 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.2     0.0.0.0         UG    100    0        0 ens160
172.16.109.64   0.0.0.0         255.255.255.192 U     0      0        0 *
172.16.109.65   0.0.0.0         255.255.255.255 UH    0      0        0 cali4e329df4a89	# busybox container
172.16.109.66   0.0.0.0         255.255.255.255 UH    0      0        0 cali523840de229 # another container
172.16.196.0    192.168.0.11    255.255.255.192 UG    0      0        0 ens160
192.168.0.0     0.0.0.0         255.255.255.0   U     100    0        0 ens160

# Check the MAC info
[root@k8s-node-1 ~]# arp -an
? (192.168.0.1) at 5e:e9:1e:fb:06:64 [ether] on ens160
? (172.16.109.66) at aa:2d:89:c6:49:dd [ether] PERM on cali523840de229	# another container
? (192.168.0.11) at 00:0c:29:b1:02:f0 [ether] on ens160
? (172.16.109.65) at 72:19:6b:9b:bf:e2 [ether] PERM on cali4e329df4a89	# busybox container
? (192.168.0.2) at 00:50:56:e0:1b:11 [ether] on ens160

# Check the NIC info
[root@k8s-node-1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:90:fa:e2 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 192.168.0.12/24 brd 192.168.0.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe90:fae2/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether 42:2d:65:bc:7a:b1 brd ff:ff:ff:ff:ff:ff
    inet 10.96.0.1/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.6.244/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.0.10/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.35.201/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
5: cali4e329df4a89@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default		# busybox container
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-d7eaf0a1-1638-338d-5309-dbd5ec632608	
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
8: cali523840de229@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default	  # another container
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-bd969f87-9e94-599c-119a-f8b8f3efc9c6
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever

When the busybox container on k8s-node-1 pings another container on the same node, the forwarding flow is roughly:

  1. 172.16.109.65 and 172.16.109.66 are in the same subnet, but the busybox routing table has no route for the 172.16.109.64 CIDR, so the container still asks the default gateway 169.254.1.1 for a MAC address. Since busybox's peer interface cali4e329df4a89 has proxy ARP enabled, it answers with ee:ee:ee:ee:ee:ee. The container then builds the frame: src_addr: 172.16.109.65, dst_addr: 172.16.109.66, src_mac: 72:19:6b:9b:bf:e2, dst_mac: ee:ee:ee:ee:ee:ee, which is delivered to the host's cali4e329df4a89
# busybox on k8s-node-1 pings another container
/ # ping -c 1 172.16.109.66
PING 172.16.109.66 (172.16.109.66): 56 data bytes
64 bytes from 172.16.109.66: seq=0 ttl=63 time=0.206 ms

--- 172.16.109.66 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.206/0.206/0.206 ms

[root@k8s-node-1 ~]# tcpdump -i cali4e329df4a89 -Nnnvle
dropped privs to tcpdump
tcpdump: listening on cali4e329df4a89, link-type EN10MB (Ethernet), snapshot length 262144 bytes
18:17:06.642985 72:19:6b:9b:bf:e2 > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 17087, offset 0, flags [DF], proto ICMP (1), length 84)
    172.16.109.65 > 172.16.109.66: ICMP echo request, id 37, seq 0, length 64
18:17:06.643095 ee:ee:ee:ee:ee:ee > 72:19:6b:9b:bf:e2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 8840, offset 0, flags [none], proto ICMP (1), length 84)
    172.16.109.66 > 172.16.109.65: ICMP echo reply, id 37, seq 0, length 64
  2. When the packet reaches the host's cali4e329df4a89, routing matches the rule 172.16.109.66 0.0.0.0 255.255.255.255 UH 0 0 0 cali523840de229 and forwards it to the local cali523840de229 interface (thanks to the veth pair, the frame is handed straight to the container), rebuilding it as: src_addr: 172.16.109.65, dst_addr: 172.16.109.66, src_mac: ee:ee:ee:ee:ee:ee, dst_mac: aa:2d:89:c6:49:dd
[root@k8s-node-1 ~]# tcpdump -i cali523840de229 -Nnnvle icmp
dropped privs to tcpdump
tcpdump: listening on cali523840de229, link-type EN10MB (Ethernet), snapshot length 262144 bytes
18:19:07.901457 ee:ee:ee:ee:ee:ee > aa:2d:89:c6:49:dd, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 24224, offset 0, flags [DF], proto ICMP (1), length 84)
    172.16.109.65 > 172.16.109.66: ICMP echo request, id 38, seq 0, length 64
18:19:07.901483 aa:2d:89:c6:49:dd > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 14344, offset 0, flags [none], proto ICMP (1), length 84)
    172.16.109.66 > 172.16.109.65: ICMP echo reply, id 38, seq 0, length 64
  3. The return path mirrors the outbound path, so it is not analyzed separately
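One detail worth noting from the routing tables above: the /26 blackhole route does not swallow local pod traffic, because the per-pod /32 host routes are more specific and win the longest-prefix match. A minimal illustrative Python sketch:

```python
import ipaddress

# k8s-node-1 routes relevant to local pod traffic (from `route -n` above):
routes = {
    "172.16.109.64/26": "blackhole",         # whole local pod CIDR
    "172.16.109.66/32": "cali523840de229",   # host route for the destination pod
}

dst = ipaddress.ip_address("172.16.109.66")
best = max((p for p in routes if dst in ipaddress.ip_network(p)),
           key=lambda p: ipaddress.ip_network(p).prefixlen)
print(routes[best])   # cali523840de229: the /32 beats the /26 blackhole
```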
