This article covers deploying MetalLB v0.12.1 as the LoadBalancer implementation for a vanilla Kubernetes cluster, using both of MetalLB's deployment modes: Layer2 mode and BGP mode. Since BGP's principles and configuration are fairly involved, only a simple BGP setup is covered here.
The Kubernetes cluster used in this article is v1.23.6, deployed on CentOS 7 with docker and flannel. I have written before about Kubernetes fundamentals and cluster setup; readers who need that background can refer to those earlier posts.
Before we start, let's take a look at how MetalLB works.
MetalLB hooks into your Kubernetes cluster, and provides a network load-balancer implementation. In short, it allows you to create Kubernetes services of type LoadBalancer in clusters that don’t run on a cloud provider, and thus cannot simply hook into paid products to provide load balancers. It has two features that work together to provide this service: address allocation, and external announcement.
MetalLB is a concrete implementation of LoadBalancer for Kubernetes clusters, mainly used to expose in-cluster services to clients outside the cluster. It lets us create services of type LoadBalancer in a Kubernetes cluster without depending on a cloud provider's load balancer.
It has two parts that work together to provide this service: address allocation and external announcement, which correspond to the controller and the speaker workloads deployed in the cluster.
Address allocation is straightforward to understand: first we hand MetalLB a range of IPs, and it then assigns an IP to each LoadBalancer service according to that service's configuration. From the official docs we know the LoadBalancer IP can either be specified manually or allocated automatically by MetalLB; we can also configure several IP ranges in the MetalLB configmap and decide per range whether automatic allocation is enabled.
Address allocation is implemented by the controller, which runs as a deployment, watches the services in the cluster, and assigns IPs as needed.
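As a preview of the configmap format used later in this article, per-pool automatic allocation can be expressed roughly like the sketch below; the second pool's name and IP range are made up purely for illustration:

$ cat > configmap-multi-pool.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    # this pool may be handed out automatically
    - name: default
      protocol: layer2
      addresses:
      - 10.31.8.100-10.31.8.200
    # this pool is only used when a service explicitly asks for it
    - name: reserved
      protocol: layer2
      auto-assign: false
      addresses:
      - 10.31.9.0/24
EOF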
External announcement is about advertising the EXTERNAL-IP of LoadBalancer services to the network so that clients can actually reach that IP. MetalLB implements this in three ways: ARP, NDP, and BGP. ARP and NDP correspond to Layer2 mode for IPv4 and IPv6 respectively, while the BGP routing protocol corresponds to BGP mode.
External announcement is implemented by the speaker, which runs as a daemonset and either sends ARP/NDP packets on the network, or establishes sessions with BGP routers and sends BGP announcements.
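Once Layer2 mode is running (as set up later in this article), these announcements are visible with an ordinary packet capture on a node or on any machine in the same broadcast domain; the interface name eth0 and the VIP below are placeholders taken from the environment used later:

# Watch the ARP traffic for a MetalLB VIP (Layer2 / IPv4)
$ tcpdump -e -n -i eth0 arp and host 10.31.8.100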
Neither Layer2 mode nor BGP mode uses the Linux network stack for the VIP itself, which means we cannot use tools such as the ip command to pinpoint the node holding a VIP or the corresponding routes; what we do see is the IP bound to the kube-ipvs0 interface on every node. Also, both modes only steer traffic for the VIP to a particular node; how the requests then reach a pod, and under which scheduling rule, is entirely up to kube-proxy.
The two modes each have their own strengths, weaknesses, and limitations; let's deploy both first and then compare them.
Before deploying MetalLB, we need to make sure the environment meets the minimum requirements:
MetalLB publishes compatibility notes for the mainstream CNIs. Since it mostly relies on the cluster's own kube-proxy for traffic forwarding, it works well with the majority of CNIs.
CNI | Compatibility | Main issues |
---|---|---|
Calico | Mostly (see known issues) | Mostly around BGP mode; the community provides workarounds |
Canal | Yes | - |
Cilium | Yes | - |
Flannel | Yes | - |
Kube-ovn | Yes | - |
Kube-router | Mostly (see known issues) | The built-in external BGP peering mode is not supported |
Weave Net | Mostly (see known issues) | Support for externalTrafficPolicy: Local depends on the version |
From the compatibility list it is clear that most setups are fine; where problems do appear, the root cause is almost always a conflict with BGP. In fact, BGP-related compatibility issues exist in nearly every open-source Kubernetes load balancer.
From the list MetalLB publishes for cloud providers, compatibility is poor with most of them. The reason is simple: most cloud environments cannot run the BGP protocol, and whether the more generic Layer2 mode works depends on each provider's network, so it cannot be guaranteed either.
The short version is: cloud providers expose proprietary APIs instead of standard protocols to control their network layer, and MetalLB doesn’t work with those APIs.
Of course, if you are already on a cloud provider, the best option is simply to use the LoadBalancer service that the provider offers.
In this walkthrough, MetalLB is deployed on the v1.23.6 Kubernetes cluster built with docker and flannel:
IP / Subnet | Hostname / Role |
---|---|
10.31.8.1 | tiny-flannel-master-8-1.k8s.tcinternal |
10.31.8.11 | tiny-flannel-worker-8-11.k8s.tcinternal |
10.31.8.12 | tiny-flannel-worker-8-12.k8s.tcinternal |
10.8.64.0/18 | podSubnet |
10.8.0.0/18 | serviceSubnet |
10.31.8.100-10.31.8.200 | MetalLB IPpool |
To run Layer2 mode we need to enable strictARP in the cluster's kube-proxy IPVS configuration. Once it is enabled, the nodes stop answering ARP requests for the addresses bound to the kube-ipvs0 interface when those requests arrive on other interfaces, and MetalLB takes over responsibility for answering them.
Enabling strict ARP is equivalent to setting arp_ignore to 1 and arp_announce to 2, the same strict-ARP configuration applied to the real servers in LVS DR mode; see the explanation in my earlier article for details.
strict ARP configure arp_ignore and arp_announce to avoid answering ARP queries from kube-ipvs0 interface
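For reference, the host-level equivalent that LVS DR mode real servers use is just two sysctls. You normally do not need to run this yourself on the k8s nodes, since kube-proxy applies the behaviour once strictARP is enabled, but it shows what is happening underneath:

# Host-level equivalent of strict ARP (shown for comparison only)
$ sysctl -w net.ipv4.conf.all.arp_ignore=1
$ sysctl -w net.ipv4.conf.all.arp_announce=2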
# Check the current strictARP setting in kube-proxy
$ kubectl get configmap -n kube-system kube-proxy -o yaml | grep strictARP
strictARP: false
# Manually edit the configmap and set strictARP to true
$ kubectl edit configmap -n kube-system kube-proxy
configmap/kube-proxy edited
# Or modify it non-interactively and preview the diff first
$ kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl diff -f - -n kube-system
# Once the diff looks right, apply the change
$ kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl apply -f - -n kube-system
# Restart kube-proxy so the new configuration takes effect
$ kubectl rollout restart ds kube-proxy -n kube-system
# Confirm the setting took effect
$ kubectl get configmap -n kube-system kube-proxy -o yaml | grep strictARP
strictARP: true
Deploying MetalLB itself is straightforward. The project offers three installation methods: manifest files (plain yaml), Helm 3, and Kustomize. Here we stick with the manifest files.
To keep the steps short, most official tutorials just kubectl-apply a yaml URL directly. That is quick and convenient, but it leaves no local copy and makes later tweaks awkward, so I prefer to download the yaml files and keep them locally before deploying.
# Download the two v0.12.1 deployment manifests
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb.yaml
# If you plan to use frr for BGP routing, download these two manifests instead
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb-frr.yaml
After downloading the official yaml files, we prepare the configmap ahead of time. A reference file is available on GitHub, and Layer2 mode needs very little configuration, so we only define the most basic parameters:

  * protocol is set to layer2
  * addresses can be configured in bulk with CIDR notation (198.51.100.0/24) or as a first-last range (192.168.0.150-192.168.0.200); here we use a range of IPs in the same subnet as the k8s nodes

$ cat > configmap-metallb.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
namespace: metallb-system
name: config
data:
config: |
address-pools:
- name: default
protocol: layer2
addresses:
- 10.31.8.100-10.31.8.200
EOF
Next we can start the deployment itself, which breaks down into three steps:

  * create the namespace
  * deploy the controller (deployment) and the speaker (daemonset)
  * apply the configmap
# Create the namespace
$ kubectl apply -f namespace.yaml
namespace/metallb-system created
$ kubectl get ns
NAME STATUS AGE
default Active 8d
kube-node-lease Active 8d
kube-public Active 8d
kube-system Active 8d
metallb-system Active 8s
nginx-quic Active 8d
# Deploy the deployment and daemonset along with the other required resources
$ kubectl apply -f metallb.yaml
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/controller created
podsecuritypolicy.policy/speaker created
serviceaccount/controller created
serviceaccount/speaker created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
role.rbac.authorization.k8s.io/config-watcher created
role.rbac.authorization.k8s.io/pod-lister created
role.rbac.authorization.k8s.io/controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/config-watcher created
rolebinding.rbac.authorization.k8s.io/pod-lister created
rolebinding.rbac.authorization.k8s.io/controller created
daemonset.apps/speaker created
deployment.apps/controller created
# The main thing deployed here is the controller deployment, which watches service state
$ kubectl get deploy -n metallb-system
NAME READY UP-TO-DATE AVAILABLE AGE
controller 1/1 1 1 86s
# The speaker runs as a daemonset on every node to negotiate VIPs and send/receive ARP/NDP packets
$ kubectl get ds -n metallb-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
speaker 3 3 3 3 3 kubernetes.io/os=linux 64s
$ kubectl get pod -n metallb-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
controller-57fd9c5bb-svtjw 1/1 Running 0 117s 10.8.65.4 tiny-flannel-worker-8-11.k8s.tcinternal <none> <none>
speaker-bf79q 1/1 Running 0 117s 10.31.8.11 tiny-flannel-worker-8-11.k8s.tcinternal <none> <none>
speaker-fl5l8 1/1 Running 0 117s 10.31.8.12 tiny-flannel-worker-8-12.k8s.tcinternal <none> <none>
speaker-nw2fm 1/1 Running 0 117s 10.31.8.1 tiny-flannel-master-8-1.k8s.tcinternal <none> <none>
$ kubectl apply -f configmap-metallb.yaml
configmap/config created
As before, we define our own service for testing. The test image is based on nginx and by default returns the requesting client's IP and port.
$ cat > nginx-quic-lb.yaml <<EOF
apiVersion: v1
kind: Namespace
metadata:
name: nginx-quic
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-lb
namespace: nginx-quic
spec:
selector:
matchLabels:
app: nginx-lb
replicas: 4
template:
metadata:
labels:
app: nginx-lb
spec:
containers:
- name: nginx-lb
image: tinychen777/nginx-quic:latest
imagePullPolicy: IfNotPresent
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: nginx-lb-service
namespace: nginx-quic
spec:
externalTrafficPolicy: Cluster
internalTrafficPolicy: Cluster
selector:
app: nginx-lb
ports:
- protocol: TCP
port: 80 # match for service access port
targetPort: 80 # match for pod access port
type: LoadBalancer
loadBalancerIP: 10.31.8.100
EOF
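For reference, the tinychen777/nginx-quic image is expected to echo the client's address and port in the response body. I have not inspected the image itself, so this is only a guess at its behaviour, but a minimal nginx server block doing the same thing would look roughly like this:

$ cat > default.conf <<'EOF'
# Hypothetical config approximating the test image's behaviour:
# answer every request with the client's IP and source port
server {
    listen 80;
    location / {
        default_type text/plain;
        return 200 "$remote_addr:$remote_port";
    }
}
EOF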
Note that in the Service above we set the type field to LoadBalancer and specified loadBalancerIP as 10.31.8.100.
Note: not every LoadBalancer implementation allows loadBalancerIP to be set. If the implementation supports the field, the load balancer is created with the user-specified loadBalancerIP. If the field is not set, the load balancer is assigned an ephemeral IP. If loadBalancerIP is set but the implementation does not support the feature, the value is simply ignored.
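If you prefer automatic allocation, simply omit loadBalancerIP. With the v0.12 configmap format a service can also be pointed at a particular address pool through an annotation; the sketch below assumes the default pool defined earlier and, as far as I recall from the v0.12 docs, the metallb.universe.tf/address-pool annotation:

$ cat > nginx-lb-auto.yaml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-lb-auto
  namespace: nginx-quic
  annotations:
    # let MetalLB pick a free IP from the named pool instead of pinning one
    metallb.universe.tf/address-pool: default
spec:
  selector:
    app: nginx-lb
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer
EOF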
# Create the test service and check the result
$ kubectl apply -f nginx-quic-lb.yaml
namespace/nginx-quic created
deployment.apps/nginx-lb created
service/nginx-lb-service created
Check the service status: TYPE has changed to LoadBalancer and EXTERNAL-IP shows the 10.31.8.100 we specified.
# Check the service status; TYPE is now LoadBalancer
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.32.221 10.31.8.100 80:30181/TCP 25h
If we now inspect the nginx-lb-service object in the cluster, we can see the ClusterIP, the LoadBalancer VIP, the nodePort, and the traffic policy settings:
$ kubectl get svc -n nginx-quic nginx-lb-service -o yaml
apiVersion: v1
kind: Service
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"nginx-lb-service","namespace":"nginx-quic"},"spec":{"externalTrafficPolicy":"Cluster","internalTrafficPolicy":"Cluster","loadBalancerIP":"10.31.8.100","ports":[{"port":80,"protocol":"TCP","targetPort":80}],"selector":{"app":"nginx-lb"},"type":"LoadBalancer"}}
creationTimestamp: "2022-05-16T06:01:23Z"
name: nginx-lb-service
namespace: nginx-quic
resourceVersion: "1165135"
uid: f547842e-4547-4d01-abbc-89ac8b059a2a
spec:
allocateLoadBalancerNodePorts: true
clusterIP: 10.8.32.221
clusterIPs:
- 10.8.32.221
externalTrafficPolicy: Cluster
internalTrafficPolicy: Cluster
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
loadBalancerIP: 10.31.8.100
ports:
- nodePort: 30181
port: 80
protocol: TCP
targetPort: 80
selector:
app: nginx-lb
sessionAffinity: None
type: LoadBalancer
status:
loadBalancer:
ingress:
- ip: 10.31.8.100
Looking at the IPVS rules, we can see forwarding rules for the ClusterIP, the LoadBalancer VIP, and the nodePort; by default, creating a LoadBalancer also creates a nodePort service:
$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 172.17.0.1:30181 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
TCP 10.8.32.221:80 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
TCP 10.8.64.0:30181 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
TCP 10.8.64.1:30181 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
TCP 10.31.8.1:30181 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
TCP 10.31.8.100:80 rr
-> 10.8.65.15:80 Masq 1 0 0
-> 10.8.65.16:80 Masq 1 0 0
-> 10.8.66.12:80 Masq 1 0 0
-> 10.8.66.13:80 Masq 1 0 0
Use curl to check that the service responds correctly:
$ curl 10.31.8.100:80
10.8.64.0:60854
$ curl 10.8.1.166:80
10.8.64.0:2562
$ curl 10.31.8.1:30974
10.8.64.0:1635
$ curl 10.31.8.100:80
10.8.64.0:60656
On every k8s node, this LoadBalancer VIP is visible on the kube-ipvs0 interface:
$ ip addr show kube-ipvs0
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether 4e:ba:e8:25:cf:17 brd ff:ff:ff:ff:ff:ff
inet 10.8.0.1/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.8.0.10/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.8.32.221/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.31.8.100/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
Pinpointing which node currently holds the VIP is more tedious. We can take a machine on the same Layer 2 network as the cluster, look up the VIP in its ARP table, and then use the MAC address to find the corresponding node IP, working backwards to the node that owns the VIP.
$ arp -a | grep 10.31.8.100
? (10.31.8.100) at 52:54:00:5c:9c:97 [ether] on eth0
$ arp -a | grep 52:54:00:5c:9c:97
tiny-flannel-worker-8-12.k8s.tcinternal (10.31.8.12) at 52:54:00:5c:9c:97 [ether] on eth0
? (10.31.8.100) at 52:54:00:5c:9c:97 [ether] on eth0
$ ip a | grep 52:54:00:5c:9c:97
link/ether 52:54:00:5c:9c:97 brd ff:ff:ff:ff:ff:ff
Alternatively, we can check the speaker pod logs and find the entries where the service IP is announced:
$ kubectl logs -f -n metallb-system speaker-fl5l8
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:11:34.099204376Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:09.527334808Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:09.547734268Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:34.267651651Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:34.286130424Z"}
Attentive readers will have noticed that when we create a LoadBalancer service, Kubernetes automatically creates a nodePort for us by default. This behaviour is controlled by the allocateLoadBalancerNodePorts field in the Service, which defaults to true.
Different load balancer implementations work differently: some rely on the nodePort for traffic forwarding, while others forward requests straight to the pods. MetalLB forwards traffic directly to the pods via kube-proxy, so if we want to drop the nodePort we can set spec.allocateLoadBalancerNodePorts to false, and no nodePort will be allocated when the service is created.
Note, however, that if you change an existing service from true to false to turn the nodePort off, Kubernetes does not clean up the existing IPVS rules automatically; you have to remove them manually.
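If you do run into that, the stale entries can be removed by hand with ipvsadm on each node; this is only a sketch, and the addresses and port are simply the ones from the example above:

# Sketch: find and delete the leftover nodePort virtual servers on a node
$ ipvsadm -ln | grep 30181
# repeat for every local address that still carries the old nodePort
$ ipvsadm -D -t 10.31.8.1:30181
$ ipvsadm -D -t 172.17.0.1:30181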
We define and create the service again:
apiVersion: v1
kind: Service
metadata:
name: nginx-lb-service
namespace: nginx-quic
spec:
allocateLoadBalancerNodePorts: false
externalTrafficPolicy: Cluster
internalTrafficPolicy: Cluster
selector:
app: nginx-lb
ports:
- protocol: TCP
port: 80 # match for service access port
targetPort: 80 # match for pod access port
type: LoadBalancer
loadBalancerIP: 10.31.8.100
Checking the svc status and the IPVS rules again, the nodePort-related entries are gone:
$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.8.62.180:80 rr
-> 10.8.65.18:80 Masq 1 0 0
-> 10.8.65.19:80 Masq 1 0 0
-> 10.8.66.14:80 Masq 1 0 0
-> 10.8.66.15:80 Masq 1 0 0
TCP 10.31.8.100:80 rr
-> 10.8.65.18:80 Masq 1 0 0
-> 10.8.65.19:80 Masq 1 0 0
-> 10.8.66.14:80 Masq 1 0 0
-> 10.8.66.15:80 Masq 1 0 0
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.62.180 10.31.8.100 80/TCP 23s
If you flip spec.allocateLoadBalancerNodePorts of an existing service from true to false, the existing nodePort is not removed automatically, so it is best to settle these parameters when the service is first created.
$ kubectl get svc -n nginx-quic nginx-lb-service -o yaml | egrep " allocateLoadBalancerNodePorts: "
allocateLoadBalancerNodePorts: false
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.62.180 10.31.8.100 80:31405/TCP 85m
The network topology of the test environment is very simple. To keep it distinct from the Layer2 setup above, the MetalLB address pool is changed to 10.9.0.0/16. The details are as follows:
IP / Subnet | Hostname / Role |
---|---|
10.31.8.1 | tiny-flannel-master-8-1.k8s.tcinternal |
10.31.8.11 | tiny-flannel-worker-8-11.k8s.tcinternal |
10.31.8.12 | tiny-flannel-worker-8-12.k8s.tcinternal |
10.31.254.251 | OpenWrt |
10.9.0.0/16 | MetalLB BGP IPpool |
The three k8s nodes are directly attached to an OpenWrt router. OpenWrt acts as the gateway for the nodes and also runs BGP, routing requests for the VIPs used by MetalLB to the individual k8s nodes.
Before configuring anything we need to assign a private AS number to the router and one to the k8s nodes; the private AS number ranges documented on Wikipedia can be used as a reference. Here the router uses AS 64512 and MetalLB uses AS 64513.
Taking a typical home OpenWrt router as the example, we first install the quagga packages; if your OpenWrt build ships the frr packages, frr is the recommended choice.
If you are on another Linux distribution (such as CentOS or Debian), configuring frr directly is recommended.
First install quagga on OpenWrt with opkg:
$ opkg update
$ opkg install quagga quagga-zebra quagga-bgpd quagga-vtysh
If your OpenWrt version is recent enough, the frr packages can be installed directly with opkg as well:
$ opkg update
$ opkg install frr frr-babeld frr-bfdd frr-bgpd frr-eigrpd frr-fabricd frr-isisd frr-ldpd frr-libfrr frr-nhrpd frr-ospf6d frr-ospfd frr-pbrd frr-pimd frr-ripd frr-ripngd frr-staticd frr-vrrpd frr-vtysh frr-watchfrr frr-zebra
If you use frr, remember to enable the bgpd daemon in its configuration and then restart frr:
$ sed -i 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
$ /etc/init.d/frr restart
The service configuration below uses frr as the example; with quagga the procedure is almost identical, either through vtysh or by editing the configuration file directly.
Check that the services are listening on ports 2601 and 2605:
root@OpenWrt:~# netstat -ntlup | egrep "zebra|bgpd"
tcp 0 0 0.0.0.0:2601 0.0.0.0:* LISTEN 3018/zebra
tcp 0 0 0.0.0.0:2605 0.0.0.0:* LISTEN 3037/bgpd
Port 179, used by BGP, is not being listened on yet because we have not configured BGP. We can either configure it interactively with vtysh or edit the configuration file and restart the service.
Typing vtysh on the command line drops you into the vtysh configuration shell (similar to virsh in KVM virtualization); note how the prompt changes:
root@OpenWrt:~# vtysh
Hello, this is Quagga (version 1.2.4).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
OpenWrt#
Configuring everything on the command line is fairly tedious, so we can instead edit the configuration file directly and restart the service.
For quagga the BGP configuration file defaults to /etc/quagga/bgpd.conf; the exact path may differ between distributions and installation methods.
$ cat /etc/quagga/bgpd.conf
!
! Zebra configuration saved from vty
! 2022/05/19 11:01:35
!
password zebra
!
router bgp 64512
bgp router-id 10.31.254.251
neighbor 10.31.8.1 remote-as 64513
neighbor 10.31.8.1 description 10-31-8-1
neighbor 10.31.8.11 remote-as 64513
neighbor 10.31.8.11 description 10-31-8-11
neighbor 10.31.8.12 remote-as 64513
neighbor 10.31.8.12 description 10-31-8-12
maximum-paths 3
!
address-family ipv6
exit-address-family
exit
!
access-list vty permit 127.0.0.0/8
access-list vty deny any
!
line vty
access-class vty
!
If you use frr, the file to edit is /etc/frr/frr.conf instead; again, the path may differ between distributions and installation methods.
$ cat /etc/frr/frr.conf
frr version 8.2.2
frr defaults traditional
hostname tiny-openwrt-plus
!
password zebra
!
router bgp 64512
bgp router-id 10.31.254.251
no bgp ebgp-requires-policy
neighbor 10.31.8.1 remote-as 64513
neighbor 10.31.8.1 description 10-31-8-1
neighbor 10.31.8.11 remote-as 64513
neighbor 10.31.8.11 description 10-31-8-11
neighbor 10.31.8.12 remote-as 64513
neighbor 10.31.8.12 description 10-31-8-12
!
address-family ipv4 unicast
exit-address-family
exit
!
access-list vty seq 5 permit 127.0.0.0/8
access-list vty seq 10 deny any
!
line vty
access-class vty
exit
!
Restart the service after completing the configuration:
# Restart frr
$ /etc/init.d/frr restart
# Restart quagga
$ /etc/init.d/quagga restart
After the restart, enter vtysh and check the BGP status:
tiny-openwrt-plus# show ip bgp summary
IPv4 Unicast Summary (VRF default):
BGP router identifier 10.31.254.251, local AS number 64512 vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 3, using 2149 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
10.31.8.1 4 64513 0 0 0 0 0 never Active 0 10-31-8-1
10.31.8.11 4 64513 0 0 0 0 0 never Active 0 10-31-8-11
10.31.8.12 4 64513 0 0 0 0 0 never Active 0 10-31-8-12
Total number of neighbors 3
Looking at the router's listening ports again, we can see that BGP is now up and running:
$ netstat -ntlup | egrep "zebra|bgpd"
tcp 0 0 127.0.0.1:2605 0.0.0.0:* LISTEN 31625/bgpd
tcp 0 0 127.0.0.1:2601 0.0.0.0:* LISTEN 31618/zebra
tcp 0 0 0.0.0.0:179 0.0.0.0:* LISTEN 31625/bgpd
tcp 0 0 :::179 :::* LISTEN 31625/bgpd
First we modify the configmap:
apiVersion: v1
kind: ConfigMap
metadata:
namespace: metallb-system
name: config
data:
config: |
peers:
- peer-address: 10.31.254.251
peer-port: 179
peer-asn: 64512
my-asn: 64513
address-pools:
- name: default
protocol: bgp
addresses:
- 10.9.0.0/16
Once the changes are made, we re-apply the configmap and check MetalLB's state:
$ kubectl apply -f configmap-metal.yaml
configmap/config configured
$ kubectl get cm -n metallb-system config -o yaml
apiVersion: v1
data:
config: |
peers:
- peer-address: 10.31.254.251
peer-port: 179
peer-asn: 64512
my-asn: 64513
address-pools:
- name: default
protocol: bgp
addresses:
- 10.9.0.0/16
kind: ConfigMap
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","data":{"config":"peers:\n- peer-address: 10.31.254.251\n peer-port: 179\n peer-asn: 64512\n my-asn: 64513\naddress-pools:\n- name: default\n protocol: bgp\n addresses:\n - 10.9.0.0/16\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"config","namespace":"metallb-system"}}
creationTimestamp: "2022-05-16T04:37:54Z"
name: config
namespace: metallb-system
resourceVersion: "1412854"
uid: 6d94ca36-93fe-4ea2-9407-96882ad8e35c
From the router we can now see that BGP sessions have been established with all three k8s nodes:
tiny-openwrt-plus# show ip bgp summary
IPv4 Unicast Summary (VRF default):
BGP router identifier 10.31.254.251, local AS number 64512 vrf-id 0
BGP table version 3
RIB entries 5, using 920 bytes of memory
Peers 3, using 2149 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
10.31.8.1 4 64513 6 4 0 0 0 00:00:45 3 3 10-31-8-1
10.31.8.11 4 64513 6 4 0 0 0 00:00:45 3 3 10-31-8-11
10.31.8.12 4 64513 6 4 0 0 0 00:00:45 3 3 10-31-8-12
Total number of neighbors 3
If the BGP session from a particular node fails to come up, restart the speaker pod on that node to retry the session:
$ kubectl delete po speaker-fl5l8 -n metallb-system
After the configmap change takes effect, the EXTERNAL-IP of existing services is not reallocated:
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.4.92 10.31.8.100 80/TCP 18h
At this point we can restart the controller so that it reallocates EXTERNAL-IPs for our services:
$ kubectl delete po -n metallb-system controller-57fd9c5bb-svtjw
pod "controller-57fd9c5bb-svtjw" deleted
Once the restart finishes, we check the svc status again. If the LoadBalancer VIP had been assigned automatically (i.e. no loadBalancerIP was specified), the service would already have picked up a new IP and be running normally. Our service's loadBalancerIP, however, was manually pinned to 10.31.8.100 earlier, so its EXTERNAL-IP now shows pending.
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.4.92 <pending> 80/TCP 18h
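To move the service into the new BGP pool we need to change loadBalancerIP. One way to do that in place, instead of editing and re-applying the original yaml, is a kubectl patch (shown here only as a sketch; editing the manifest works just as well):

# Point the service at an address inside the 10.9.0.0/16 BGP pool
$ kubectl patch svc nginx-lb-service -n nginx-quic -p '{"spec":{"loadBalancerIP":"10.9.1.1"}}'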
After changing loadBalancerIP to 10.9.1.1, the service is healthy again:
$ kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.4.92 10.9.1.1 80/TCP 18h
The controller logs show what happened:
$ kubectl logs controller-57fd9c5bb-d6jsl -n metallb-system
{"branch":"HEAD","caller":"level.go:63","commit":"v0.12.1","goversion":"gc / go1.16.14 / amd64","level":"info","msg":"MetalLB controller starting version 0.12.1 (commit v0.12.1, branch HEAD)","ts":"2022-05-18T03:45:45.440872105Z","version":"0.12.1"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-05-18T03:45:45.610395481Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","event":"clearAssignment","level":"info","msg":"current IP not allowed by config, clearing","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611009691Z"}
{"caller":"level.go:63","event":"clearAssignment","level":"info","msg":"user requested a different IP than the one currently assigned","reason":"differentIPRequested","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611062419Z"}
{"caller":"level.go:63","error":"controller not synced","level":"error","msg":"controller not synced yet, cannot allocate IP; will retry after sync","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611080525Z"}
{"caller":"level.go:63","event":"stateSynced","level":"info","msg":"controller synced, can allocate IPs now","ts":"2022-05-18T03:45:45.611117023Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","event":"clearAssignment","level":"info","msg":"current IP not allowed by config, clearing","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617013146Z"}
{"caller":"level.go:63","event":"clearAssignment","level":"info","msg":"user requested a different IP than the one currently assigned","reason":"differentIPRequested","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617089367Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","level":"error","msg":"IP allocation failed","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617122976Z"}
{"caller":"level.go:63","event":"serviceUpdated","level":"info","msg":"updated service object","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.626039403Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","level":"error","msg":"IP allocation failed","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.626361986Z"}
{"caller":"level.go:63","event":"ipAllocated","ip":["10.9.1.1"],"level":"info","msg":"IP address assigned by controller","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.943434144Z"}
In the speaker logs we can see the BGP session with the router being established, the errors complaining that the now-invalid loadBalancerIP 10.31.8.100 is not allowed by the config, and the entries advertising a BGP route for loadBalancerIP 10.9.1.1:
$ kubectl logs -n metallb-system speaker-bf79q
{"caller":"level.go:63","configmap":"metallb-system/config","event":"peerAdded","level":"info","msg":"peer configured, starting BGP session","peer":"10.31.254.251","ts":"2022-05-18T03:41:55.046091105Z"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-05-18T03:41:55.046268735Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:41:55.051955069Z"}
struct { Version uint8; ASN16 uint16; HoldTime uint16; RouterID uint32; OptsLen uint8 }{Version:0x4, ASN16:0xfc00, HoldTime:0xb4, RouterID:0xa1ffefd, OptsLen:0x1e}
{"caller":"level.go:63","event":"sessionUp","level":"info","localASN":64513,"msg":"BGP session established","peer":"10.31.254.251:179","peerASN":64512,"ts":"2022-05-18T03:41:55.052734174Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-05-18T03:42:40.183574415Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.31.8.12","node event":"NodeLeave","node name":"tiny-flannel-worker-8-12.k8s.tcinternal","ts":"2022-05-18T03:44:03.649494062Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:44:03.655003303Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.31.8.12","node event":"NodeJoin","node name":"tiny-flannel-worker-8-12.k8s.tcinternal","ts":"2022-05-18T03:44:06.247929645Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:44:06.25369106Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["10.9.1.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.953729779Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.9.1.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.953912236Z"}
We test from an arbitrary machine outside the cluster:
$ curl -v 10.9.1.1
* About to connect() to 10.9.1.1 port 80 (#0)
* Trying 10.9.1.1...
* Connected to 10.9.1.1 (10.9.1.1) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.9.1.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx
< Date: Wed, 18 May 2022 04:17:41 GMT
< Content-Type: text/plain
< Content-Length: 16
< Connection: keep-alive
<
10.8.64.0:43939
* Connection #0 to host 10.9.1.1 left intact
Looking at the routing table on the router again, there is now a /32 route for 10.9.1.1 with multiple next-hop IPs, which shows that ECMP has been enabled successfully.
tiny-openwrt-plus# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/0] via 10.31.254.254, eth0, 00:04:52
B>* 10.9.1.1/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:01:40
* via 10.31.8.11, eth0, weight 1, 00:01:40
* via 10.31.8.12, eth0, weight 1, 00:01:40
C>* 10.31.0.0/16 is directly connected, eth0, 00:04:52
We create a few more services to test with:
# kubectl get svc -n nginx-quic
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb-service LoadBalancer 10.8.4.92 10.9.1.1 80/TCP 23h
nginx-lb2-service LoadBalancer 10.8.10.48 10.9.1.2 80/TCP 64m
nginx-lb3-service LoadBalancer 10.8.6.116 10.9.1.3 80/TCP 64m
Then look at the router's state again:
tiny-openwrt-plus# show ip bgp
BGP table version is 3, local router ID is 10.31.254.251, vrf id 0
Default local pref 100, local AS 64512
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
*= 10.9.1.1/32 10.31.8.12 0 64513 ?
*> 10.31.8.1 0 64513 ?
*= 10.31.8.11 0 64513 ?
*= 10.9.1.2/32 10.31.8.12 0 64513 ?
*> 10.31.8.1 0 64513 ?
*= 10.31.8.11 0 64513 ?
*= 10.9.1.3/32 10.31.8.12 0 64513 ?
*> 10.31.8.1 0 64513 ?
*= 10.31.8.11 0 64513 ?
Displayed 3 routes and 9 total paths
tiny-openwrt-plus# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/0] via 10.31.254.254, eth0, 00:06:12
B>* 10.9.1.1/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
* via 10.31.8.11, eth0, weight 1, 00:03:00
* via 10.31.8.12, eth0, weight 1, 00:03:00
B>* 10.9.1.2/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
* via 10.31.8.11, eth0, weight 1, 00:03:00
* via 10.31.8.12, eth0, weight 1, 00:03:00
B>* 10.9.1.3/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
* via 10.31.8.11, eth0, weight 1, 00:03:00
* via 10.31.8.12, eth0, weight 1, 00:03:00
C>* 10.31.0.0/16 is directly connected, eth0, 00:06:12
ECMP is only working when the routing table shows multiple next-hop IPs for our LoadBalancer IP; otherwise, go back and check the BGP configuration.
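If the route keeps showing only a single next hop, one thing worth checking on an frr-based router is whether BGP multipath is allowed for the address family. Depending on how frr was built it is usually enabled by default, but it can also be set explicitly; a sketch using vtysh:

# Sketch: explicitly allow up to 3 equal-cost BGP paths in frr
$ vtysh -c 'configure terminal' \
        -c 'router bgp 64512' \
        -c 'address-family ipv4 unicast' \
        -c 'maximum-paths 3' \
        -c 'end' \
        -c 'write memory'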
For Layer2 mode:

Advantages:

Disadvantages:

Improvement options:
The advantages and disadvantages of BGP mode are almost the mirror image of Layer2 mode.
Advantages:

Disadvantages:
The hashes used by routers are usually not stable, so whenever the size of the backend set changes (for example, when a node's BGP session goes down), existing connections are effectively rehashed at random. That means most existing connections suddenly end up forwarded to a different backend, one that has no relation to, and holds no state for, the previous one.
Improvement options:

MetalLB suggests several mitigations, listed below for reference:
Here I try to summarize some objective facts as neutrally as possible; whether each one counts as an advantage or a disadvantage may vary from person to person:
Overall, MetalLB, an open-source load balancer still in beta, fills an important gap in this space nicely and has influenced a number of later open-source projects of the same kind. From the angle of real production use, though, my impression is that it currently sits closer to "available and workable" than genuinely pleasant to use. Considering that MetalLB began as a personal open-source project and only recently came under the maintenance of a dedicated organization, that is understandable, and I hope it continues to improve.