nic bond 2022-12-13

ref: https://www.kernel.org/doc/Documentation/networking/bonding.txt
NIC bonding:
bandwidth sharing / load balancing + fault tolerance

overview:

The Linux bonding driver provides a method for aggregating multiple network interface controllers (NICs) into a single logical bonded interface of two or more so-called (NIC) slaves. The majority of modern Linux distributions come with a Linux kernel which has the Linux bonding driver integrated as a loadable kernel module and the ifenslave (if = [network] interface) user-level control program pre-installed.

The behavior of the bonded interface depends upon the mode.

7 modes:

Modes for the Linux bonding driver (network interface aggregation modes) are supplied as parameters to the kernel bonding module at load time. Modes may be given as command-line arguments to the insmod or modprobe commands, but are usually specified in a Linux distribution-specific configuration file.

A bonding mode specifies the policy indicating how bonding slaves are used during network transmission. The default parameter is balance-rr.
A bonding interface can only use one mode at a time.
The choice of mode is dependent on the network topology, requirements for the bonding behaviors, and characteristics of the slave devices.
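
For illustration, a minimal sketch of choosing the mode at module load time (the mode and miimon values are only examples; most distributions set this in their own configuration files instead, as shown later):
# load the bonding driver with an explicit mode and a link-monitoring interval of 100 ms
modprobe bonding mode=balance-rr miimon=100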

  • mode 0 (balance-rr, default)
    Round-robin policy. Transmits packets in sequential order from the first available slave through the last. This mode provides load balancing and fault tolerance.
    In mode 0, all NICs bound into the bond are set to the same MAC address. If these NICs are connected to the same switch, the switch's MAC table ends up with multiple ports for that one MAC address, so which port should the switch forward a frame destined for that MAC out of? Normally a MAC address is globally unique, and one MAC address mapping to multiple ports will certainly confuse the switch. So if a mode 0 bond is connected to a single switch, those switch ports should be aggregated (Cisco calls this EtherChannel, Foundry calls it a port group), because once the switch aggregates them, the aggregated ports are also bundled behind one MAC address. Alternatively, the two NICs can be connected to different switches.
    Drawback: if packets of one connection or session are sent out of different interfaces and then travel over different links, they are very likely to arrive out of order at the client; out-of-order packets have to be retransmitted, so network throughput drops.

  • mode 1 (active-backup)
    Active-backup policy (one slave active, the rest on standby). Establishes that only one slave in the bond is active. A different slave becomes active if, and only if, the active slave fails. The bond's MAC address is externally visible on only one switch port (network adapter) to avoid confusing the switch. This mode provides fault tolerance. The primary option affects the behavior of this mode.
    Drawback: low resource utilization.

  • mode 2 (balance-xor)
    XOR (hash-based) balancing policy. Transmits based on the selected transmit hash policy, which can be altered via the xmit_hash_policy option. This mode provides load balancing and fault tolerance.

  • mode 3 (broadcast)
    Broadcast policy. Transmits everything on all slave interfaces. This mode provides fault tolerance. Every outgoing packet is replicated onto every slave and sent out.

  • mode 4 (802.3ad, aka Dynamic link aggregation)
    Dynamic link aggregation. IEEE 802.3ad Dynamic link aggregation policy. Creates aggregation groups that share the same speed and duplex settings. Utilizes all slaves in the active aggregator according to the 802.3ad specification.
    Requires a switch that supports IEEE 802.3ad Dynamic link aggregation.
    Link Aggregation is a computer-networking term for bundling multiple physical ports into one logical port so that inbound/outbound traffic is load-shared across the member ports; the switch decides, based on the user-configured load-sharing policy, which member port a frame is sent out of towards the peer switch. Link aggregation is an important technique for increasing link bandwidth and for providing transmission resilience and redundancy. See "about Link Aggregation and LACP" below for details.

  • mode 5 (balance-tlb)
    Adaptive transmit load balancing. Establishes channel bonding that does not require any special switch support. The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave. Incoming traffic is received by the current slave. If the receiving slave fails, another slave takes over the MAC address of the failed receiving slave.
    In other words: no special switch support is needed; outgoing traffic is spread over the slaves according to each slave's current load (computed relative to its speed); if the slave currently receiving traffic fails, another slave takes over the failed slave's MAC address. This implies that each slave still keeps its own MAC address.

  • mode 6 (balance-alb)
    Adaptive load balancing. Includes balance-tlb (transmit load balancing) plus receive load balancing (rlb) for IPv4 traffic, and does not require any special switch support.
    In mode 6 the outside world sees one IP and multiple MACs: one slave's MAC address is identical to the bond's MAC address, while the remaining slaves each keep their own independent MAC.

    • a. The receive-load balancing is achieved by ARP negotiation.
      The bonding driver intercepts the ARP replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond. Thus, different peers use different hardware addresses(mac addresses) for the server.
      When the system answers an ARP request, the bonding driver rewrites the source MAC of the ARP reply from the bond's MAC to the MAC of one of the slaves in the bond, leaving the source IP unchanged. As a result, different peers receive different ARP replies: for the same IP, different peers cache different MACs, so at the data-link layer the switch forwards frames from different peers out of different ports.
    • b. Receive traffic from connections created by the server is also balanced.
      When the local system sends an ARP Request the bonding driver copies and saves the peer's IP information from the ARP packet. When the ARP Reply arrives from the peer, its hardware address is retrieved and the bonding driver initiates an ARP reply to this peer assigning it to one of the slaves in the bond.
      The system sends an ARP Request; when the ARP Reply arrives from the peer, the bonding driver extracts the peer's MAC address from it and then generates a new ARP reply to that peer, assigning the peer to one of the slaves in the bond.

In short: when the system answers ARP, the bonding driver hijacks the reply so that the MAC in the answer is that of one of the slaves; and when the system receives an ARP reply from outside, the bonding driver hijacks it as well, so that in the end only one particular slave ends up receiving that reply.

The receive load is distributed sequentially (round robin) among the group of highest speed slaves in the bond.

A problematic outcome of using ARP negotiation for balancing is that each time that an ARP request is broadcast it uses the hardware address of the bond. Hence, peers learn the hardware address of the bond and the balancing of receive traffic collapses to the current slave. (The problem is that in mode 6, whenever an ARP request is broadcast, peers always learn the fixed hardware address of the bond, which works against the goal described in point a above.)
This is handled by sending updates (ARP Replies) to all the peers with their individually assigned hardware address such that the traffic is redistributed. Receive traffic is also redistributed when a new slave is added to the bond and when an inactive slave is re-activated.
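
To see which slave MAC is actually handed out in those ARP replies, one could watch ARP traffic on the bond with tcpdump (a sketch; assumes the bond is named bond1 and tcpdump is installed):
# -e prints the link-layer (MAC) headers, -nn disables name resolution
tcpdump -enni bond1 arp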

Switch support per mode:

Some modes require support from the upstream switch:

  • Switch configuration required:

    • balance-rr / balance-xor / broadcast. These theoretically require static aggregation (multiple active physical ports present a single MAC to the outside, so the directly connected upstream switch needs to do static port aggregation); see the switch-side sketch after this list.
    • 802.3ad (the switch must do dynamic port aggregation, i.e. Dynamic Link Aggregation)
      The Link Aggregation Control Protocol (LACP) provides a way to control the bundling of several physical interfaces into a single logical channel. By sending LACP packets to the peer (a directly connected device that also implements LACP), it allows network devices to negotiate the automatic aggregation of links. The bundled interfaces share one logical address (such as an IP) or one physical address (such as a MAC address).
  • No switch configuration required:

    • active-backup [only one physical port is active and the whole bond presents a single MAC address to the outside, so this is just a regular direct connection to the switch. The bond's MAC address is externally visible on only one port (network adapter) to avoid confusing the switch.]
    • balance-tlb / balance-alb (each active port keeps its own MAC address, so this is just a regular direct connection to the switch)
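
The switch-side counterpart looks roughly like the following on a Cisco IOS switch (a sketch only, assuming ports Gi0/1-2 face the two slaves; other vendors use different syntax):
interface range GigabitEthernet0/1 - 2
 channel-group 1 mode on
! "mode on" is static aggregation, matching balance-rr / balance-xor / broadcast;
! for an 802.3ad bond, use "channel-group 1 mode active" to negotiate the bundle via LACP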

Generally speaking,

  • mode=0 (balance-rr); mode=2 (balance-xor); mode=4 (802.3ad); mode=5 (balance-tlb); mode=6 (balance-alb) are used for single switch topologies.
  • the remaining two modes, mode=1 (active-backup) and mode=3 (broadcast), naturally fit multiple-switch topologies as well.

xmit_hash_policy:

Selects the transmit hash policy to use for slave selection in balance-xor, 802.3ad, and tlb modes.
xmit_hash_policy is the second most important parameter after mode.
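
Like the slave list, the policy can be read and changed through sysfs on a live bond (a sketch, reusing the bond1 from the examples below; depending on the kernel, some parameters are only writable while the bond is down). layer3+4 hashes on IP addresses and ports, so different flows can land on different slaves:
cat /sys/class/net/bond1/bonding/xmit_hash_policy
echo layer3+4 > /sys/class/net/bond1/bonding/xmit_hash_policy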

about Link Aggregation and LACP:

Link Aggregation is the general idea. Network architects can implement aggregation at any of the lowest three layers of the OSI model, i.e. at the physical layer, the data-link layer, or the network layer.

LACP is a concrete protocol for doing Link Aggregation at the data-link layer, i.e. one implementation of that idea.

Ethernet bandwidths historically have increased tenfold each generation: 10 megabit/s, 100 Mbit/s, 1000 Mbit/s, 10,000 Mbit/s. If one started to bump into bandwidth ceilings, then the only option was to move to the next generation, which could be cost prohibitive. An alternative solution, introduced by many of the network manufacturers in the early 1990s, is to use link aggregation to combine two physical Ethernet links into one logical link. Most of these early solutions required manual configuration and identical equipment on both sides of the connection.
When Ethernet bandwidth needs an upgrade but moving to the next generation is too expensive, what can you do? Combine several Ethernet links into one logical link and push the aggregate bandwidth up that way.

Within the IEEE Ethernet standards, the Link Aggregation Control Protocol (LACP) provides a method to control the bundling of several physical links together to form a single logical link. LACP allows a network device to negotiate an automatic bundling of links by sending LACP packets to their peer, a directly connected device that also implements LACP.

how to bond NICs

Configuring a bonding device, like much other configuration on Linux, comes in two forms: temporary (runtime) configuration and persistent configuration.

  • Temporary (runtime) configuration
    Modifying the bonding device configuration at runtime: the current bonding device configuration can be changed on the fly through sysfs.
    Reading the bonding device configuration at runtime: each bonding device has a read-only file residing in the /proc/net/bonding directory. For example:
# Kick eth1 out of bond1 on the fly via sysfs
[root@TENCENT64 /etc/sysconfig/network-scripts]# echo -eth1 > /sys/class/net/bond1/bonding/slaves
# After eth1 is kicked out, the live configuration of bond1 no longer contains eth1
[root@TENCENT64 /etc/sysconfig/network-scripts]# cat /proc/net/bonding/bond1 | grep eth1
# After eth1 is kicked out, the NIC itself also goes from up to down, so ifconfig shows nothing for eth1
[root@TENCENT64 /etc/sysconfig/network-scripts]# ifconfig | grep eth1
# Add eth1 back into bond1 on the fly via sysfs
[root@TENCENT64 /etc/sysconfig/network-scripts]# echo +eth1 > /sys/class/net/bond1/bonding/slaves
[root@TENCENT64 /etc/sysconfig/network-scripts]# cat /proc/net/bonding/bond1 | grep eth1
Slave Interface: eth1
[root@TENCENT64 /etc/sysconfig/network-scripts]# ifconfig | grep eth1
eth1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 1500
[root@TENCENT64 /etc/sysconfig/network-scripts]# 
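
Besides adding and removing slaves, bond devices themselves can be created and deleted at runtime through the bonding_masters file (per the kernel bonding documentation; bond2 is just an illustrative name):
modprobe bonding
# create a new bond device named bond2
echo +bond2 > /sys/class/net/bonding_masters
# and delete it again
echo -bond2 > /sys/class/net/bonding_masters
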
interesting case

Interfaces may be enslaved to a bond using the file /sys/class/net/<bond>/bonding/slaves.

To enslave interface eth0 to bond bond0:
# ifconfig bond0 up
# echo +eth0 > /sys/class/net/bond0/bonding/slaves

When an interface is enslaved to a bond, symlinks between the two are created in the sysfs filesystem.
In this case, you would get

  • /sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0
  • /sys/class/net/eth0/master pointing to /sys/class/net/bond0.
    (This forms a loop: bond0's slave link points to eth0, and eth0's master link points back to bond0.)

This means that you can tell quickly whether or not an interface is enslaved by looking for its master symlink. Thus:
echo -eth0 > /sys/class/net/eth0/master/bonding/slaves
will free eth0 from whatever bond it is enslaved to, regardless of the name of the bond interface.
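
This also gives a quick one-liner to find out which bond, if any, an interface currently belongs to (a sketch):
# prints the bond's sysfs path if eth0 is enslaved, fails otherwise
readlink -f /sys/class/net/eth0/master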

  • Persistent configuration
    /etc/sysconfig/network-scripts/ifcfg-<nicname> holds a NIC's persistent configuration:
[root@TENCENT64 /sys/class/net/bond1/bonding]# cat miimon 
50
[root@TENCENT64 /sys/class/net/bond1/bonding]# cat /proc/net/bonding/bond1 | grep "MII Polling Interval"
MII Polling Interval (ms): 50
# Changing bond1's miimon via sysfs takes effect immediately
[root@TENCENT64 /sys/class/net/bond1/bonding]# echo 70 > miimon
[root@TENCENT64 /sys/class/net/bond1/bonding]# cat /proc/net/bonding/bond1 | grep "MII Polling Interval"
MII Polling Interval (ms): 70
# But the miimon setting in /etc/sysconfig/network-scripts/ifcfg-bond1 is still 100
# After a network restart that file is read again and miimon goes back to 100. That is what persistence means here.
[root@TENCENT64 /sys/class/net/bond1/bonding]# cat /etc/sysconfig/network-scripts/ifcfg-bond1 | grep miimon
BONDING_OPTS='mode=4 miimon=100 lacp_rate=fast xmit_hash_policy=layer3+4'
[root@TENCENT64 /sys/class/net/bond1/bonding]# 
[root@TENCENT64 /sys/class/net/bond1/bonding]# cat /etc/sysconfig/network-scripts/ifcfg-bond1
#IP Config for bond1:
DEVICE=bond1
ONBOOT=yes
BOOTPROTO=static
NM_CONTROLLED=yes
DELAY=0
IPADDR='100.119.aa.xx'
NETMASK='255.255.255.192'
GATEWAY='100.119.aa.x'
BONDING_OPTS='mode=4 miimon=100 lacp_rate=fast xmit_hash_policy=layer3+4'
ETHTOOL_OPTS="-K bond1 tso off gso off lro off"
[root@TENCENT64 /sys/class/net/bond1/bonding]# 
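
The slave NICs have matching persistent files as well. A minimal sketch of what /etc/sysconfig/network-scripts/ifcfg-eth0 could look like for a slave of bond1 (directives follow the RHEL-style network-scripts; values are illustrative):
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=none
# enslave this NIC to bond1 at network startup
MASTER=bond1
SLAVE=yes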

What does the system do when a bond is configured?

The bond1 used in the experiment is as follows:

  • bond1 consists of eth0 and eth1
  • bond1's MAC is 6c:92:bf:be:64:84
[root@TENCENT64 /etc/sysconfig/network-scripts]# ifconfig
bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 100.119.xx.xx  netmask 255.255.255.192  broadcast 100.119.21.63
        inet6 fe80::6e92:bfff:febe:6484  prefixlen 64  scopeid 0x20<link>
        ether 6c:92:bf:be:64:84  txqueuelen 1000  (Ethernet)
        ...

eth0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 1500
        ether 6c:92:bf:be:64:84  txqueuelen 1000  (Ethernet)
        ...

eth1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 1500
        ether 6c:92:bf:be:64:84  txqueuelen 1000  (Ethernet)
        ...

lo: ...

Here eth0's original MAC is 6c:92:bf:be:64:84, identical to the bond's MAC: "The bond device will take-on the MAC address of the first slave that is added to it and if this slave is removed, it will not automatically change the address of the bond."
eth1's original MAC, on the other hand, is 6c:92:bf:be:64:85, as shown below:

[root@TENCENT64 /etc/sysconfig/network-scripts]# echo -eth1 > /sys/class/net/bond1/bonding/slaves
[root@TENCENT64 /etc/sysconfig/network-scripts]# cat /sys/class/net/eth1/address 
6c:92:bf:be:64:85
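
The permanent (burned-in) address of a slave can also be read with ethtool at any time, even while the slave is carrying the bond's MAC (assuming ethtool is available):
# prints the hardware's permanent MAC, independent of the currently assigned address
ethtool -P eth1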

With the help of dmesg, we can look at what the system does when the bond membership is changed:

[root@TENCENT64 /etc/sysconfig/network-scripts]# echo -eth0 > /sys/class/net/bond1/bonding/slaves
[root@TENCENT64 /proc/net/bonding]# dmesg -T | tail -n 10
...
[Thu Dec 15 15:43:34 2022] bond1: Releasing backup interface eth0
[Thu Dec 15 15:43:34 2022] bond1: the permanent HWaddr of eth0 - 6c:92:bf:be:64:84 - is still in use by bond1 - set the HWaddr of eth0 to a different address to avoid conflicts 
=> Explanation: at this point bond1 still has another slave (eth1), bond1 is still active, and it is still using 6c:92:bf:be:64:84,
=> which is eth0's real (permanent) MAC address, so eth0 first has to be set to a temporary MAC.
[Thu Dec 15 15:43:34 2022] i40e 0000:1a:00.0 eth0: already using mac address 6c:92:bf:be:64:84
=> By now eth0 is already down, so it can take back its original MAC address without causing a conflict.
[root@TENCENT64 /proc/net/bonding]# 

-------

[root@TENCENT64 /etc/sysconfig/network-scripts]# echo +eth0 > /sys/class/net/bond1/bonding/slaves
[root@TENCENT64 /proc/net/bonding]# dmesg -T | tail -n 10
...
[Thu Dec 15 15:45:14 2022] i40e 0000:1a:00.0 eth0: already using mac address 6c:92:bf:be:64:84
[Thu Dec 15 15:45:14 2022] 8021q: adding VLAN 0 to HW filter on device eth0
[Thu Dec 15 15:45:14 2022] bond1: Enslaving eth0 as a backup interface with an up link

-------

[root@TENCENT64 /etc/sysconfig/network-scripts]# echo -eth1 > /sys/class/net/bond1/bonding/slaves
[root@TENCENT64 /proc/net/bonding]# dmesg -T | tail -n 10
...
[Thu Dec 15 15:46:47 2022] bond1: Releasing backup interface eth1
[Thu Dec 15 15:46:47 2022] i40e 0000:1a:00.1 eth1: returning to hw mac address 6c:92:bf:be:64:85
=> eth1 can directly take back its original MAC 6c:92:bf:be:64:85, because that MAC does not conflict with bond1's MAC.
[root@TENCENT64 /proc/net/bonding]# 

-------

echo +eth1 > /sys/class/net/bond1/bonding/slaves
[root@TENCENT64 /proc/net/bonding]# dmesg -T | tail -n 10
...
[Thu Dec 15 15:48:12 2022] i40e 0000:1a:00.1 eth1: set new mac address 6c:92:bf:be:64:84
[Thu Dec 15 15:48:12 2022] 8021q: adding VLAN 0 to HW filter on device eth1
[Thu Dec 15 15:48:12 2022] bond1: Enslaving eth1 as a backup interface with an up link
[root@TENCENT64 /proc/net/bonding]# 

Create a Channel Bonding Interface

To create a channel bonding interface, create a file in the /etc/sysconfig/network-scripts/ directory called ifcfg-bondN, replacing N with the number for the interface, such as 0.

The contents of the file can be based on a configuration file for whatever type of interface is getting bonded, such as an Ethernet interface. The essential differences are that the DEVICE directive is bondN, replacing N with the number for the interface, and TYPE=Bond. In addition, set BONDING_MASTER=yes.

Example ifcfg-bond0 Interface Configuration File

An example of a channel bonding interface.

DEVICE=bond0
NAME=bond0
TYPE=Bond
BONDING_MASTER=yes

ONBOOT=yes

BOOTPROTO=none
IPADDR=192.168.1.1
PREFIX=24

# BONDING_OPTS="bonding parameters separated by spaces"
# sets the configuration parameters for the bonding device
BONDING_OPTS="mode=4 miimon=100 lacp_rate=fast xmit_hash_policy=layer3+4"
# ETHTOOL_OPTS takes any device-specific options supported by ethtool.
ETHTOOL_OPTS="-K bond0 tso off gso off lro off"

The NAME directive is useful for naming the connection profile in NetworkManager. ONBOOT says whether the profile should be started when booting (or more generally, when auto-connecting a device).

Important

Parameters for the bonding kernel module must be specified as a space-separated list in the BONDING_OPTS="bonding parameters" directive in the ifcfg-bondN interface file. Do not specify options for the bonding device in /etc/modprobe.d/bonding.conf, or in the deprecated /etc/modprobe.conf file.

The max_bonds parameter is not interface specific and should not be set when using ifcfg-bondN files with the BONDING_OPTS directive, as this directive will cause the network scripts to create the bond interfaces as required.

For further instructions and advice on configuring the bonding module and to view the list of bonding parameters, see the "Using Channel Bonding" section of the Red Hat networking guide.
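
Once the ifcfg files are in place, the bond can be brought up with the legacy network scripts (assuming the initscripts/network-scripts tooling used throughout this note):
ifup bond0
# or restart networking as a whole:
service network restart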

thinking case

Based on the experimental bond1 above.

Goal: remove eth1 from bond1.

Steps:
1.rm -rf /etc/sysconfig/network-scripts/ifcfg-eth1
2.service network restart

Result: eth1 is still in the bond:

[root@TENCENT64 /proc/net/bonding]# cat /proc/net/bonding/bond1 | grep eth1
Slave Interface: eth1

dmesg -T | tail -n 20

### dmesg from shutting the NICs down
[Thu Dec 15 10:47:45 2022] bond1: Releasing backup interface eth0
[Thu Dec 15 10:47:45 2022] bond1: the permanent HWaddr of eth0 - 6c:92:bf:be:64:84 - is still in use by bond1 - set the HWaddr of eth0 to a different address to avoid conflicts -> at this point bond1 still has another slave (eth1), bond1 is still active and still using 6c:92:bf:be:64:84, which is eth0's real MAC address, so eth0 is first set to a temporary MAC
[Thu Dec 15 10:47:45 2022] i40e 0000:1a:00.0 eth0: already using mac address 6c:92:bf:be:64:84 -> eth0 is already down at this point, so it takes back its original MAC address without a conflict
### dmesg from bringing the NICs back up
[Thu Dec 15 10:47:46 2022] 8021q: adding VLAN 0 to HW filter on device bond1
[Thu Dec 15 10:47:46 2022] i40e 0000:1a:00.0 eth0: already using mac address 6c:92:bf:be:64:84
[Thu Dec 15 10:47:46 2022] 8021q: adding VLAN 0 to HW filter on device eth0
[Thu Dec 15 10:47:46 2022] bond1: Enslaving eth0 as a backup interface with an up link

There is no mention of eth1 at all in the dmesg output, which shows that the restart did not touch eth1 in any way, so it remains a slave of bond1.

Curiously, if the second step (service network restart) is replaced with reboot, bond1 ends up with only eth0 as a slave. In other words, rebooting the machine works, while restarting the network does not.

Why? My take:

  • Rebooting the machine initializes the bond configuration in kernel-managed memory from scratch.
  • Restarting the network only refreshes the in-kernel bond configuration with the current configuration files. /etc/sysconfig/network-scripts/ifcfg-eth1 has been deleted, so there is no new eth1 configuration to push into the in-kernel bond configuration, and the part of the bond configuration concerning eth1 stays as it was; see the sketch after this list for a way to remove it without a reboot.
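
So, to actually remove eth1 without a reboot, both the persistent and the runtime state presumably need to be touched, combining the two mechanisms shown earlier (a sketch, not re-verified on this host):
rm -f /etc/sysconfig/network-scripts/ifcfg-eth1
echo -eth1 > /sys/class/net/bond1/bonding/slaves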
