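Linux bonding binds multiple physical network interfaces (NICs) into a single logical interface. The logical interface is called the master, and the physical interfaces bound to it are called slaves.
Bonding requires the bonding kernel module and the ifenslave tool. Some Linux distributions do not build bonding by default, in which case the kernel must be recompiled. Fortunately, SUSE 10 SP2 ships with both the bonding module and the ifenslave tool; its bonding version is 3.0.3.
Use the modinfo command to check whether the distribution provides the bonding kernel module, and which version:
# modinfo bonding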
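Either the miimon option, or the arp_interval and arp_ip_target options, must be configured; otherwise, network performance will degrade severely when a NIC fails. Most NICs support miimon. To check whether a particular NIC does, use:
ethtool eth0 (eth0 is the interface name; substitute the actual name)
If the output contains the following line,
Link detected: yes
then the NIC supports miimon.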
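The check above can be automated. The following is a minimal sketch that scans ethtool output for the "Link detected" line. The function name link_detected and the sample text are illustrative assumptions, not captured output; on a real system the text would come from running ethtool against the interface.

```python
def link_detected(ethtool_output: str) -> bool:
    """Return True if the text contains a 'Link detected: yes' line."""
    for line in ethtool_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip().lower() == "link detected":
            return value.strip().lower() == "yes"
    # No such line: the driver did not report link state this way.
    return False


# Illustrative text only; real ethtool output contains more fields.
sample = "Settings for eth0:\n\tSpeed: 1000Mb/s\n\tLink detected: yes\n"
print(link_detected(sample))
```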
Specifies one of the bonding policies. The default is balance-rr (round robin). Possible values are:
Round-robin policy: Transmit packets in sequential order from the first available slave through the last. This mode provides load balancing and fault tolerance.
Active-backup policy: Only one slave in the bond is active. A different slave becomes active if, and only if, the active slave fails. The bond's MAC address is externally visible on only one port (network adapter) to avoid confusing the switch.
In bonding version 2.6.2 or later, when a failover occurs in active-backup mode, bonding will issue one or more gratuitous ARPs on the newly active slave. One gratuitous ARP is issued for the bonding master interface and each VLAN interface configured above it, provided that the interface has at least one IP address configured. Gratuitous ARPs issued for VLAN interfaces are tagged with the appropriate VLAN id.
This mode provides fault tolerance. The primary option, documented below, affects the behavior of this mode. (If the primary option designates a slave as primary, that slave is always the active slave while it is available; another slave becomes active only when the primary fails. This is useful when the slaves differ in performance.)
XOR policy: The behavior depends on the selected hash policy, which is configured through the xmit_hash_policy option. The default policy is a simple [(source MAC address XOR'd with destination MAC address) modulo slave count].
This mode provides load balancing and fault tolerance.
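To make the default XOR formula concrete, here is a minimal sketch of (source MAC XOR destination MAC) modulo slave count. It XORs the full 48-bit addresses, which is an assumption of this sketch; the real driver may use only part of each address. The helper names and MAC addresses are hypothetical.

```python
def mac_to_int(mac: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' to a 48-bit integer."""
    return int(mac.replace(":", ""), 16)


def layer2_slave_index(src_mac: str, dst_mac: str, slave_count: int) -> int:
    """(source MAC XOR destination MAC) modulo slave count."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % slave_count


src = "00:11:22:33:44:55"
for dst in ("00:aa:bb:cc:dd:01", "00:aa:bb:cc:dd:02"):
    # The same (src, dst) pair always maps to the same slave.
    print(dst, "-> slave", layer2_slave_index(src, dst, 2))
```

Because the result depends only on the two MAC addresses, all traffic to a given peer stays on one slave, while different peers may land on different slaves.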
Broadcast policy: transmits everything on all slave interfaces. This mode provides fault tolerance.
IEEE 802.3ad Dynamic link aggregation. Creates aggregation groups that share the same speed and duplex settings. Utilizes all slaves in the active aggregator according to the 802.3ad specification.
Slave selection for outgoing traffic is done according to the transmit hash policy, which may be changed from the default simple XOR policy via the xmit_hash_policy option, documented below. Note that not all transmit policies may be 802.3ad compliant, particularly in regards to the packet mis-ordering requirements of section 43.2.4 of the 802.3ad standard. Differing peer implementations will have varying tolerances for noncompliance. (In short: slave selection depends on the hash policy configured via xmit_hash_policy, and not every hash policy is suitable for 802.3ad.)
Prerequisites:
1. Ethtool support in the base drivers for retrieving the speed and duplex of each slave.
2. A switch that supports IEEE 802.3ad Dynamic link aggregation.
Most switches will require some type of configuration to enable 802.3ad mode.
Adaptive transmit load balancing: channel bonding that does not require any special switch support. The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave. Incoming traffic is received by the current slave. If the receiving slave fails, another slave takes over the MAC address of the failed receiving slave.
Prerequisite:
Ethtool support in the base drivers for retrieving the speed of each slave.
Adaptive load balancing: includes balance-tlb plus receive load balancing (rlb) for IPv4 traffic, and does not require any special switch support. The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts the ARP Replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond such that different peers use different hardware addresses for the server.
Receive traffic from connections created by the server is also balanced. When the local system sends an ARP Request the bonding driver copies and saves the peer's IP information from the ARP packet. When the ARP Reply arrives from the peer, its hardware address is retrieved and the bonding driver initiates an ARP reply to this peer assigning it to one of the slaves in the bond. A problematic outcome of using ARP negotiation for balancing is that each time that an ARP request is broadcast it uses the hardware address of the bond. Hence, peers learn the hardware address of the bond and the balancing of receive traffic collapses to the current slave. This is handled by sending updates (ARP Replies) to all the peers with their individually assigned hardware address such that the traffic is redistributed. Receive traffic is also redistributed when a new slave is added to the bond and when an inactive slave is re-activated. The receive load is distributed sequentially (round robin) among the group of highest speed slaves in the bond.
When a link is reconnected or a new slave joins the bond the receive traffic is redistributed among all active slaves in the bond by initiating ARP Replies with the selected MAC address to each of the clients. The updelay parameter (detailed below) must be set to a value equal to or greater than the switch's forwarding delay so that the ARP Replies sent to the peers will not be blocked by the switch.
Prerequisites:
1. Ethtool support in the base drivers for retrieving the speed of each slave.
2. Base driver support for setting the hardware address of a device while it is open. This is required so that there will always be one slave in the team using the bond hardware address (the curr_active_slave) while having a unique hardware address for each slave in the bond. If the curr_active_slave fails its hardware address is swapped with the new curr_active_slave that was chosen.
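The round-robin receive assignment described for this mode can be illustrated with a small sketch: peers are handed out in turn among the slaves that share the highest speed. This is a simplified model and not the driver's implementation; the function name, interface names, speeds and peer addresses are all hypothetical.

```python
from itertools import cycle


def assign_peers(peers, slave_speeds):
    """Map each peer to a slave, round robin over the fastest slaves only."""
    top_speed = max(slave_speeds.values())
    fastest = [name for name, speed in slave_speeds.items() if speed == top_speed]
    next_slave = cycle(fastest)
    return {peer: next(next_slave) for peer in peers}


speeds = {"eth0": 1000, "eth1": 1000, "eth2": 100}  # Mb/s, illustrative
print(assign_peers(["10.0.0.2", "10.0.0.3", "10.0.0.4"], speeds))
```

In this model the slower eth2 never receives an assignment, which mirrors the statement that the receive load is distributed among the group of highest speed slaves.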
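The arp_interval option specifies the ARP link monitoring frequency, in milliseconds.
The ARP monitor periodically sends ARP messages to the target IP addresses configured by the arp_ip_target option in order to check the state of each slave. The exact behavior depends on the arp_validate option (available only since bonding version 3.1.0).
If ARP monitoring is used in an etherchannel compatible mode (modes 0 and 2), the switch should be configured in a mode that evenly distributes packets across all links. If the switch is configured to distribute the packets in an XOR fashion, all replies from the ARP targets will be received on the same link which could cause the other team members to fail. ARP monitoring should not be used in conjunction with miimon. A value of 0 disables ARP monitoring. The default value is 0.
The arp_ip_target option specifies the target IP addresses for the ARP monitor and takes effect only when arp_interval > 0. When arp_interval > 0, at least one IP address must be configured; multiple addresses are separated by commas, up to a maximum of 16. By default, no IP addresses are configured.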
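The constraints on the ARP monitor targets (a positive interval, between 1 and 16 comma-separated targets) can be checked before a configuration line is written. The following is a minimal sketch; the helper name arp_options is hypothetical and the addresses are examples only.

```python
import ipaddress

MAX_ARP_TARGETS = 16  # limit stated in these notes


def arp_options(arp_interval_ms, targets):
    """Build the options fragment for ARP monitoring, validating the inputs."""
    if arp_interval_ms <= 0:
        raise ValueError("arp_interval must be > 0 to enable ARP monitoring")
    if not 1 <= len(targets) <= MAX_ARP_TARGETS:
        raise ValueError("need between 1 and 16 arp_ip_target addresses")
    for target in targets:
        ipaddress.IPv4Address(target)  # raises ValueError if malformed
    return f"arp_interval={arp_interval_ms} arp_ip_target={','.join(targets)}"


print(arp_options(60, ["192.168.0.1", "192.168.0.3", "192.168.0.9"]))
```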
This option was added in bonding version 3.1.0. Specifies whether or not ARP probes and replies should be validated in the active-backup mode. This causes the ARP monitor to examine the incoming ARP requests and replies, and only consider a slave to be up if it is receiving the appropriate ARP traffic.
Possible values are:
none or 0
No validation is performed. This is the default.
active or 1
Validation is performed only for the active slave.
backup or 2
Validation is performed only for backup slaves.
all or 3
Validation is performed for all slaves.
For the active slave, the validation checks ARP replies to confirm that they were generated by an arp_ip_target. Since backup slaves do not typically receive these replies, the validation performed for backup slaves is on the ARP request sent out via the active slave. It is possible that some switch or network configurations may result in situations wherein the backup slaves do not receive the ARP requests; in such a situation, validation of backup slaves must be disabled.
This option is useful in network configurations in which multiple bonding hosts are concurrently issuing ARPs to one or more targets beyond a common switch. Should the link between the switch and target fail (but not the switch itself), the probe traffic generated by the multiple bonding instances will fool the standard ARP monitor into considering the links as still up. Use of the arp_validate option can resolve this, as the ARP monitor will only consider ARP requests and replies associated with its own instance of bonding.
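The four arp_validate values listed above can be read as a two-bit mask: one bit for the active slave, one for backup slaves. Treating them this way is an assumption of this sketch, chosen because it matches the listed numbers (1, 2, and 3 = 1 + 2); the function name is hypothetical.

```python
ARP_VALIDATE = {"none": 0, "active": 1, "backup": 2, "all": 3}


def is_validated(setting: str, slave_is_active: bool) -> bool:
    """Report whether ARP validation applies to a slave in the given role."""
    value = ARP_VALIDATE[setting]
    role_bit = 1 if slave_is_active else 2
    return bool(value & role_bit)


for name in ARP_VALIDATE:
    print(name, is_validated(name, True), is_validated(name, False))
```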
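The downdelay option specifies the time, in milliseconds, to wait before disabling a slave after a link failure has been detected. This option is only valid for the miimon link monitor. The value should be a multiple of the miimon value; if it is not, it is rounded down to the nearest multiple. The default value is 0.
The updelay option specifies the time, in milliseconds, to wait before enabling a slave after a link recovery has been detected. This option is only valid for the miimon link monitor. The value should be a multiple of the miimon value; if it is not, it is rounded down to the nearest multiple. The default value is 0.
Option specifying the rate in which we'll ask our link partner to transmit LACPDU (Link Aggregation Control Protocol Data Unit, the packets of the LACP protocol) packets in 802.3ad mode. Possible values are: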
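The rounding rule for these two delays can be shown in a few lines. This sketch assumes the delays correspond to the driver's downdelay and updelay options and that a value which is not a multiple of miimon is rounded down, as described above; the function name is hypothetical.

```python
def effective_delay(delay_ms: int, miimon_ms: int) -> int:
    """Round delay_ms down to the nearest multiple of miimon_ms."""
    if miimon_ms <= 0:
        raise ValueError("miimon must be > 0 for these delays to apply")
    return (delay_ms // miimon_ms) * miimon_ms


print(effective_delay(250, 100))  # 200
print(effective_delay(300, 100))  # 300
```

Note that with miimon=100, a requested delay below 100 rounds down to 0, meaning no delay at all.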
slow or 0
Request partner to transmit LACPDUs every 30 seconds
fast or 1
Request partner to transmit LACPDUs every 1 second
The default is slow.
Specifies the number of bonding devices to create for this
instance of the bonding driver. E.g., if max_bonds is 3, and
the bonding driver is not already loaded, then bond0, bond1
and bond2 will be created. The default value is 1.
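The miimon option specifies the MII link monitoring frequency, in milliseconds. This value determines how often the link state of each slave is inspected. A value of 0 disables MII link monitoring; if MII monitoring is enabled, a value of 100 or more is recommended. The use_carrier option affects how the link state is determined. See the High Availability section for additional information. The default value is 0.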
Specifies whether or not miimon should use MII or ETHTOOL ioctls vs. netif_carrier_ok() to determine the link status. The MII or ETHTOOL ioctls are less efficient and utilize a deprecated calling sequence within the kernel. The netif_carrier_ok() relies on the device driver to maintain its state with netif_carrier_on/off; at this writing, most, but not all, device drivers support this facility.
If bonding insists that the link is up when it should not be, it may be that your network device driver does not support netif_carrier_on/off. The default state for netif_carrier is "carrier on," so if a driver does not support netif_carrier, it will appear as if the link is always up. In this case, setting use_carrier to 0 will cause bonding to revert to the MII / ETHTOOL ioctl method to determine the link state.
A value of 1 enables the use of netif_carrier_ok(), a value of 0 will use the deprecated MII / ETHTOOL ioctls. The default value is 1.
A string (eth0, eth2, etc) specifying which slave is the primary device. The specified device will always be the active slave while it is available. Only when the primary is off-line will alternate devices be used. This is useful when one slave is preferred over another, e.g., when one slave has higher throughput than another.
The primary option is only valid for active-backup mode.
Selects the transmit hash policy to use for slave selection in balance-xor and 802.3ad modes. Possible values are:
layer2
Uses XOR of hardware MAC addresses to generate the hash. The formula is (source MAC XOR destination MAC) modulo slave count This algorithm will place all traffic to a particular network peer on the same slave. This algorithm is 802.3ad compliant.
layer3+4
This policy uses upper layer protocol information, when available, to generate the hash. This allows for traffic to a particular network peer to span multiple slaves, although a single connection will not span multiple slaves.
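The effect of the layer3+4 policy can be sketched as follows. The notes above do not give its formula; this sketch assumes the one published in the kernel bonding documentation of this era for unfragmented TCP and UDP packets, ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo slave count. The function name, addresses and ports are hypothetical.

```python
import ipaddress


def layer34_slave_index(src_ip, dst_ip, src_port, dst_port, slave_count):
    """Hash on IP addresses and ports, so connections spread across slaves."""
    ip_xor = int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % slave_count


# Two connections to the same peer, differing only in source port.
a = layer34_slave_index("10.0.0.1", "10.0.0.2", 40000, 80, 2)
b = layer34_slave_index("10.0.0.1", "10.0.0.2", 40001, 80, 2)
print(a, b)
```

The two connections map to different slaves, while every packet of a single connection keeps the same inputs and therefore stays on one slave.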
N/A
For this section, "switch" refers to whatever system the bonded devices are directly connected to (i.e., where the other end of the cable plugs into). This may be an actual dedicated switch device, or it may be another regular system (e.g., another computer running Linux).
The active-backup, balance-tlb and balance-alb modes do not require any specific configuration of the switch. (No switch configuration is needed.)
The 802.3ad mode requires that the switch have the appropriate ports configured as an 802.3ad aggregation. The precise method used to configure this varies from switch to switch, but, for example, a Cisco 3550 series switch requires that the appropriate ports first be grouped together in a single etherchannel instance, then that etherchannel is set to mode "lacp" to enable 802.3ad (instead of standard EtherChannel). (Switch configuration is required.)
The balance-rr, balance-xor and broadcast modes generally require that the switch have the appropriate ports grouped together. The nomenclature for such a group differs between switches; it may be called an "etherchannel" (as in the Cisco example, above), a "trunk group" or some other similar variation. For these modes, each switch will also have its own configuration options for the switch's transmit policy to the bond. Typical choices include XOR of either the MAC or IP addresses. The transmit policy of the two peers does not need to match. For these three modes, the bonding mode really selects a transmit policy for an EtherChannel group; all three will interoperate with another EtherChannel group. (Switch configuration is required.)
The bonding driver at present supports two schemes for monitoring a slave device's link state: the ARP monitor and the MII monitor.
At the present time, due to implementation restrictions in the bonding driver itself, it is not possible to enable both ARP and MII monitoring simultaneously.
The ARP monitor operates as its name suggests: it sends ARP queries to one or more designated peer systems on the network, and uses the response as an indication that the link is operating. This gives some assurance that traffic is actually flowing to and from one or more peers on the local network.
The ARP monitor relies on the device driver itself to verify that traffic is flowing. In particular, the driver must keep up to date the last receive time, dev->last_rx, and transmit start time, dev->trans_start. If these are not updated by the driver, then the ARP monitor will immediately fail any slaves using that driver, and those slaves will stay down. If networking monitoring (tcpdump, etc) shows the ARP requests and replies on the network, then it may be that your device driver is not updating last_rx and trans_start.
While ARP monitoring can be done with just one target, it can be useful in a High Availability setup to have several targets to monitor. In the case of just one target, the target itself may go down or have a problem making it unresponsive to ARP requests. Having an additional target (or several) increases the reliability of the ARP monitoring.
Multiple ARP targets must be separated by commas as follows:
# example options for ARP monitoring with three targets
alias bond0 bonding
options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9
For just a single target the options would resemble:
# example options for ARP monitoring with one target
alias bond0 bonding
options bond0 arp_interval=60 arp_ip_target=192.168.0.100
The MII monitor monitors only the carrier state of the local network interface. It accomplishes this in one of three ways: by depending upon the device driver to maintain its carrier state, by querying the device's MII registers, or by making an ethtool query to the device.
If the use_carrier module parameter is 1 (the default value), then the MII monitor will rely on the driver for carrier state information (via the netif_carrier subsystem). As explained in the use_carrier parameter information, above, if the MII monitor fails to detect carrier loss on the device (e.g., when the cable is physically disconnected), it may be that the driver does not support netif_carrier. (This symptom is one way to tell whether a NIC driver supports netif_carrier.)
If use_carrier is 0, then the MII monitor will first query the device's MII registers (via ioctl) and check the link state. If that request fails (not just that it returns carrier down), then the MII monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain the same information. If both methods fail (i.e., the driver either does not support or had some error in processing both the MII register and ethtool requests), then the MII monitor will assume the link is up.
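The decision order described above (driver carrier state when use_carrier is 1; otherwise MII registers, then ethtool, then assume up) can be modeled as a small function. This is a sketch of the logic only, not driver code: the probes are hypothetical callables that return True, False, or None when the request itself fails.

```python
from typing import Callable, Optional

# A probe returns True (link up), False (link down), or None when the
# request failed or is unsupported by the driver.
Probe = Callable[[], Optional[bool]]


def mii_link_up(use_carrier: int, carrier_ok: Callable[[], bool],
                mii_probe: Probe, ethtool_probe: Probe) -> bool:
    if use_carrier == 1:
        # Trust the carrier state maintained by the driver.
        return carrier_ok()
    for probe in (mii_probe, ethtool_probe):
        result = probe()
        if result is not None:
            return result
    # Both queries failed: the monitor assumes the link is up.
    return True


# MII query fails, ethtool reports the link as down.
print(mii_link_up(0, lambda: True, lambda: None, lambda: False))
```

The final fallback is worth noting: a driver that supports neither query will always appear to have a working link.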
When bonding is configured, it is important that the slave devices not have routes that supersede routes of the master (or, generally, not have routes at all). For example, suppose the bonding device bond0 has two slaves, eth0 and eth1, and the routing table is as follows:
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.0.0.0        0.0.0.0         255.255.0.0     U        40 0          0 eth0
10.0.0.0        0.0.0.0         255.255.0.0     U        40 0          0 eth1
10.0.0.0        0.0.0.0         255.255.0.0     U        40 0          0 bond0
127.0.0.0       0.0.0.0         255.0.0.0       U        40 0          0 lo
This routing configuration will likely still update the receive/transmit times in the driver (needed by the ARP monitor), but may bypass the bonding driver (because outgoing traffic to, in this case, another host on network 10 would use eth0 or eth1 before bond0).
The ARP monitor (and ARP itself) may become confused by this configuration, because ARP requests (generated by the ARP monitor) will be sent on one interface (bond0), but the corresponding reply will arrive on a different interface (eth0). This reply looks to ARP as an unsolicited ARP reply (because ARP matches replies on an interface basis), and is discarded. The MII monitor is not affected by the state of the routing table.
The solution here is simply to ensure that slaves do not have routes of their own, and if for some reason they must, those routes do not supersede routes of their master. This should generally be the case, but unusual configurations or errant manual or automatic static route additions may cause trouble.
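A quick way to spot the problem is to list every route that is bound directly to a slave device. The sketch below works on (destination, interface) pairs that mirror the example routing table; the function name is hypothetical, and on a real system the pairs would have to be parsed from the actual routing table.

```python
def slave_routes(routes, slaves):
    """Return the (destination, iface) pairs bound directly to a slave."""
    return [(dest, iface) for dest, iface in routes if iface in slaves]


routes = [
    ("10.0.0.0", "eth0"),
    ("10.0.0.0", "eth1"),
    ("10.0.0.0", "bond0"),
    ("127.0.0.0", "lo"),
]
print(slave_routes(routes, {"eth0", "eth1"}))
```

An empty result is the desired state: only the master (bond0) should carry routes.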
On systems with network configuration scripts that do not associate physical devices directly with network interface names (so that the same physical device always has the same "ethX" name), it may be necessary to add some special logic to either /etc/modules.conf or /etc/modprobe.conf (depending upon which is installed on the system).
For example, given a modules.conf containing the following:
alias bond0 bonding
options bond0 mode=some-mode miimon=50
alias eth0 tg3
alias eth1 tg3
alias eth2 e1000
alias eth3 e1000
If neither eth0 nor eth1 is a slave to bond0, then when the bond0 interface comes up, the devices may end up reordered. This happens because bonding is loaded first, then its slave devices' drivers are loaded next. Since no other drivers have been loaded, when the e1000 driver loads, it will receive eth0 and eth1 for its devices, but the bonding configuration tries to enslave eth2 and eth3 (which may later be assigned to the tg3 devices).
Adding the following:
add above bonding e1000 tg3
causes modprobe to load e1000 then tg3, in that order, when bonding is loaded. This command is fully documented in the modules.conf manual page.
On systems utilizing modprobe.conf (or modprobe.conf.local), an equivalent problem can occur. In this case, the following can be added to modprobe.conf (or modprobe.conf.local, as appropriate), all on one line (it has been split here for clarity):
install bonding /sbin/modprobe tg3; /sbin/modprobe e1000;
/sbin/modprobe --ignore-install bonding
This will, when loading the bonding module, rather than performing the normal action, instead execute the provided command. This command loads the device drivers in the order needed, then calls modprobe with --ignore-install to cause the normal action to then take place. Full documentation on this can be found in the modprobe.conf and modprobe manual pages.
By default, bonding enables the use_carrier option, which instructs bonding to trust the driver to maintain carrier state.
As discussed in the options section, above, some drivers do not support the netif_carrier_on/_off link state tracking system. With use_carrier enabled, bonding will always see these links as up, regardless of their actual state.
Additionally, other drivers do support netif_carrier, but do not maintain it in real time, e.g., only polling the link state at some fixed interval. In this case, miimon will detect failures, but only after some long period of time has expired. If it appears that miimon is very slow in detecting link failures, try specifying use_carrier=0 to see if that improves the failure detection time. If it does, then it may be that the driver checks the carrier state at a fixed interval, but does not cache the MII register values (so the use_carrier=0 method of querying the registers directly works). If use_carrier=0 does not improve the failover, then the driver may cache the registers, or the problem may be elsewhere.
Also, remember that miimon only checks for the device's carrier state. It has no way to determine the state of devices on or beyond other ports of a switch, or if a switch is refusing to pass traffic while still maintaining carrier on.
If running SNMP agents, the bonding driver should be loaded before any network drivers participating in a bond. (Open question: how should this load order be arranged?) This requirement is due to the interface index (ipAdEntIfIndex) being associated to the first interface found with a given IP address. That is, there is only one ipAdEntIfIndex for each IP address. For example, if eth0 and eth1 are slaves of bond0 and the driver for eth0 is loaded before the bonding driver, the interface for the IP address will be associated with the eth0 interface. This configuration is shown below: the IP address 192.168.1.1 has an interface index of 2, which indexes to eth0 in the ifDescr table (ifDescr.2).
interfaces.ifTable.ifEntry.ifDescr.1 = lo
interfaces.ifTable.ifEntry.ifDescr.2 = eth0
interfaces.ifTable.ifEntry.ifDescr.3 = eth1
interfaces.ifTable.ifEntry.ifDescr.4 = eth2
interfaces.ifTable.ifEntry.ifDescr.5 = eth3
interfaces.ifTable.ifEntry.ifDescr.6 = bond0
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
This problem is avoided by loading the bonding driver before any network drivers participating in a bond. Below is an example of loading the bonding driver first; the IP address 192.168.1.1 is correctly associated with ifDescr.2.
interfaces.ifTable.ifEntry.ifDescr.1 = lo
interfaces.ifTable.ifEntry.ifDescr.2 = bond0
interfaces.ifTable.ifEntry.ifDescr.3 = eth0
interfaces.ifTable.ifEntry.ifDescr.4 = eth1
interfaces.ifTable.ifEntry.ifDescr.5 = eth2
interfaces.ifTable.ifEntry.ifDescr.6 = eth3
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
While some distributions may not report the interface name in ifDescr, the association between the IP address and IfIndex remains and SNMP functions such as Interface_Scan_Next will report that association.
(In promiscuous mode, the NIC receives all network packets, not only those addressed to the local host.)
When running network monitoring tools, e.g., tcpdump, it is common to enable promiscuous mode on the device, so that all traffic is seen (instead of seeing only traffic destined for the local host). The bonding driver handles promiscuous mode changes to the bonding master device (e.g., bond0), and propagates the setting to the slave devices.
For the balance-rr, balance-xor, broadcast, and 802.3ad modes, the promiscuous mode setting is propagated to all slaves.
For the active-backup, balance-tlb and balance-alb modes, the promiscuous mode setting is propagated only to the active slave.
For balance-tlb mode, the active slave is the slave currently receiving inbound traffic.
For balance-alb mode, the active slave is the slave used as a "primary." This slave is used for mode-specific control traffic, for sending to peers that are unassigned or if the load is unbalanced.
For the active-backup, balance-tlb and balance-alb modes, when the active slave changes (e.g., due to a link failure), the promiscuous setting will be propagated to the new active slave.
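The propagation rules above reduce to a lookup on the bonding mode. The following minimal sketch returns the slaves that would receive the promiscuous setting; the function name and interface names are hypothetical.

```python
ALL_SLAVES_MODES = {"balance-rr", "balance-xor", "broadcast", "802.3ad"}
ACTIVE_ONLY_MODES = {"active-backup", "balance-tlb", "balance-alb"}


def promisc_targets(mode, slaves, active_slave):
    """Return the slaves that receive the promiscuous mode setting."""
    if mode in ALL_SLAVES_MODES:
        return list(slaves)
    if mode in ACTIVE_ONLY_MODES:
        return [active_slave]
    raise ValueError(f"unknown bonding mode: {mode}")


print(promisc_targets("802.3ad", ["eth0", "eth1"], "eth0"))
print(promisc_targets("active-backup", ["eth0", "eth1"], "eth1"))
```

When the active slave changes, calling the function again with the new active slave gives the device to which the setting moves.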
High Availability refers to configurations that provide maximum network availability by having redundant or backup devices, links or switches between the host and the rest of the world. The goal is to provide the maximum availability of network connectivity (i.e., the network always works), even though other configurations could provide higher throughput.
If two hosts (or a host and a single switch) are directly connected via multiple physical links, then there is no availability penalty to optimizing for maximum bandwidth. In this case, there is only one switch (or peer), so if it fails, there is no alternative access to fail over to. Additionally, the bonding load balance modes support link monitoring of their members, so if individual links fail, the load will be rebalanced across the remaining devices.
See Section 13, "Configuring Bonding for Maximum Throughput" for information on configuring bonding with one peer device.
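(With all links connected to the same switch, optimizing for bandwidth costs nothing in availability; in the load balancing modes, if one link fails, the load is moved to the remaining links.)
(For the multiple switch topology described next, only the active-backup and broadcast modes are suitable.)
With multiple switches, the configuration of bonding and the network changes dramatically. In multiple switch topologies, there is a trade off between network availability and usable bandwidth.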
Below is a sample network, configured to maximize the availability of the network:
      |                                     |
      |port3                           port3|
+-----+----+                          +-----+----+
|          |port2       ISL      port2|          |
| switch A +--------------------------+ switch B |
|          |                          |          |
+-----+----+                          +-----++---+
      |port1                           port1|
      |             +-------+               |
      +-------------+ host1 +---------------+
               eth0 +-------+ eth1
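In this configuration, there is a link between the two switches (ISL, or inter switch link), and multiple ports connecting to the outside world ("port3" on each switch). There is no technical reason that this could not be extended to a third switch. (Note: Cisco also uses the name ISL for a proprietary trunking protocol; the comparable standard, IEEE 802.1Q, is not compatible with it. Here, ISL simply denotes the link between the two switches.)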
-------------------------------------------------------------
In a topology such as the example above, the active-backup and broadcast modes are the only useful bonding modes when optimizing for availability; the other modes require all links to terminate on the same peer for them to behave rationally.
active-backup: This is generally the preferred mode, particularly if the switches have an ISL and play together well. If the network configuration is such that one switch is specifically a backup switch (e.g., has lower capacity, higher cost, etc), then the primary option can be used to ensure that the preferred link is always used when it is available.
broadcast: This mode is really a special purpose mode, and is suitable only for very specific needs. For example, if the two switches are not connected (no ISL), and the networks beyond them are totally independent. In this case, if it is necessary for some specific one-way traffic to reach both independent networks, then the broadcast mode may be suitable. (Broadcast mode sends every packet on all slave interfaces, which suits one-way traffic that must reach several independent networks.)
The choice of link monitoring ultimately depends upon your switch. If the switch can reliably fail ports in response to other failures, then either the MII or ARP monitors should work. For example, in the above example, if the "port3" link fails at the remote end, the MII monitor has no direct means to detect this. The ARP monitor could be configured with a target at the remote end of port3, thus detecting that failure without switch support. (In short: the MII monitor cannot detect a link failure on the far side of the switch, whereas the ARP monitor can.)
In general, however, in a multiple switch topology, the ARP monitor can provide a higher level of reliability in detecting end to end connectivity failures (which may be caused by the failure of any individual component to pass traffic for any reason). Additionally, the ARP monitor should be configured with multiple targets (at least one for each switch in the network). This will ensure that, regardless of which switch is active, the ARP monitor has a suitable target to query.
In a single switch configuration, the best method to maximize throughput depends upon the application and network environment. The various load balancing modes each have strengths and weaknesses in different environments, as detailed below.
For this discussion, we will break down the topologies into two categories. Depending upon the destination of most traffic, we categorize them into either "gatewayed" or "local" configurations.
In a gatewayed configuration, the "switch" is acting primarily as a router, and the majority of traffic passes through this router to other networks. An example would be the following:
+----------+                     +----------+
|          |eth0            port1|          | to other networks
| Host A   +---------------------+ router   +------------------->
|          +---------------------+          | Hosts B and C are out
|          |eth1            port2|          | here somewhere
+----------+                     +----------+
The router may be a dedicated router device, or another host acting as a gateway. For our discussion, the important point is that the majority of traffic from Host A will pass through the router to some other network before reaching its final destination.
In a gatewayed network configuration, although Host A may communicate with many other systems, all of its traffic will be sent and received via one other peer on the local network, the router.
Note that the case of two systems connected directly via multiple physical links is, for purposes of configuring bonding, the same as a gatewayed configuration. In that case, it happens that all traffic is destined for the "gateway" itself, not some other network beyond the gateway.
In a local configuration, the "switch" is acting primarily as a switch, and the majority of traffic passes through this switch to reach other stations on the same network. An example would be the following:
+----------+            +----------+       +--------+
|          |eth0   port1|          +-------+ Host B |
|  Host A  +------------+  switch  |port3  +--------+
|          +------------+          |                  +--------+
|          |eth1   port2|          +------------------+ Host C |
+----------+            +----------+port4             +--------+
Again, the switch may be a dedicated switch device, or another host acting as a gateway. For our discussion, the important point is that the majority of traffic from Host A is destined for other hosts on the same local network (Hosts B and C in the above example).
In summary, in a gatewayed configuration, traffic to and from the bonded device will be to the same MAC level peer on the network (the gateway itself, i.e., the router), regardless of its final destination. In a local configuration, traffic flows directly to and from the final destinations, thus, each destination (Host B, Host C) will be addressed directly by their individual MAC addresses.
This distinction between a gatewayed and a local network configuration is important because many of the load balancing modes available use the MAC addresses of the local network source and destination to make load balancing decisions. The behavior of each mode is described below.
This configuration is the easiest to set up and to understand, although you will have to decide which bonding mode best suits your needs. The trade offs for each mode are detailed below:
balance-rr: This mode is the only mode that will permit a single TCP/IP connection to stripe traffic across multiple interfaces. It is therefore the only mode that will allow a single TCP/IP stream to utilize more than one interface's worth of throughput. This comes at a cost, however: the striping often results in peer systems receiving packets out of order, causing TCP/IP's congestion control system to kick in, often by retransmitting segments.
It is possible to adjust TCP/IP's congestion limits by altering the net.ipv4.tcp_reordering sysctl parameter. The usual default value is 3, and the maximum useful value is 127. For a four interface balance-rr bond, expect that a single TCP/IP stream will utilize no more than approximately 2.3 interface's worth of throughput, even after adjusting tcp_reordering.
Note that this out of order delivery occurs when both the sending and receiving systems are utilizing a multiple interface bond. Consider a configuration in which a balance-rr bond feeds into a single higher capacity network channel (e.g., multiple 100Mb/sec ethernets feeding a single gigabit ethernet via an etherchannel capable switch). In this configuration, traffic sent from the multiple 100Mb devices to a destination connected to the gigabit device will not see packets out of order. However, traffic sent from the gigabit device to the multiple 100Mb devices may or may not see traffic out of order, depending upon the balance policy of the switch. Many switches do not support any modes that stripe traffic (instead choosing a port based upon IP or MAC level addresses); for those devices, traffic flowing from the gigabit device to the many 100Mb devices will only utilize one interface. (Annotator's note: ideally, the switches in our deployment are of this kind.)
If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order delivery, then this mode can allow for single stream datagram performance that scales near linearly as interfaces are added to the bond.
This mode requires the switch to have the appropriate ports configured for "etherchannel" or "trunking."
active-backup: There is not much advantage in this network topology to the active-backup mode, as the inactive backup devices are all connected to the same peer as the primary. In this case, a load balancing mode (with link monitoring) will provide the same level of network availability, but with increased available bandwidth. On the plus side, active-backup mode does not require any configuration of the switch, so it may have value if the hardware available does not support any of the load balance modes.
balance-xor: This mode will limit traffic such that packets destined for specific peers will always be sent over the same interface. Since the destination is determined by the MAC addresses involved, this mode works best in a "local" network configuration (as described above), with destinations all on the same local network. This mode is likely to be suboptimal if all your traffic is passed through a single router (i.e., a "gatewayed" network configuration, as described above).
As with balance-rr, the switch ports need to be configured for "etherchannel" or "trunking."
broadcast: Like active-backup, there is not much advantage to this mode in this type of network topology.
802.3ad: This mode can be a good choice for this type of network topology. The 802.3ad mode is an IEEE standard, so all peers that implement 802.3ad should interoperate well. The 802.3ad protocol includes automatic configuration of the aggregates, so minimal manual configuration of the switch is needed (typically only to designate that some set of devices is available for 802.3ad). The 802.3ad standard also mandates that frames be delivered in order (within certain limits), so in general single connections will not see misordering of packets. The 802.3ad mode does have some drawbacks: the standard mandates that all devices in the aggregate operate at the same speed and duplex. Also, as with all bonding load balance modes other than balance-rr, no single connection will be able to utilize more than a single interface's worth of bandwidth.
Additionally, the Linux bonding 802.3ad implementation distributes traffic by peer (using an XOR of MAC addresses), so in a "gatewayed" configuration, all outgoing traffic will generally use the same device. Incoming traffic may also end up on a single device, but that is dependent upon the balancing policy of the peer's 802.3ad implementation. In a "local" configuration, traffic will be distributed across the devices in the bond.
Finally, the 802.3ad mode mandates the use of the MII monitor; therefore, the ARP monitor is not available in this mode.
balance-tlb: The balance-tlb mode balances outgoing traffic by peer. Since the balancing is done according to MAC address, in a "gatewayed" configuration (as described above), this mode will send all traffic across a single device. However, in a "local" network configuration, this mode balances multiple local network peers across devices in a vaguely intelligent manner (not a simple XOR as in balance-xor or 802.3ad mode), so that mathematically unlucky MAC addresses (i.e., ones that XOR to the same value) will not all "bunch up" on a single interface.
Unlike 802.3ad, interfaces may be of differing speeds, and no special switch configuration is required. On the down side, all incoming traffic arrives over a single interface in this mode, the network device driver of the slave interfaces must provide certain ethtool support, and the ARP monitor is not available.
balance-alb: This mode is everything that balance-tlb is, and more. It has all of the features (and restrictions) of balance-tlb, and will also balance incoming traffic from local network peers (as described in the Bonding Module Options section, above).
The only additional down side to this mode is that the network device driver must support changing the hardware address while the device is open.
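The difference between the gatewayed and local cases can be made concrete with a small model. The sketch below is only an illustration, not driver code: it applies the simple default formula quoted earlier, (source MAC XOR destination MAC) modulo slave count, to whole 48-bit addresses, and the driver's real hash may differ in detail. The MAC addresses, the two-slave bond, and the helper names mac_to_int and xor_slave are all assumptions made up for the example.

```python
from itertools import cycle


def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to a 48-bit integer."""
    return int(mac.replace(":", ""), 16)


def xor_slave(src_mac, dst_mac, slave_count):
    """Simplified model: (src MAC XOR dst MAC) modulo slave count."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % slave_count


# Hypothetical addresses, chosen only for illustration.
HOST_A = "00:16:3e:00:00:0a"
GATEWAY = "00:16:3e:00:00:01"
LOCAL_PEERS = ["00:16:3e:00:00:0b", "00:16:3e:00:00:0c", "00:16:3e:00:00:0d"]
SLAVES = 2

# Gatewayed: every frame is addressed to the gateway's MAC, whatever the
# final IP destination is, so every frame maps to the same slave.
gatewayed = {xor_slave(HOST_A, GATEWAY, SLAVES) for _ in range(100)}

# Local: each peer has its own MAC, so peers can map to different slaves.
local = {peer: xor_slave(HOST_A, peer, SLAVES) for peer in LOCAL_PEERS}

# balance-rr ignores addresses and simply rotates slaves per packet.
rr = cycle(range(SLAVES))
round_robin = [next(rr) for _ in range(4)]

print("gatewayed slaves used:", gatewayed)
print("local peer -> slave:", local)
print("balance-rr slave per packet:", round_robin)
```

In this model the gatewayed traffic never leaves a single slave, the local peers spread over both slaves, and only the round-robin rotation spreads the packets of one flow across interfaces.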
The choice of link monitoring may largely depend upon which mode you choose to use. The more advanced load balancing modes do not support the use of the ARP monitor, and are thus restricted to using the MII monitor (which does not provide as high a level of end to end assurance as the ARP monitor).
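As a rough aid, the constraints stated in this document can be collected into a small consistency check. This is a sketch under stated assumptions: check_monitoring is a hypothetical helper, not part of any bonding tool, and it encodes only what the text says (some form of link monitoring must be configured, arp_interval needs arp_ip_target, and 802.3ad, balance-tlb and balance-alb cannot use the ARP monitor).

```python
# Modes that, according to the text above, cannot use the ARP monitor.
MII_ONLY_MODES = {"802.3ad", "balance-tlb", "balance-alb"}


def check_monitoring(mode, miimon=0, arp_interval=0, arp_ip_target=()):
    """Return a list of problems with a set of bonding options (empty if none).

    Hypothetical helper for illustration; parameter names mirror the
    bonding module options of the same names.
    """
    problems = []
    uses_mii = miimon > 0
    uses_arp = arp_interval > 0
    if not uses_mii and not uses_arp:
        problems.append("no link monitoring configured")
    if uses_arp and not arp_ip_target:
        problems.append("arp_interval set without arp_ip_target")
    if uses_arp and mode in MII_ONLY_MODES:
        problems.append(f"{mode} does not support the ARP monitor")
    return problems


print(check_monitoring("active-backup", arp_interval=1000,
                       arp_ip_target=("10.0.0.1",)))
print(check_monitoring("802.3ad", arp_interval=1000,
                       arp_ip_target=("10.0.0.1",)))
print(check_monitoring("balance-rr"))
```

The interval values and the target address are placeholders; suitable values depend on the network.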
Multiple switches may be utilized to optimize for throughput when they are configured in parallel as part of an isolated network between two or more systems, for example:
                +-----------+
                |  Host A   |
                +-+---+---+-+
                  |   |   |
         +--------+   |   +---------+
         |            |             |
  +------+---+  +-----+----+  +-----+----+
  | Switch A |  | Switch B |  | Switch C |
  +------+---+  +-----+----+  +-----+----+
         |            |             |
         +--------+   |   +---------+
                  |   |   |
                +-+---+---+-+
                |  Host B   |
                +-----------+
In this configuration, the switches are isolated from one another. One reason to employ a topology such as this is cost: for an isolated network with many hosts (a cluster configured for high performance, for example), using multiple smaller switches can be more cost effective than a single larger switch. For example, on a network with 24 hosts, three 24-port switches can be significantly less expensive than a single 72-port switch.
If access beyond the network is required, an individual host can be equipped with an additional network device connected to an external network; this host then additionally acts as a gateway.
In actual practice, the bonding mode typically employed in configurations of this type is balance-rr. Historically, in this network configuration, the usual caveats about out of order packet delivery are mitigated by the use of network adapters that do not do any kind of packet coalescing (via the use of NAPI, or because the device itself does not generate interrupts until some number of packets has arrived). When employed in this fashion, the balance-rr mode allows individual connections between two hosts to effectively utilize greater than one interface's bandwidth.
Again, in actual practice, the MII monitor is most often used in this configuration, as performance is given preference over availability. The ARP monitor will function in this topology, but its advantages over the MII monitor are mitigated by the volume of probes needed as the number of systems involved grows (remember that each host in the network is configured with bonding).
Some switches exhibit undesirable behavior with regard to the timing of link up and down reporting by the switch.
First, when a link comes up, some switches may indicate that the link is up (carrier available), but not pass traffic over the interface for some period of time. This delay is typically due to some type of autonegotiation or routing protocol, but may also occur during switch initialization (e.g., during recovery after a switch failure). If you find this to be a problem, specify an appropriate value to the updelay bonding module option to delay the use of the relevant interface(s).
Second, some switches may "bounce" the link state one or more times while a link is changing state. This occurs most commonly while the switch is initializing. Again, an appropriate updelay value may help.
Note that when a bonding interface has no active links, the driver will immediately reuse the first link that goes up, even if the updelay parameter has been specified (the updelay is ignored in this case). If there are slave interfaces waiting for the updelay timeout to expire, the interface that first went into that state will be immediately reused. This reduces down time of the network if the value of updelay has been overestimated, and since this occurs only in cases with no connectivity, there is no additional penalty for ignoring the updelay.
In addition to the concerns about switch timings, if your switches take a long time to go into backup mode, it may be desirable to not activate a backup interface immediately after a link goes down. Failover may be delayed via the downdelay bonding module option.
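The updelay rules described above, including the exception when the bond has no active links, can be summarized in a toy model. This is a minimal sketch and not the driver's state machine: select_usable is a hypothetical function, times are plain millisecond counters, and polling granularity and downdelay are not modeled.

```python
def select_usable(link_up_since, now_ms, updelay_ms):
    """Return the slaves the bond may use at time now_ms.

    link_up_since maps a slave name to the time (ms) its carrier came up,
    or None if the link is down. Toy model for illustration only.
    """
    up = {name: t for name, t in link_up_since.items() if t is not None}
    ready = sorted(name for name, t in up.items() if now_ms - t >= updelay_ms)
    if ready:
        return ready
    if up:
        # No active links at all: updelay is ignored, and the slave that
        # entered the waiting state first is reused immediately.
        return [min(up, key=up.get)]
    return []


# eth0 has waited out the updelay, eth1 has not yet.
print(select_usable({"eth0": 0, "eth1": 4000}, now_ms=5000, updelay_ms=2000))
# Both are still waiting, so the one that came up first is used at once.
print(select_usable({"eth0": 4500, "eth1": 4000}, now_ms=5000, updelay_ms=2000))
```

The second call shows why an overestimated updelay does not extend an outage: with no other connectivity, waiting would gain nothing.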
It is not uncommon to observe a short burst of duplicated traffic when the bonding device is first used, or after it has been idle for some period of time. This is most easily observed by issuing a "ping" to some other host on the network, and noticing that the output from ping flags duplicates (typically one per slave).
For example, on a bond in active-backup mode with five slaves all connected to one switch, the output may appear as follows:
# ping -n 10.0.4.2
PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data.
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms
64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms
64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms
This is not due to an error in the bonding driver; rather, it is a side effect of how many switches update their MAC forwarding tables. Initially, the switch does not associate the MAC address in the packet with a particular switch port, and so it may send the traffic to all ports until its MAC forwarding table is updated. Since the interfaces attached to the bond may occupy multiple ports on a single switch, when the switch (temporarily) floods the traffic to all ports, the bond device receives multiple copies of the same packet (one per slave device).
The duplicated packet behavior is switch dependent: some switches exhibit it, and some do not. On switches that display this behavior, it can be induced by clearing the MAC forwarding table (on most Cisco switches, the privileged command "clear mac address-table dynamic" will accomplish this).
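The flooding explanation can be reproduced with a toy learning switch. The sketch below is a simplified model, not a description of any particular switch: the LearningSwitch class, the port numbers and the placeholder MAC strings are assumptions, and the forwarding table entry for the bond is assumed to be missing when the first reply arrives (for example, after it has aged out or been cleared).

```python
class LearningSwitch:
    """Toy learning switch: floods frames whose destination MAC is unknown."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def forward(self, src_mac, dst_mac, in_port):
        """Return the list of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port  # learn where the sender lives
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]
        return [p for p in self.ports if p != in_port]  # unknown: flood


BOND_MAC, PEER_MAC = "bond0-mac", "peer-mac"  # placeholder addresses
bond_ports = [1, 2, 3, 4, 5]                  # five slaves on one switch
switch = LearningSwitch(ports=bond_ports + [6])  # peer attached to port 6

# A reply arrives while the switch has no entry for the bond's MAC: flooded.
first = switch.forward(PEER_MAC, BOND_MAC, in_port=6)
copies_first = sum(1 for p in first if p in bond_ports)

# The bond transmits on its active slave (port 1); the switch learns its MAC.
switch.forward(BOND_MAC, PEER_MAC, in_port=1)
second = switch.forward(PEER_MAC, BOND_MAC, in_port=6)
copies_second = sum(1 for p in second if p in bond_ports)

print("copies reaching the bond before learning:", copies_first)
print("copies reaching the bond after learning:", copies_second)
```

With five slaves, the flooded frame reaches the bond five times, which corresponds to the one reply plus four duplicates in the ping output above; once the table is populated, a single copy arrives.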