Original link:
https://github.com/tnaganawa/tungstenfabric-docs/blob/master/TungstenFabricKnowledgeBase.md
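Author: Tatsuya Naganawa. Translator: TF编译组
This series is a companion to "Tungsten Fabric入门宝典" and covers assorted topics related to Tungsten Fabric deployment.
vhost0 device
When vRouter first starts, the vhost0 interface is created. The IP address originally assigned to the physical interface is moved to vhost0, and vhost0 takes the same MAC address as the physical interface.
So it is natural to assume that vhost0 is the vRouter itself: that it answers ARP requests from the external fabric, and that traffic first passes through vhost0 before reaching the VM.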
transit traffic:
vm - vhost0 - eth0
self traffic:
vhost0 - eth0
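In reality, this is not the case.
As an illustration, when tcpdump is run on vhost0 for overlay traffic such as VXLAN, some packets do not show up; tcpdump has to be run on the physical interface to see them.
This document also helps in understanding the behavior: https://wiki.tungsten.io/display/TUN/Offloads?preview=%2F1409118%2F1409513%2FTungsten+Fabric+ParaVirt+Offload+Arch.DOCX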
transit traffic:
vm - (dp-core) - eth0
self traffic:
vhost0 - (dp-core) - eth0
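vhost0 is similar to an irb in a bridge-domain served by dp-core, and eth0 is one of the L2 interfaces in that bridge-domain.
- In vRouter terminology this state is called "xconnect (cross-connect)"; as far as I understand, it is similar to bridging: https://github.com/tungstenfabric/tf-vrouter/blob/master/dp-core/vr_interface.c
- A bridge-domain is a concept similar to a Linux bridge: it can have multiple physical L2 interfaces and one internal L3 interface.
So when eth0 first receives an ARP request from the fabric, dp-core returns an ARP response based on the MAC address originally assigned to eth0.
Other compute nodes then send traffic to this vRouter node, either overlay traffic or self traffic.
For overlay traffic (which dp-core identifies by UDP port or GRE header), dp-core strips the outer IP header and the label, then does VRF routing toward the specific VM indicated by the label.
- With L3 VXLAN, it does a route lookup in the routing table of the L3 VRF
- With MPLS, the label itself identifies the final interface
When dp-core receives self traffic, it uses hif_rx in vhost_tx (which in turn calls the Linux function netif_rx with the skb as its argument) to deliver the traffic to the Linux interface on the vRouter node, that is, vhost0.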
- https://github.com/tungstenfabric/tf-vrouter/blob/master/dp-core/vr_interface.c#L813
- https://github.com/tungstenfabric/tf-vrouter/blob/master/linux/vr_host_interface.c#L2380
- https://github.com/tungstenfabric/tf-vrouter/blob/master/linux/vr_host_interface.c#L228
So for self traffic, packets always pass through dp-core in both the rx and tx directions, while transit traffic never passes through vhost0.
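The two paths described above can be sketched as a toy classifier. This is only an illustration under stated assumptions, not dp-core code: the function names, enums, and the port and protocol numbers (IANA-assigned values for GRE, VXLAN, and MPLS-over-UDP) are my own choices, and the ports a given vRouter build actually matches on may differ. A real datapath would also check the outer destination address and parse the label.

```c
#include <stdint.h>

/* Assumed identifiers: IANA-assigned values, not read from vRouter source. */
#define PROTO_UDP          17
#define PROTO_GRE          47
#define PORT_VXLAN         4789   /* RFC 7348 */
#define PORT_MPLS_OVER_UDP 6635   /* RFC 7510 */

/* Hypothetical classification of a packet received on the physical interface. */
enum pkt_class {
    PKT_SELF,            /* traffic for the vRouter node itself */
    PKT_OVERLAY_GRE,     /* overlay traffic carried in GRE */
    PKT_OVERLAY_UDP,     /* overlay traffic carried as MPLS over UDP */
    PKT_OVERLAY_VXLAN    /* overlay traffic carried in VXLAN */
};

/* Hypothetical egress choice after dp-core has processed the packet. */
enum egress {
    EGRESS_VHOST0,       /* handed to the Linux stack through vhost0 */
    EGRESS_VM_VIF        /* forwarded to a VM interface, bypassing vhost0 */
};

/* Classify by outer IP protocol and, for UDP, by destination port. */
enum pkt_class classify_outer(uint8_t ip_proto, uint16_t udp_dport)
{
    if (ip_proto == PROTO_GRE)
        return PKT_OVERLAY_GRE;
    if (ip_proto == PROTO_UDP && udp_dport == PORT_VXLAN)
        return PKT_OVERLAY_VXLAN;
    if (ip_proto == PROTO_UDP && udp_dport == PORT_MPLS_OVER_UDP)
        return PKT_OVERLAY_UDP;
    return PKT_SELF;
}

/* Only self traffic ever reaches vhost0; every overlay class goes to a VM. */
enum egress egress_for(enum pkt_class c)
{
    return c == PKT_SELF ? EGRESS_VHOST0 : EGRESS_VM_VIF;
}
```

For instance, `classify_outer(17, 4789)` is treated as VXLAN overlay traffic and goes to a VM interface, while a TCP segment (protocol 6) falls through to `PKT_SELF` and ends up on vhost0.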
skb to vr_packet
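The Linux network stack uses sk_buff as the in-memory storage for packets.
dp-core uses vr_packet instead, so how the two are converted is an interesting topic.
The vp_os_packet function is used for this.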
- https://github.com/tungstenfabric/tf-vrouter/blob/master/include/vr_linux.h#L10
static inline struct sk_buff *
vp_os_packet(struct vr_packet *pkt)
{
    return CONTAINER_OF(cb, struct sk_buff, pkt);
}
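So vr_packet actually lives inside the skb structure, in sk_buff->cb, a control buffer that whichever layer currently owns the skb may use for its private data. As a result, skb and vr_packet can be converted to each other with pointer arithmetic.
Note that cb is at most 48 bytes, so vr_packet cannot be larger than that. The source has a comment about this: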
https://github.com/tungstenfabric/tf-vrouter/blob/master/include/vr_packet.h#L195-L198
/*
 * NOTE: Please do not add any more fields without ensuring
 * that the size is <= 48 bytes in 64 bit systems.
 */
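The pointer conversion and the 48-byte limit can be tried in user space with a minimal sketch. Everything here is a stand-in: `fake_skb`, `fake_pkt`, and their fields are hypothetical and far smaller than the real `struct sk_buff` and `struct vr_packet`; only the idea (a private struct stored in `cb`, recovered with `offsetof`) mirrors the text. The `aligned` attribute is GCC/Clang syntax.

```c
#include <stddef.h>
#include <stdint.h>

#define CB_SIZE 48

/* Hypothetical stand-in for struct sk_buff: only the cb member matters. */
struct fake_skb {
    void *next;
    unsigned int len;
    char cb[CB_SIZE] __attribute__((aligned(8)));
};

/* Hypothetical stand-in for struct vr_packet; the real fields differ. */
struct fake_pkt {
    unsigned char *head;
    uint16_t data_off;
    uint16_t len;
    uint32_t flags;
};

/* Compile-time version of the "<= 48 bytes" rule from the source comment. */
_Static_assert(sizeof(struct fake_pkt) <= CB_SIZE,
               "fake_pkt must fit inside cb");

/* skb -> packet: the packet metadata is simply stored in cb. */
struct fake_pkt *skb_to_pkt(struct fake_skb *skb)
{
    return (struct fake_pkt *)skb->cb;
}

/* packet -> skb: step back by the offset of cb inside the outer struct. */
struct fake_skb *pkt_to_skb(struct fake_pkt *pkt)
{
    return (struct fake_skb *)((char *)pkt - offsetof(struct fake_skb, cb));
}
```

Because no copy is involved, `pkt_to_skb(skb_to_pkt(&skb))` returns `&skb` again, which is what makes the conversion cheap enough to do per packet. If a field is added to `fake_pkt` that pushes it past 48 bytes, the `_Static_assert` fails at compile time.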
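Linux interfaces created by vRouter
Several interfaces are created when the vrouter-agent container first starts, and they are not actually deleted even when vrouter-agent stops.
What they are used for is an interesting topic.
In summary, a vif interface in vrouter.ko is always tied to a corresponding Linux netdevice, so creating a vRouter interface with vif --create and the like also creates a Linux netdevice, which can be seen with ip link or ls /sys/class/net.
An example using "ip tuntap list":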
[root@ip-172-31-12-55 ~]# ip -o a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
2: ens3 inet6 fe80::46c:bff:fec8:dd64/64 scope link \ valid_lft forever preferred_lft forever
3: docker0 inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\ valid_lft forever preferred_lft forever
16: vhost0 inet 172.31.12.55/20 brd 172.31.15.255 scope global dynamic vhost0\ valid_lft 3118sec preferred_lft 3118sec
16: vhost0 inet6 fe80::46c:bff:fec8:dd64/64 scope link \ valid_lft forever preferred_lft forever
17: pkt0 inet6 fe80::5094:6cff:fefb:42f7/64 scope link \ valid_lft forever preferred_lft forever
[root@ip-172-31-12-55 ~]# ip tuntap list
pkt0: tap
[root@ip-172-31-12-55 ~]#
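So from the Linux point of view, pkt0 is actually a tap device.
In a sense, the vif command makes vRouter attach a vRouter interface to a Linux netdevice created by something like nova-vif-driver (for example tapxxxx-xxxx), so that packets passing through that device are received by dp-core.
So when CNI or a similar component finds a tap device connected to a container, it sends a request to vrouter-api, which internally creates a vif with the same name as the tap device, so that packets entering the tap device are forwarded to vRouter (dp-core).
Some special devices are created when vrouter-agent starts: vhost0, pkt0, pkt1, pkt2, and pkt3.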
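To check from a program which of these devices currently exist as Linux netdevices, a small sketch like the following could be used. It relies only on the POSIX call `if_nametoindex`, which returns 0 when no interface has the given name; the helper names `netdev_exists` and `count_present` are hypothetical, and the device list is simply the one named in the text.

```c
#include <net/if.h>
#include <stddef.h>

/* Device names taken from the text; which ones exist depends on the node. */
const char *const vrouter_devs[] = { "vhost0", "pkt0", "pkt1", "pkt2", "pkt3" };
#define VROUTER_DEV_COUNT (sizeof(vrouter_devs) / sizeof(vrouter_devs[0]))

/* Returns 1 if a netdevice with this name exists, 0 otherwise. */
int netdev_exists(const char *name)
{
    return if_nametoindex(name) != 0;
}

/* Counts how many of the given names exist as netdevices right now. */
size_t count_present(const char *const *names, size_t n)
{
    size_t present = 0;

    for (size_t i = 0; i < n; i++) {
        if (netdev_exists(names[i]))
            present++;
    }
    return present;
}
```

On a node without vRouter, `count_present(vrouter_devs, VROUTER_DEV_COUNT)` would be expected to return 0; the value on a vRouter node depends on which of the steps shown later (modprobe, ifup, agent start) have completed.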
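As described earlier, vhost0 is similar to dp-core's irb interface, so it receives packets destined for the vRouter node itself once dp-core has finished routing.
Because the vrouter-agent container creates /etc/sysconfig/network-scripts/{ifup-vhost,ifdown-vhost} at startup, vhost0 can be controlled directly with ifup / ifdown. Internally these scripts run vif --add vhost0, so the interface can also be created and deleted directly from the command line.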
https://github.com/tungstenfabric/tf-container-builder/blob/master/containers/vrouter/base/network-functions-vrouter-kernel#L41
pkt1, pkt2, and pkt3 are interfaces defined in linux_pkt_dev_alloc, which is reached from vrouter_linux_init, the module_init of vrouter.ko.
- https://github.com/tungstenfabric/tf-vrouter/blob/master/linux/vr_host_interface.c#L2485
linux/vrouter_mod.c
module_init(vrouter_linux_init);

static int
linux_pkt_dev_alloc(void)
{
    if (pkt_gro_dev == NULL) {
        pkt_gro_dev = linux_pkt_dev_init("pkt1", &pkt_gro_dev_setup,
                &pkt_gro_dev_rx_handler);
        if (pkt_gro_dev == NULL) {
            vr_module_error(-ENOMEM, __FUNCTION__, __LINE__, 0);
            return -ENOMEM;
        }
    }

    if (pkt_l2_gro_dev == NULL) {
        pkt_l2_gro_dev = linux_pkt_dev_init("pkt3", &pkt_l2_gro_dev_setup,
                &pkt_gro_dev_rx_handler);
        if (pkt_l2_gro_dev == NULL) {
            vr_module_error(-ENOMEM, __FUNCTION__, __LINE__, 0);
            return -ENOMEM;
        }
    }

    if (pkt_rps_dev == NULL) {
        pkt_rps_dev = linux_pkt_dev_init("pkt2", &pkt_rps_dev_setup,
                &pkt_rps_dev_rx_handler);
        if (pkt_rps_dev == NULL) {
            vr_module_error(-ENOMEM, __FUNCTION__, __LINE__, 0);
            return -ENOMEM;
        }
    }

    return 0;
}
These devices are used for GRO and RPS, which are important for kernel vRouter performance.
- They are initialized with an empty net_device_ops and a random ethernet address.
linux/vr_host_interface.c
/*
 * pkt_rps_dev_ops - netdevice operations on RPS packet device. Currently,
 * no operations are needed, but an empty structure is required to
 * register the device.
 *
 */
static struct net_device_ops pkt_rps_dev_ops;
(snip)
/*
 * pkt_rps_dev_setup - fill in the relevant fields of the RPS packet device
 */
static void
pkt_rps_dev_setup(struct net_device *dev)
{
    /*
     * Initializing the interfaces with basic parameters to setup address
     * families.
     */
    random_ether_addr(dev->dev_addr);
    dev->addr_len = ETH_ALEN;
    dev->hard_header_len = ETH_HLEN;
    dev->type = ARPHRD_VOID;
    dev->netdev_ops = &pkt_rps_dev_ops;
    dev->mtu = 65535;

    return;
}
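pkt0 is slightly different: it is used to send packets from dp-core to vrouter-agent.
It is created at vrouter-agent's request when vrouter-agent first starts, as the tap device used for communication with vrouter-agent.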
- https://github.com/Juniper/contrail-controller/blob/master/src/vnsw/agent/contrail/contrail_agent_init.cc#L89
- https://github.com/Juniper/contrail-controller/blob/master/src/vnsw/agent/oper/interface.cc#L626
- https://github.com/tungstenfabric/tf-vrouter/blob/master/dp-core/vr_interface.c#L669
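So when dp-core sends a packet to this interface, vrouter-agent receives it and processes it internally (ARP, DHCP, and so on are handled this way).
As another illustration of this behavior, here is the output of ip -o a, ip link, and vif --list after each of the following steps: modprobe vrouter, ifup vhost0, and vrouter-agent startup.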
# docker-compose -f /etc/contrail/vrouter/docker-compose.yaml down
# ifdown vhost0
# modprobe vrouter
[root@ip-172-31-12-55 ~]# ip -o a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
2: ens3 inet 172.31.12.55/20 brd 172.31.15.255 scope global dynamic ens3\ valid_lft 3561sec preferred_lft 3561sec
2: ens3 inet6 fe80::46c:bff:fec8:dd64/64 scope link \ valid_lft forever preferred_lft forever
3: docker0 inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\ valid_lft forever preferred_lft forever
[root@ip-172-31-12-55 ~]#
[root@ip-172-31-12-55 ~]# ip link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 06:6c:0b:c8:dd:64 brd ff:ff:ff:ff:ff:ff
3: docker0: mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:34:e8:c3:14 brd ff:ff:ff:ff:ff:ff
9: pkt1: <> mtu 65535 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/void be:f9:01:0e:4d:38 brd 00:00:00:00:00:00
10: pkt3: <> mtu 65535 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/void 46:f8:5c:cb:79:8e brd 00:00:00:00:00:00
11: pkt2: <> mtu 65535 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/void a2:b0:40:5c:03:d4 brd 00:00:00:00:00:00
[root@ip-172-31-12-55 ~]#
[root@ip-172-31-12-55 ~]# vif --list
Vrouter Interface Table
Flags: P=Policy, X=Cross Connect, S=Service Chain, Mr=Receive Mirror
Mt=Transmit Mirror, Tc=Transmit Checksum Offload, L3=Layer 3, L2=Layer 2
D=DHCP, Vp=Vhost Physical, Pr=Promiscuous, Vnt=Native Vlan Tagged
Mnp=No MAC Proxy, Dpdk=DPDK PMD Interface, Rfl=Receive Filtering Offload, Mon=Interface is Monitored
Uuf=Unknown Unicast Flood, Vof=VLAN insert/strip offload, Df=Drop New Flows, L=MAC Learning Enabled
Proxy=MAC Requests Proxied Always, Er=Etree Root, Mn=Mirror without Vlan Tag, HbsL=HBS Left Intf
HbsR=HBS Right Intf, Ig=Igmp Trap Enabled
vif0/4350 OS: pkt3
Type:Stats HWaddr:00:00:00:00:00:00 IPaddr:0.0.0.0
Vrf:65535 Mcast Vrf:65535 Flags:L3L2 QOS:0 Ref:1
RX packets:0 bytes:0 errors:0
TX packets:0 bytes:0 errors:0
Drops:0
vif0/4351 OS: pkt1
Type:Stats HWaddr:00:00:00:00:00:00 IPaddr:0.0.0.0
Vrf:65535 Mcast Vrf:65535 Flags:L3L2 QOS:0 Ref:1
RX packets:0 bytes:0 errors:0
TX packets:0 bytes:0 errors:0
Drops:0
[root@ip-172-31-12-55 ~]#
# ifup vhost0
[root@ip-172-31-12-55 ~]# ip -o a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
2: ens3 inet6 fe80::46c:bff:fec8:dd64/64 scope link \ valid_lft forever preferred_lft forever
3: docker0 inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\ valid_lft forever preferred_lft forever
12: vhost0 inet 172.31.12.55/20 brd 172.31.15.255 scope global dynamic vhost0\ valid_lft 3594sec preferred_lft 3594sec
12: vhost0 inet6 fe80::46c:bff:fec8:dd64/64 scope link \ valid_lft forever preferred_lft forever
[root@ip-172-31-12-55 ~]#
[root@ip-172-31-12-55 ~]#
[root@ip-172-31-12-55 ~]# ip link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 06:6c:0b:c8:dd:64 brd ff:ff:ff:ff:ff:ff
3: docker0: mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:34:e8:c3:14 brd ff:ff:ff:ff:ff:ff
9: pkt1: mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void be:f9:01:0e:4d:38 brd 00:00:00:00:00:00
10: pkt3: mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void 46:f8:5c:cb:79:8e brd 00:00:00:00:00:00
11: pkt2: mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void a2:b0:40:5c:03:d4 brd 00:00:00:00:00:00
12: vhost0: mtu 9001 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 06:6c:0b:c8:dd:64 brd ff:ff:ff:ff:ff:ff
[root@ip-172-31-12-55 ~]#
[root@ip-172-31-12-55 ~]# vif --list
Vrouter Interface Table
Flags: P=Policy, X=Cross Connect, S=Service Chain, Mr=Receive Mirror
Mt=Transmit Mirror, Tc=Transmit Checksum Offload, L3=Layer 3, L2=Layer 2
D=DHCP, Vp=Vhost Physical, Pr=Promiscuous, Vnt=Native Vlan Tagged
Mnp=No MAC Proxy, Dpdk=DPDK PMD Interface, Rfl=Receive Filtering Offload, Mon=Interface is Monitored
Uuf=Unknown Unicast Flood, Vof=VLAN insert/strip offload, Df=Drop New Flows, L=MAC Learning Enabled
Proxy=MAC Requests Proxied Always, Er=Etree Root, Mn=Mirror without Vlan Tag, HbsL=HBS Left Intf
HbsR=HBS Right Intf, Ig=Igmp Trap Enabled
vif0/2 OS: ens3 (Speed 10000, Duplex 1)
Type:Physical HWaddr:06:6c:0b:c8:dd:64 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:XTcL3L2Vp QOS:0 Ref:1
RX packets:54 bytes:13325 errors:0
TX packets:39 bytes:4452 errors:0
Drops:0
vif0/16 OS: vhost0
Type:Host HWaddr:06:6c:0b:c8:dd:64 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:XL3L2 QOS:0 Ref:1
RX packets:39 bytes:4452 errors:0
TX packets:54 bytes:13325 errors:0
Drops:0
vif0/4350 OS: pkt3
Type:Stats HWaddr:00:00:00:00:00:00 IPaddr:0.0.0.0
Vrf:65535 Mcast Vrf:65535 Flags:L3L2 QOS:0 Ref:1
RX packets:0 bytes:0 errors:0
TX packets:0 bytes:0 errors:0
Drops:0
vif0/4351 OS: pkt1
Type:Stats HWaddr:00:00:00:00:00:00 IPaddr:0.0.0.0
Vrf:65535 Mcast Vrf:65535 Flags:L3L2 QOS:0 Ref:1
RX packets:0 bytes:0 errors:0
TX packets:0 bytes:0 errors:0
Drops:0
[root@ip-172-31-12-55 ~]#
# docker-compose -f /etc/contrail/vrouter/docker-compose.yaml up -d
[root@ip-172-31-12-55 ~]# ip -o a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
2: ens3 inet6 fe80::46c:bff:fec8:dd64/64 scope link \ valid_lft forever preferred_lft forever
3: docker0 inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\ valid_lft forever preferred_lft forever
16: vhost0 inet 172.31.12.55/20 brd 172.31.15.255 scope global dynamic vhost0\ valid_lft 3552sec preferred_lft 3552sec
16: vhost0 inet6 fe80::46c:bff:fec8:dd64/64 scope link \ valid_lft forever preferred_lft forever
17: pkt0 inet6 fe80::5094:6cff:fefb:42f7/64 scope link \ valid_lft forever preferred_lft forever
[root@ip-172-31-12-55 ~]#
[root@ip-172-31-12-55 ~]# ip link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 06:6c:0b:c8:dd:64 brd ff:ff:ff:ff:ff:ff
3: docker0: mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:34:e8:c3:14 brd ff:ff:ff:ff:ff:ff
13: pkt1: mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void 36:72:98:97:9b:31 brd 00:00:00:00:00:00
14: pkt3: mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void 92:aa:52:e8:d5:c5 brd 00:00:00:00:00:00
15: pkt2: mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/void 42:b2:46:73:3d:6c brd 00:00:00:00:00:00
16: vhost0: mtu 9001 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 06:6c:0b:c8:dd:64 brd ff:ff:ff:ff:ff:ff
17: pkt0: mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 52:94:6c:fb:42:f7 brd ff:ff:ff:ff:ff:ff
[root@ip-172-31-12-55 ~]#
[root@ip-172-31-12-55 ~]# vif --list
Vrouter Interface Table
Flags: P=Policy, X=Cross Connect, S=Service Chain, Mr=Receive Mirror
Mt=Transmit Mirror, Tc=Transmit Checksum Offload, L3=Layer 3, L2=Layer 2
D=DHCP, Vp=Vhost Physical, Pr=Promiscuous, Vnt=Native Vlan Tagged
Mnp=No MAC Proxy, Dpdk=DPDK PMD Interface, Rfl=Receive Filtering Offload, Mon=Interface is Monitored
Uuf=Unknown Unicast Flood, Vof=VLAN insert/strip offload, Df=Drop New Flows, L=MAC Learning Enabled
Proxy=MAC Requests Proxied Always, Er=Etree Root, Mn=Mirror without Vlan Tag, HbsL=HBS Left Intf
HbsR=HBS Right Intf, Ig=Igmp Trap Enabled
vif0/0 OS: ens3 (Speed 10000, Duplex 1) NH: 4
Type:Physical HWaddr:06:6c:0b:c8:dd:64 IPaddr:0.0.0.0
Vrf:0 Mcast Vrf:65535 Flags:TcL3L2VpEr QOS:-1 Ref:7
RX packets:165 bytes:97837 errors:0
TX packets:156 bytes:124911 errors:0
Drops:0
vif0/1 OS: vhost0 NH: 5
Type:Host HWaddr:06:6c:0b:c8:dd:64 IPaddr:172.31.12.55
Vrf:0 Mcast Vrf:65535 Flags:PL3DEr QOS:-1 Ref:8
RX packets:159 bytes:125878 errors:0
TX packets:192 bytes:98971 errors:0
Drops:7
vif0/2 OS: pkt0
Type:Agent HWaddr:00:00:5e:00:01:00 IPaddr:0.0.0.0
Vrf:65535 Mcast Vrf:65535 Flags:L3Er QOS:-1 Ref:3
RX packets:31 bytes:2666 errors:0
TX packets:34 bytes:13535 errors:0
Drops:0
vif0/4350 OS: pkt3
Type:Stats HWaddr:00:00:00:00:00:00 IPaddr:0.0.0.0
Vrf:65535 Mcast Vrf:65535 Flags:L3L2 QOS:0 Ref:1
RX packets:0 bytes:0 errors:0
TX packets:0 bytes:0 errors:0
Drops:0
vif0/4351 OS: pkt1
Type:Stats HWaddr:00:00:00:00:00:00 IPaddr:0.0.0.0
Vrf:65535 Mcast Vrf:65535 Flags:L3L2 QOS:0 Ref:1
RX packets:0 bytes:0 errors:0
TX packets:0 bytes:0 errors:0
Drops:0
[root@ip-172-31-12-55 ~]#
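The Flags field in the vif --list output above packs several tokens into one string, for example XTcL3L2Vp for ens3 after ifup vhost0 and TcL3L2VpEr once vrouter-agent is running (the X, Cross Connect, flag is gone at that point). A minimal sketch of a decoder follows. The token table is copied from the legend printed by vif --list above; the longest-match tokenization and the names `vif_flag` and `decode_vif_flags` are my own assumptions, not how the vif tool itself builds or parses the string.

```c
#include <stddef.h>
#include <string.h>

struct vif_flag {
    const char *token;
    const char *meaning;
};

/* Legend as printed by "vif --list" in the output above. */
static const struct vif_flag vif_flags[] = {
    { "P", "Policy" }, { "X", "Cross Connect" }, { "S", "Service Chain" },
    { "Mr", "Receive Mirror" }, { "Mt", "Transmit Mirror" },
    { "Tc", "Transmit Checksum Offload" }, { "L3", "Layer 3" },
    { "L2", "Layer 2" }, { "D", "DHCP" }, { "Vp", "Vhost Physical" },
    { "Pr", "Promiscuous" }, { "Vnt", "Native Vlan Tagged" },
    { "Mnp", "No MAC Proxy" }, { "Dpdk", "DPDK PMD Interface" },
    { "Rfl", "Receive Filtering Offload" }, { "Mon", "Interface is Monitored" },
    { "Uuf", "Unknown Unicast Flood" }, { "Vof", "VLAN insert/strip offload" },
    { "Df", "Drop New Flows" }, { "L", "MAC Learning Enabled" },
    { "Proxy", "MAC Requests Proxied Always" }, { "Er", "Etree Root" },
    { "Mn", "Mirror without Vlan Tag" }, { "HbsL", "HBS Left Intf" },
    { "HbsR", "HBS Right Intf" }, { "Ig", "Igmp Trap Enabled" },
};

/*
 * Split a Flags string into legend entries using longest-match (assumed).
 * Stores up to out_len entries in out and returns how many were found,
 * or -1 if an unknown token is met or out is too small.
 */
int decode_vif_flags(const char *flags, const struct vif_flag **out, int out_len)
{
    int n = 0;

    while (*flags != '\0') {
        const struct vif_flag *best = NULL;
        size_t best_len = 0;

        for (size_t i = 0; i < sizeof(vif_flags) / sizeof(vif_flags[0]); i++) {
            size_t len = strlen(vif_flags[i].token);

            /* Prefer the longest token, so "L3" wins over "L". */
            if (len > best_len && strncmp(flags, vif_flags[i].token, len) == 0) {
                best = &vif_flags[i];
                best_len = len;
            }
        }
        if (best == NULL || n == out_len)
            return -1;
        out[n++] = best;
        flags += best_len;
    }
    return n;
}
```

With this sketch, "XTcL3L2Vp" splits into X, Tc, L3, L2, Vp, and "PL3DEr" (vhost0 after the agent starts) splits into P, L3, D, Er, which makes it easier to compare the interface state between the three steps.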