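Copyright notice: this article may be freely reposted, provided the repost links back to the original source and includes the author information and this copyright notice (Author: Zhang Hua, published: 2018-09-03)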
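Real life is full of paradoxes. PMTUD was designed to negotiate the MSS when the MTUs along a path differ, yet many servers and intermediate routers mistakenly block ICMP type 3 (destination unreachable, whose code 4 is the "fragmentation needed" message that PMTUD relies on) or ICMP type 4, so PMTUD stops working. As a result, the clamp-mss-to-pmtu setting on many routers (iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu) also becomes ineffective, and TCP connections to certain sites with a mismatched MTU run into all kinds of baffling problems.
On the other hand, UDP is connectionless and therefore cannot negotiate an MSS, so some network devices (e.g. Nokia7) disable PMTUD (disable DF bit in the IP header of the sender, ip_no_pmtu_disc=1), which in turn also makes clamp-mss-to-pmtu on the router ineffective.
Note: my home WAN MTU is 1484. (Since I put a tplink router upstream of the OpenWRT router, the OpenWRT WAN MTU is 1500 and the tplink WAN MTU is 1480. Do the MSS-related iptables rules on OpenWRT need to be removed?)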
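I recently ran into what is probably the "wifi stalling" problem people often talk about. I bought a new phone running stock Android, and over wifi some apps feel sluggish. In particular, searching for products in the JD (京东) app keeps reporting that no network can be found, even though the network is clearly up. So I first ran a series of elimination tests:
These elimination tests convinced me that the problem is related only to this specific phone model, this specific OpenWRT router, and certain specific apps such as JD.
The JD app is just an upper-layer application. In theory only the following factors can affect an upper-layer application: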
#iptables -A FORWARD -j ACCEPT
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
root@OpenWrt:~# iptables-save |grep mss
:mssfix - [0:0]
-A FORWARD -j mssfix
-A FORWARD -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
-A mssfix -o pppoe-wan -p tcp -m tcp --tcp-flags SYN,RST SYN -m comment --comment "wan (mtu_fix)" -j TCPMSS --clamp-mss-to-pmtu
root@OpenWrt:~# tcpdump -ni br-lan src host 192.168.99.194 and dst host 111.13.24.129 and dst port 443
06:37:42.085674 IP 192.168.99.194.39494 > 111.13.24.129.443: Flags [S], seq 140081180, win 65535, options [mss 1460,sackOK,TS val 8629819 ecr 0,nop,wscale 8], length 0
06:37:42.092397 IP 192.168.99.194.39494 > 111.13.24.129.443: Flags [.], ack 2370816066, win 343, length 0
06:37:42.095245 IP 192.168.99.194.39494 > 111.13.24.129.443: Flags [P.], seq 0:173, ack 1, win 343, length 173
06:37:42.141194 IP 192.168.99.194.39494 > 111.13.24.129.443: Flags [.], ack 1453, win 354, length 0
06:37:42.141396 IP 192.168.99.194.39494 > 111.13.24.129.443: Flags [.], ack 2905, win 365, length 0
06:37:42.141536 IP 192.168.99.194.39494 > 111.13.24.129.443: Flags [.], ack 3472, win 377, length 0
06:37:42.147373 IP 192.168.99.194.39494 > 111.13.24.129.443: Flags [P.], seq 173:491, ack 3472, win 377, length 318
06:37:42.185607 IP 192.168.99.194.39494 > 111.13.24.129.443: Flags [P.], seq 491:1736, ack 3714, win 388, length 1245
06:37:42.194932 IP 192.168.99.194.39494 > 111.13.24.129.443: Flags [P.], seq 1736:1767, ack 4065, win 388, length 31
06:37:42.195258 IP 192.168.99.194.39494 > 111.13.24.129.443: Flags [R.], seq 1767, ack 4065, win 388, length 0
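The operating system on this phone does not set the ip_no_pmtu_disc parameter to negotiate the MSS, and the OpenWRT router happened to be missing one iptables rule (iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu), so the connection ran into the PPPoE 1492 MTU problem.
In other words, when I am away from home and the router I connect to does not have this rule, the problem will come back. Only the ip_no_pmtu_disc setting in the phone's operating system can fully solve the problem of some apps being unable to reach the network over wifi.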
iptables -t mangle -I POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
iptables -t mangle -I OUTPUT -o pppoe-wan -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
iptables -t mangle -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
ip6tables -t mangle -I POSTROUTING -o pppoe-wan -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
ip6tables -t mangle -I OUTPUT -o pppoe-wan -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
ip6tables -t mangle -I FORWARD -o pppoe-wan -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
In addition, for clamp-mss-to-pmtu to take effect, DF=1 must be allowed so that fragmentation is forbidden and PMTUD is enabled.
echo 0 >/proc/sys/net/ipv4/ip_no_pmtu_disc
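However, after running "echo 0 >/proc/sys/net/ipv4/ip_no_pmtu_disc", connectionless UDP packets still cannot negotiate an MSS, so when an oversized UDP packet reaches the router, the router simply drops it.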
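Consider docker containers created inside an OpenStack VM. Because the network is VXLAN, the VM's MTU is 1450; if docker0 has an MTU of 1500, the docker containers cannot communicate.
When the MTU is misconfigured, we have to rely on the MSS. Here the compute node acts as a router: when it finds that it cannot forward a 1500-byte packet sent by the docker container, it should send an ICMP error back to the container, or the OpenStack vrouter should do so.
The concrete symptom in the docker container is that downloads fail. The most likely scenario is that the largest packets reach the OpenStack vrouter, which receives 1500-byte packets on its external interface, tries to fragment them toward the VM, fails, and then sends an ICMP error to the external download server. At that point:
1, With non-DVR, snat-xxx on the l3-agent does send the ICMP error to the external download server: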
11:45:04.873959 fa:16:3e:c5:b3:ed > fe:54:00:36b4, ethertype IPv4 (0x0800), length 590: 100.64.1.1 > 120.146.233.220: ICMP 103.245.215.9 unreachable - need to frag (mtu 1450), length 556
2, With DVR, the FIP moves to the compute node, where the namespace is qrouter-xxx, but this namespace has no default route:
root@maas-node02:~# ip -n qrouter-1752c73a-be9f-4326-97cc-99dbe0988b3c rule show
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
57481: from 103.245.215.14 lookup 16
80000: from 103.245.215.0/28 lookup 16
root@maas-node02:~# ip -n qrouter-1752c73a-be9f-4326-97cc-99dbe0988b3c route show table 16
default via 169.254.106.115 dev rfp-1752c73a-b
root@maas-node02:~# ip -n qrouter-1752c73a-be9f-4326-97cc-99dbe0988b3c route show
103.245.215.0/28 dev qr-ec03268e-fb proto kernel scope link src 103.245.215.1
169.254.106.114/31 dev rfp-1752c73a-b proto kernel scope link src 169.254.106.114
Only after adding the following default route can the DVR qrouter-xxx namespace send the ICMP error out, which is what makes it possible to use the MSS:
root@maas-node02:~# ip -n qrouter-1752c73a-be9f-4326-97cc-99dbe0988b3c route add default via 169.254.106.115 dev rfp-1752c73a-b
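The resulting problem is that docker containers work in a non-DVR VM but fail in a DVR VM. There are three fixes:
1, Change the docker0 MTU to 1450. We cannot modify the bridge MTU directly, but we can add another tap device to docker0, because the bridge MTU is determined by the smallest MTU among its tap devices.
2, Run the following on the compute node: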
iptables -t mangle -A FORWARD -o ens3 -p tcp -m tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:65495 -j TCPMSS --clamp-mss-to-pmtu
3, Increase global-physnet-mtu to 1550 so that the real tenant network MTU can be 1500.
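The numbers in option 3 follow from the VXLAN encapsulation overhead. Below is a minimal sketch of that arithmetic, assuming VXLAN over IPv4 with no VLAN tag on the inner frame (50 bytes of overhead); the function and constant names are made up for illustration and are not OpenStack APIs.

```python
# Assumed per-packet overhead of VXLAN over IPv4, counted against the
# physical network MTU: outer IPv4 + outer UDP + VXLAN + inner Ethernet.
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14
VXLAN_OVERHEAD_IPV4 = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET


def tenant_mtu(physnet_mtu: int, overhead: int = VXLAN_OVERHEAD_IPV4) -> int:
    """Largest MTU a tenant (VM) interface can use on top of the physical MTU."""
    return physnet_mtu - overhead


print(tenant_mtu(1500))  # 1450, the VM MTU mentioned above
print(tenant_mtu(1550))  # 1500, what option 3 is aiming for
```

With a physical MTU of 1500 the VM is left with 1450, which is why docker0 at 1500 is too large unless the physical network is raised to 1550.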
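Note: the switches along the path must also be configured with a matching MTU. In particular, if a switch's MTU is too small, packets are dropped silently without any ICMP error (take special care: a router fragments a packet that exceeds its MTU, but a switch whose MTU is too small just drops it; the net.ipv4.tcp_mtu_probing parameter can also probe for this situation). Once the MTU is mismatched this way, the MSS mechanism cannot help either, and this kind of problem is even harder to track down. I have also seen cases with IPv6 where ICMPv6 was not allowed by the ip6tables rules.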
My MTU is 1484 = 1456 + 28 (the IP header is 20 bytes and the ICMP header is 8 bytes); gavin's MTU is 1492. This page (https://ubuntuforums.org/showthread.php?t=2341699) says:
"we are using a PPPoE connection the highest MTU we can get is 1492. And your ISP might not even go that high. The highest I could get mine was 1484." This page (https://www.gargoyle-router.com/phpbb/viewtopic.php?t=8787) also mentions that the MTU of the pppoe-wan interface is sometimes 8 bytes lower than the configured value.
root@OpenWrt:~# ip addr show pppoe-wan | grep mtu
608: pppoe-wan: mtu 1484 qdisc fq_codel state UNKNOWN group default qlen 3
hua@t440p:~$ ping -M do -i 1 -c 2 -s 1456 vps
1464 bytes from vps (xx.xx.xx.xx.xx): icmp_seq=1 ttl=57 time=209 ms
hua@t440p:~$ ping -M do -i 1 -c 2 -s 1457 vps
ping: local error: Message too long, mtu=1484
ubuntu@gavin-P70:~$ ping -M do -i 1 -c 2 -s 1464 vps
1472 bytes from xx.xx.xx.xx.xx: icmp_seq=1 ttl=58 time=48.5 ms
ubuntu@gavin-P70:~$ ping -M do -i 1 -c 2 -s 1465 vps
ping: local error: Message too long, mtu=1492
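The ping tests above probe the MTU by finding the largest payload that still passes with DF set. A minimal sketch of the conversion, assuming IPv4 without IP options (20-byte IP header) and the 8-byte ICMP echo header; the function names are illustrative only.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def mtu_from_ping_payload(largest_payload: int) -> int:
    """MTU implied by the largest 'ping -M do -s N' payload that succeeds."""
    return largest_payload + IPV4_HEADER + ICMP_HEADER


def largest_ping_payload(mtu: int) -> int:
    """Largest '-s' value expected to pass for a given MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER


print(mtu_from_ping_payload(1456))  # 1484 (t440p)
print(mtu_from_ping_payload(1464))  # 1492 (gavin-P70)
```

This matches the "mtu=1484" and "mtu=1492" values that ping itself reports once the payload is one byte too large.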
In the packet capture, gavin's MSS is 1452 (1492 - 40), but my MSS is 1460.
hua@t440p:~$ sudo tcpdump -s0 -p -ni eth0 src host 192.168.99.135 and dst host xx.xx.xx.xx and dst port 22 and '(ip and ip[20+13] & tcp-syn != 0)'
08:00:45.771287 IP 192.168.99.135.35172 > xx.xx.xx.xx.22: Flags [S], seq 1056384328, win 29200, options [mss 1460,sackOK,TS val 92453102 ecr 0,nop,wscale 7], length 0
hua@t440p:~$ sudo tcpdump -s0 -p -ni eth0 src host xx.xx.xx.xx and '(ip and ip[20+13] & tcp-syn != 0)'
08:01:31.252313 IP xx.xx.xx.xx.22 > 192.168.99.135.35182: Flags [S.], seq 1913298511, ack 4239124226, win 28960, options [mss 1452,sackOK,TS val 3432015413 ecr 92498502,nop,wscale 7], length 0
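The relationship between MTU and MSS used here can be written down directly. This is a small sketch assuming fixed 20-byte IPv4 and 20-byte TCP headers (40 bytes for IPv6, TCP options not counted); `mss_for_mtu` is a hypothetical helper name.

```python
TCP_HEADER = 20   # bytes, without TCP options
IPV4_HEADER = 20  # bytes, without IP options
IPV6_HEADER = 40  # bytes, fixed IPv6 header


def mss_for_mtu(mtu: int, ipv6: bool = False) -> int:
    """MSS a host would advertise for a link with the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


for mtu in (1500, 1492, 1484):
    print(mtu, mss_for_mtu(mtu))
```

An MTU of 1500 gives 1460, 1492 gives 1452, and 1484 gives 1444, so an advertised MSS of 1460 suggests it was derived from a 1500-byte interface rather than from the 1484-byte PPPoE link.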
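Why is my MSS 1460? It is probably derived from the MTU of 1500 on the t440p client, without taking the router's 1484 into account. I checked that the router is configured with "--clamp-mss-to-pmtu", but what if the peer, such as gavin's home network, does not support PMTU negotiation (for example, his tplink router or some other device does not set ip_no_pmtu_disc=0)?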
root@OpenWrt:~# iptables-save |grep mss
:mssfix - [0:0]
-A FORWARD -j mssfix
-A mssfix -o pppoe-wan -p tcp -m tcp --tcp-flags SYN,RST SYN -m comment --comment "wan (mtu_fix)" -j TCPMSS --clamp-mss-to-pmtu
Only after also adding the following clamp-mss-to-pmtu rule to POSTROUTING did the MSS change to 1444:
root@OpenWrt:~# iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
hua@t440p:~$ sudo tcpdump -s0 -p -ni eth0 src host xx.xx.xx.xx and '(ip and ip[20+13] & tcp-syn != 0)'
08:51:53.831560 IP xx.xx.xx.xx.22 > 192.168.99.135.36554: Flags [S.], seq 3638522412, ack 307088327, win 28960, options [mss 1444,sackOK,TS val 3432771057 ecr 95521097,nop,wscale 7], length 0
So the final settings on the router are:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
iptables -t mangle -A OUTPUT -o pppoe-wan -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
ip6tables -t mangle -A FORWARD -o pppoe-wan -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
ip6tables -t mangle -A OUTPUT -o pppoe-wan -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
ip6tables -t mangle -A POSTROUTING -o pppoe-wan -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
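After the change, the MSS values seen on the router's **pppoe-wan interface (1444 outbound, 1452 inbound) and on the t440p's eth0 interface (1460 outbound, 1444 inbound)** are not the same: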
root@OpenWrt:~# tcpdump -s0 -p -ni pppoe-wan src host xx.xx.xx.xx and '(ip and ip[20+13] & tcp-syn != 0)'
01:45:59.915458 IP xx.xx.xx.xx.22 > 10.2.156.119.37466: Flags [S.], seq 2070487858, ack 2656232535, win 28960, options [mss 1452,sackOK,TS val 3433582582 ecr 98767213,nop,wscale 7], length 0
root@OpenWrt:~# tcpdump -s0 -p -ni eth0 src host 192.168.99.135 and dst host xx.xx.xx.xx and dst port 22 and '(ip and ip[20+13] & tcp-syn != 0)'
01:45:59.835236 IP 10.2.156.119.37466 > xx.xx.xx.xx.22: Flags [S], seq 2656232534, win 29200, options [mss 1444,sackOK,TS val 98767213 ecr 0,nop,wscale 7], length 0
hua@t440p:~$ sudo tcpdump -s0 -p -ni eth0 src host xx.xx.xx.xx and '(ip and ip[20+13] & tcp-syn != 0)'
09:45:59.925637 IP xx.xx.xx.xx.22 > 192.168.99.135.37466: Flags [S.], seq 2070487858, ack 2656232535, win 28960, options [mss 1444,sackOK,TS val 3433582582 ecr 98767213,nop,wscale 7], length 0
hua@t440p:~$ sudo tcpdump -s0 -p -ni eth0 dst host xx.xx.xx.xx and dst port 22 and '(ip and ip[20+13] & tcp-syn != 0)'
09:45:59.842593 IP 192.168.99.135.37466 > xx.xx.xx.xx.22: Flags [S], seq 2656232534, win 29200, options [mss 1460,sackOK,TS val 98767213 ecr 0,nop,wscale 7], length 0
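The different values on the two interfaces are consistent with how the TCPMSS target is documented to behave: on a SYN it lowers the MSS option to the path MTU minus 40 (minus 60 for IPv6) and never raises it. A rough model of that rule, assuming the path MTU seen by the router is the 1484 of pppoe-wan; `clamp_mss_to_pmtu` is an illustrative name, not a kernel function.

```python
def clamp_mss_to_pmtu(advertised_mss: int, path_mtu: int, ipv6: bool = False) -> int:
    """Model of --clamp-mss-to-pmtu: cap the MSS option, never increase it."""
    limit = path_mtu - (60 if ipv6 else 40)
    return min(advertised_mss, limit)


# SYN from the t440p (mss 1460) leaving through pppoe-wan:
print(clamp_mss_to_pmtu(1460, 1484))  # 1444
# SYN-ACK from the server (mss 1452) being forwarded to the LAN:
print(clamp_mss_to_pmtu(1452, 1484))  # 1444
# An MSS that is already small enough is left alone:
print(clamp_mss_to_pmtu(1400, 1484))  # 1400
```

Each endpoint keeps advertising the MSS of its own interface, and only the copy that has passed through the router carries the clamped value.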
To set this on the t440p instead of the router, and only for ssh:
# https://blog.cloudflare.com/path-mtu-discovery-in-practice/
# sudo ip route change default via <> advmss 1444
sudo iptables -t mangle -A OUTPUT -p tcp --dport 22 -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1444
Alternatively, set the MTU on the t440p to 1484, so that the inbound and outbound MSS values on both the router and the t440p all become 1444.
sudo ifconfig eth0 mtu 1484
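With only two endpoints, MSS negotiation is enough, but when there is a series of routers and switches between them, PMTUD is needed to arrive at the right MSS. PMTUD works by setting DF=1 to forbid fragmentation while still sending large packets; a router that cannot forward such a packet should drop it and return an ICMP unreachable message. However, some routers have overly strict ICMP rules and may drop the ICMP unreachable messages. (Typical signs: the remote machine cannot be pinged, and the last hop in the output of "sudo mtr xxx -r --no-dns -P 22" shows 100% loss. A 100% loss figure there does not necessarily mean real packet loss; ICMP may be rate-limited or disabled.) At a minimum, the following should be allowed: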
# https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml
iptables -A INPUT -p icmp -m icmp --icmp-type 3 -j ACCEPT
iptables -A INPUT -p icmp -m icmp --icmp-type 4 -j ACCEPT
iptables -A INPUT -p icmp -m icmp --icmp-type 11 -j ACCEPT
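The PMTUD behaviour described above, and what goes wrong when the ICMP errors are filtered, can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions (a fixed list of hop MTUs, every ICMP error either delivered or blocked); it does not model the real kernel implementation, and all names are invented.

```python
from typing import List, Optional


def discover_path_mtu(local_mtu: int, hop_mtus: List[int],
                      icmp_blocked: bool = False,
                      max_rounds: int = 16) -> Optional[int]:
    """Return the discovered path MTU, or None for an ICMP black hole."""
    size = local_mtu
    for _ in range(max_rounds):
        # First hop that cannot forward a DF=1 packet of this size.
        bottleneck = next((mtu for mtu in hop_mtus if mtu < size), None)
        if bottleneck is None:
            return size  # the packet fits every hop
        if icmp_blocked:
            # The "fragmentation needed" error never reaches the sender,
            # so it keeps retransmitting the same oversized packet.
            return None
        # ICMP type 3 code 4 reports the next-hop MTU; shrink and retry.
        size = bottleneck
    return None


print(discover_path_mtu(1500, [1500, 1484, 1492]))                     # 1484
print(discover_path_mtu(1500, [1500, 1484, 1492], icmp_blocked=True))  # None
```

The second call is the black-hole case: small packets such as the handshake get through, while full-sized segments disappear without any error.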
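This is such a good article (https://www.zeitgeist.se/2013/11/26/mtu-woes-in-ipsec-tunnels-how-to-fix/ ) that I wish I had found it sooner. It makes two points that largely match what I had painfully worked out over a long time:
Reduce the MSS on the client side, then check the MSS value: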
# https://blog.cloudflare.com/path-mtu-discovery-in-practice/
# sudo ip route change default via <> advmss 1400
# sudo iptables -t mangle -A OUTPUT -p tcp --dport 22 -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1444
hua@t440p:~$ sudo iptables-save |grep mss
-A OUTPUT -p tcp -m tcp --dport 22 -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1444
hua@t440p:~$ sudo tcpdump -s0 -p -ni eth0 src host 192.168.99.135 and dst host xxx and dst port 22 and '(ip and ip[20+13] & tcp-syn != 0)'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
22:05:34.295036 IP 192.168.99.135.55502 > xxx.22: Flags [S], seq 2332107979, win 29200, options [mss 1444,sackOK,TS val 56741396 ecr 0,nop,wscale 7], length 0
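When comparing many captures, the MSS can be pulled out of tcpdump's text output instead of being read by eye. A small sketch, assuming the default one-line tcpdump format shown above; `mss_from_tcpdump_line` is a made-up helper.

```python
import re
from typing import Optional

# Matches the "mss <number>" entry inside tcpdump's TCP options list.
MSS_RE = re.compile(r"\bmss (\d+)\b")

SAMPLE_SYN = (
    "22:05:34.295036 IP 192.168.99.135.55502 > xxx.22: Flags [S], "
    "seq 2332107979, win 29200, options [mss 1444,sackOK,"
    "TS val 56741396 ecr 0,nop,wscale 7], length 0"
)


def mss_from_tcpdump_line(line: str) -> Optional[int]:
    """Return the advertised MSS in a tcpdump line, or None if absent."""
    match = MSS_RE.search(line)
    return int(match.group(1)) if match else None


print(mss_from_tcpdump_line(SAMPLE_SYN))  # 1444
```

Only SYN and SYN-ACK segments carry the MSS option, so other lines return None.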
Enable smart MTU black hole detection. RFC4821 proposes a mechanism to detect ICMP black holes and tries to adjust the path MTU in a smart way. To enable this on Linux type:
echo 1 > /proc/sys/net/ipv4/tcp_mtu_probing
echo 1024 > /proc/sys/net/ipv4/tcp_base_mss
Detecting packet loss (ping is not reliable for this):
hua@t440p:~$ netstat -s -p|grep -i segments
11519400 segments received
11624725 segments sent out
84339 segments retransmitted
980 bad segments received
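The counters above are easier to judge as a ratio. A minimal sketch that computes the share of retransmitted segments from the `netstat -s` text, assuming the English wording shown above (some net-tools versions print "segments send out", which the pattern also accepts); the helper name is illustrative.

```python
import re

SAMPLE = """\
11519400 segments received
11624725 segments sent out
84339 segments retransmitted
980 bad segments received
"""


def retransmit_ratio(netstat_output: str) -> float:
    """Fraction of sent TCP segments that were retransmissions."""
    sent = re.search(r"(\d+) segments sen[dt] out", netstat_output)
    retrans = re.search(r"(\d+) segments retransmitted", netstat_output)
    if sent is None or retrans is None:
        raise ValueError("TCP segment counters not found in netstat output")
    return int(retrans.group(1)) / int(sent.group(1))


print(f"{retransmit_ratio(SAMPLE):.2%}")  # 0.73%
```

These counters are cumulative since boot, so comparing two samples taken a few minutes apart gives a better picture of the current loss than a single reading.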
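The frequent ssh disconnects turned out not to be caused by the MTU issue above. They happened because, in whitelist mode, the IPs I ssh to had not been excluded with a RETURN rule.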