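A couple of days ago I provisioned a few new virtual machines and installed a cluster on them as usual, but the flannel step kept failing. The log showed: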
Failed to find any valid interface to use: failed to get default interface: Unable to find default route
flannel could not find a valid network interface. The fix is simple: specify the interface in flannel's startup arguments:
--iface="": interface to use (IP or name) for inter-host communication. Defaults to the interface for the default route on the machine. This can be specified multiple times to check each option in order. Returns the first match found.
--iface-regex="": regex expression to match the first interface to use (IP or name) for inter-host communication. If unspecified, will default to the interface for the default route on the machine. This can be specified multiple times to check each regex in order. Returns the first match found. This option is superseded by the iface option and will only be used if nothing matches any option specified in the iface options.
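The interface can be given as an exact name or matched with a regular expression. I maintain a general-purpose deployment script (https://github.com/QingyaFan/kubernetes-cluster-creator) that has to work without knowing the interface name in advance, so a regular expression fits this case better. Interfaces are usually named eth* or enp*, so I added --iface-regex="eth*|enp*" to flannel's startup arguments. Strictly speaking, that pattern is written like a shell glob: Go regular expressions are unanchored, so it matches any interface name containing `et` or `en`. An anchored form such as --iface-regex="^(eth|enp)" is stricter.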
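To see how a glob-style pattern behaves once it is treated as a regular expression, here is a minimal sketch using Go's standard `regexp` package. The interface names in the list are made up for illustration, `matchingIfaces` is a hypothetical helper rather than anything from flannel, and the anchored pattern `^(eth|enp)` is only a suggested alternative.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchingIfaces returns the names that the pattern matches, using
// unanchored matching (regexp.MatchString semantics).
func matchingIfaces(pattern string, names []string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out, nil
}

func main() {
	// Hypothetical interface names, as might appear on a Kubernetes node.
	names := []string{"lo", "eth0", "enp0s3", "veth1a2b", "docker0"}

	for _, pattern := range []string{`eth*|enp*`, `^(eth|enp)`} {
		matched, err := matchingIfaces(pattern, names)
		if err != nil {
			fmt.Println("bad pattern:", err)
			continue
		}
		fmt.Printf("%-12s matches %v\n", pattern, matched)
	}
}
```

The glob-style pattern also picks up `veth1a2b`, because `eth*` means "`et` followed by zero or more `h`" and can match anywhere in the name. The anchored pattern only accepts names that start with `eth` or `enp`.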
Oddly, though, I had never run into this before, so why now? I looked at the flannel source (main.go). The first part of the error message comes from here:
// Check the default interface only if no interfaces are specified
if len(opts.iface) == 0 && len(opts.ifaceRegex) == 0 {
	extIface, err = LookupExtIface("", "")
	if err != nil {
		log.Error("Failed to find any valid interface to use: ", err)
		os.Exit(1)
	}
}
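As you can see, when neither --iface nor --iface-regex is given, flannel looks for the external interface itself by calling LookupExtIface. If that fails, it logs the error from the start of this post, "Failed to find any valid interface to use", and the rest of the message is the error returned by LookupExtIface. Its logic looks like this: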
func LookupExtIface(ifname string, ifregex string) (*backend.ExternalInterface, error) {
	// ...
	if len(ifname) > 0 {
		// ...
	} else if len(ifregex) > 0 {
		// ...
	} else {
		log.Info("Determining IP address of default interface")
		if iface, err = ip.GetDefaultGatewayIface(); err != nil {
			return nil, fmt.Errorf("failed to get default interface: %s", err)
		}
	}
	// ...
}
With no name or regex given, it calls ip.GetDefaultGatewayIface(). When that call fails, the error is wrapped as "failed to get default interface", so the root cause lies in ip.GetDefaultGatewayIface(), which is defined as follows:
func GetDefaultGatewayIface() (*net.Interface, error) {
	routes, err := netlink.RouteList(nil, syscall.AF_INET)
	if err != nil {
		return nil, err
	}
	for _, route := range routes {
		if route.Dst == nil || route.Dst.String() == "0.0.0.0/0" {
			if route.LinkIndex <= 0 {
				return nil, errors.New("Found default route but could not determine interface")
			}
			return net.InterfaceByIndex(route.LinkIndex)
		}
	}
	return nil, errors.New("Unable to find default route")
}
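The final message, "Unable to find default route", is returned when the loop finishes without finding a default route. That means netlink.RouteList returned either an empty list or a list with no default route in it. Next, let's look at netlink.RouteList to see why that might happen.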
I cloned the vishvananda/netlink source from GitHub and found the definition of RouteList (route_linux.go, lines 684 to 702):
// RouteList gets a list of routes in the system.
// Equivalent to: `ip route show`.
// The list can be filtered by link and ip family.
func RouteList(link Link, family int) ([]Route, error) {
	return pkgHandle.RouteList(link, family)
}

// RouteList gets a list of routes in the system.
// Equivalent to: `ip route show`.
// The list can be filtered by link and ip family.
func (h *Handle) RouteList(link Link, family int) ([]Route, error) {
	var routeFilter *Route
	if link != nil {
		routeFilter = &Route{
			LinkIndex: link.Attrs().Index,
		}
	}
	return h.RouteListFiltered(family, routeFilter, RT_FILTER_OIF)
}
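First, to reproduce the problem, I wrote a small program using netlink, compiled it into a Linux executable with `GOOS=linux GOARCH=amd64 go build app.go`, copied it to the server and ran it. Sure enough, it came back empty: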
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"

	"github.com/vishvananda/netlink"
)

func GetDefaultGatewayIface() (*net.Interface, error) {
	routes, err := netlink.RouteList(nil, syscall.AF_INET)
	if err != nil {
		return nil, err
	}
	for _, route := range routes {
		if route.Dst == nil || route.Dst.String() == "0.0.0.0/0" {
			if route.LinkIndex <= 0 {
				return nil, errors.New("Found default route but could not determine interface")
			}
			return net.InterfaceByIndex(route.LinkIndex)
		}
	}
	return nil, errors.New("Unable to find default route")
}

func main() {
	fmt.Println(GetDefaultGatewayIface())
}
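The comment says the function is equivalent to `ip route show`, so I ran that command on the server nodes, and its output likewise had no "default via" entry. The problem therefore had to be in the servers' network configuration. These servers all run a minimal CentOS install, with the network configured in `/etc/sysconfig/network-scripts/ifcfg-enp*`. On inspection, GATEWAY had been misspelled as GETWAY. Seriously! After correcting it, everything worked.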
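A typo like this is easy to catch mechanically before running the installer. The following is only a rough sketch of such a check, not part of the deployment script: `parseIfcfg` is a hypothetical helper that reads simple KEY=VALUE lines, and the sample file content and addresses are invented.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseIfcfg reads simple KEY=VALUE lines, skipping blank lines and
// comments, and strips surrounding double quotes from values.
func parseIfcfg(content string) map[string]string {
	kv := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.TrimSpace(parts[0])
		val := strings.Trim(strings.TrimSpace(parts[1]), `"`)
		kv[key] = val
	}
	return kv
}

func main() {
	// Invented sample content containing the misspelled key.
	sample := `BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.1.10
NETMASK=255.255.255.0
GETWAY=192.168.1.1
`
	cfg := parseIfcfg(sample)
	if gw, ok := cfg["GATEWAY"]; ok {
		fmt.Println("GATEWAY is set to", gw)
	} else {
		fmt.Println("no GATEWAY key found; check the file for typos")
	}
}
```

In practice the content would be read from the real ifcfg file with `os.ReadFile`. A missing GATEWAY is not always a mistake (for example with DHCP), so this should be treated as a warning rather than a hard failure.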
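In principle, a server with a misconfigured GATEWAY cannot reach the external network, but because the script installs offline, I had never checked whether the servers could. With that, the problem was solved, and the offline install script can set up kubernetes clusters again.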
If you are interested in a script that installs a kubernetes cluster offline using only shell, take a look: https://github.com/QingyaFan/kubernetes-cluster-creator