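I spent most of the past week tracking down a DNS resolution failure in filebeat. filebeat is deployed as a DaemonSet in our k8s clusters and collects the logs of every pod on each host. In clusters whose host OS is CentOS 7.4 it works without any problem. In clusters whose host OS is CentOS 7.6, however, DNS resolution fails, so logs cannot be sent to the Kafka cluster.
The filebeat error log shows: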
Failed to connect to broker sg.main2.kafka.metis.service:9092: dial tcp: lookup sg.main2.kafka.metis.service: Try again
So the debugging began. My first suspect was coredns, so I exec'd into the pod and ran dig:
dig @10.247.3.10 sg.main2.kafka.metis.service
; <<>> DiG 9.12.4-P2 <<>> @10.247.3.10 sg.main2.kafka.metis.service
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44350
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;sg.main2.kafka.metis.service. IN A
;; ANSWER SECTION:
sg.main2.kafka.metis.service. 30 IN A 10.21.42.97
;; Query time: 1 msec
;; SERVER: 10.247.3.10#53(10.247.3.10)
;; WHEN: Sun Jan 05 14:13:26 UTC 2020
;; MSG SIZE rcvd: 101
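Resolution works fine from inside the pod, so the problem has to be in the code path that filebeat itself uses.
Time to bring out strace.
strace showed filebeat sending its DNS queries to 127.0.0.1 port 53. The result is predictable: resolution fails.
The next step is to compare this with the Go source code: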
// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// +build aix darwin dragonfly freebsd linux netbsd openbsd solaris

// Read system DNS config from /etc/resolv.conf

package net

import (
	"internal/bytealg"
	"os"
	"sync/atomic"
	"time"
)

var (
	defaultNS   = []string{"127.0.0.1:53", "[::1]:53"}
	getHostname = os.Hostname // variable for testing
)

type dnsConfig struct {
	servers       []string      // server addresses (in host:port form) to use
	search        []string      // rooted suffixes to append to local name
	ndots         int           // number of dots in name to trigger absolute lookup
	timeout       time.Duration // wait before giving up on a query, including retries
	attempts      int           // lost packets before giving up on server
	rotate        bool          // round robin among servers
	unknownOpt    bool          // anything unknown was encountered
	lookup        []string      // OpenBSD top-level database "lookup" order
	err           error         // any error that occurs during open of resolv.conf
	mtime         time.Time     // time of resolv.conf modification
	soffset       uint32        // used by serverOffset
	singleRequest bool          // use sequential A and AAAA queries instead of parallel queries
	useTCP        bool          // force usage of TCP for DNS resolutions
}

// See resolv.conf(5) on a Linux machine.
func dnsReadConfig(filename string) *dnsConfig {
	conf := &dnsConfig{
		ndots:    1,
		timeout:  5 * time.Second,
		attempts: 2,
	}
	file, err := open(filename)
	if err != nil {
		conf.servers = defaultNS
		conf.search = dnsDefaultSearch()
		conf.err = err
		return conf
	}
	defer file.close()
	if fi, err := file.file.Stat(); err == nil {
		conf.mtime = fi.ModTime()
	} else {
		conf.servers = defaultNS
		conf.search = dnsDefaultSearch()
		conf.err = err
		return conf
	}
	for line, ok := file.readLine(); ok; line, ok = file.readLine() {
		if len(line) > 0 && (line[0] == ';' || line[0] == '#') {
			// comment.
			continue
		}
		f := getFields(line)
		if len(f) < 1 {
			continue
		}
		switch f[0] {
		case "nameserver": // add one name server
			if len(f) > 1 && len(conf.servers) < 3 { // small, but the standard limit
				// One more check: make sure server name is
				// just an IP address. Otherwise we need DNS
				// to look it up.
				if parseIPv4(f[1]) != nil {
					conf.servers = append(conf.servers, JoinHostPort(f[1], "53"))
				} else if ip, _ := parseIPv6Zone(f[1]); ip != nil {
					conf.servers = append(conf.servers, JoinHostPort(f[1], "53"))
				}
			}

		case "domain": // set search path to just this domain
			if len(f) > 1 {
				conf.search = []string{ensureRooted(f[1])}
			}

		case "search": // set search path to given servers
			conf.search = make([]string, len(f)-1)
			for i := 0; i < len(conf.search); i++ {
				conf.search[i] = ensureRooted(f[i+1])
			}

		case "options": // magic options
			for _, s := range f[1:] {
				switch {
				case hasPrefix(s, "ndots:"):
					n, _, _ := dtoi(s[6:])
					if n < 0 {
						n = 0
					} else if n > 15 {
						n = 15
					}
					conf.ndots = n
				case hasPrefix(s, "timeout:"):
					n, _, _ := dtoi(s[8:])
					if n < 1 {
						n = 1
					}
					conf.timeout = time.Duration(n) * time.Second
				case hasPrefix(s, "attempts:"):
					n, _, _ := dtoi(s[9:])
					if n < 1 {
						n = 1
					}
					conf.attempts = n
				case s == "rotate":
					conf.rotate = true
				case s == "single-request" || s == "single-request-reopen":
					// Linux option:
					// http://man7.org/linux/man-pages/man5/resolv.conf.5.html
					// "By default, glibc performs IPv4 and IPv6 lookups in parallel [...]
					// This option disables the behavior and makes glibc
					// perform the IPv6 and IPv4 requests sequentially."
					conf.singleRequest = true
				case s == "use-vc" || s == "usevc" || s == "tcp":
					// Linux (use-vc), FreeBSD (usevc) and OpenBSD (tcp) option:
					// http://man7.org/linux/man-pages/man5/resolv.conf.5.html
					// "Sets RES_USEVC in _res.options.
					// This option forces the use of TCP for DNS resolutions."
					// https://www.freebsd.org/cgi/man.cgi?query=resolv.conf&sektion=5&manpath=freebsd-release-ports
					// https://man.openbsd.org/resolv.conf.5
					conf.useTCP = true
				default:
					conf.unknownOpt = true
				}
			}

		case "lookup":
			// OpenBSD option:
			// https://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man5/resolv.conf.5
			// "the legal space-separated values are: bind, file, yp"
			conf.lookup = f[1:]

		default:
			conf.unknownOpt = true
		}
	}
	if len(conf.servers) == 0 {
		conf.servers = defaultNS
	}
	if len(conf.search) == 0 {
		conf.search = dnsDefaultSearch()
	}
	return conf
}

// serverOffset returns an offset that can be used to determine
// indices of servers in c.servers when making queries.
// When the rotate option is enabled, this offset increases.
// Otherwise it is always 0.
func (c *dnsConfig) serverOffset() uint32 {
	if c.rotate {
		return atomic.AddUint32(&c.soffset, 1) - 1 // return 0 to start
	}
	return 0
}

func dnsDefaultSearch() []string {
	hn, err := getHostname()
	if err != nil {
		// best effort
		return nil
	}
	if i := bytealg.IndexByteString(hn, '.'); i >= 0 && i < len(hn)-1 {
		return []string{ensureRooted(hn[i+1:])}
	}
	return nil
}

func hasPrefix(s, prefix string) bool {
	return len(s) >= len(prefix) && s[:len(prefix)] == prefix
}

func ensureRooted(s string) string {
	if len(s) > 0 && s[len(s)-1] == '.' {
		return s
	}
	return s + "."
}
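The part of this file that matters for the 127.0.0.1 symptom is the fallback to defaultNS: whenever resolv.conf cannot be read, or yields no usable nameserver line, the pure Go resolver queries localhost. The sketch below isolates just that rule. The function name nameservers is hypothetical, it takes the file contents as a string instead of opening a file, and it uses net.ParseIP, so unlike the real code it does not accept IPv6 addresses with a zone.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// defaultNS mirrors the fallback list in the Go source quoted above.
var defaultNS = []string{"127.0.0.1:53", "[::1]:53"}

// nameservers is a simplified, hypothetical re-implementation of the
// "nameserver" handling in dnsReadConfig. It works on the file contents
// passed in as a string, so it never touches /etc/resolv.conf.
func nameservers(resolvConf string) []string {
	var servers []string
	sc := bufio.NewScanner(strings.NewReader(resolvConf))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		// Skip blank lines, comments, and every directive other than nameserver.
		if len(f) < 2 || f[0] != "nameserver" {
			continue
		}
		// Accept at most three servers, and only literal IP addresses.
		if len(servers) < 3 && net.ParseIP(f[1]) != nil {
			servers = append(servers, net.JoinHostPort(f[1], "53"))
		}
	}
	// No usable nameserver line: fall back to localhost, as the real code does.
	if len(servers) == 0 {
		return defaultNS
	}
	return servers
}

func main() {
	fmt.Println(nameservers("nameserver 10.247.3.10\noptions ndots:5\n"))
	fmt.Println(nameservers("# no nameserver line here\n"))
}
```

The first call prints [10.247.3.10:53]; the second prints [127.0.0.1:53 [::1]:53], the same localhost target that strace showed.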
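Since the same code runs fine in the CentOS 7.4 clusters, I suspected a compatibility problem between the alpine 3.8 base image and CentOS 7.6.
Go's DNS resolution supports two modes, cgo and pure Go. Perhaps some setting made Go resolve through cgo, and alpine uses musl, a rather unusual libc, which may be incompatible with CentOS 7.6.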
var lookupOrderName = map[hostLookupOrder]string{
	hostLookupCgo:      "cgo",
	hostLookupFilesDNS: "files,dns",
	hostLookupDNSFiles: "dns,files",
	hostLookupFiles:    "files",
	hostLookupDNS:      "dns",
}
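Here hostLookupCgo is a category of its own: it means calling libc's getaddrinfo directly to resolve the name.
Dial calls the name-resolution functions indirectly, while LookupHost and LookupAddr call them directly. The implementation differs per operating system. On Unix systems there are two ways to resolve a name:
- The pure Go resolver, which reads the list of local DNS server addresses from /etc/resolv.conf, sends DNS requests (UDP packets), and collects the answers.
- The cgo resolver, which ends up calling getaddrinfo or getnameinfo in the C standard library (not recommended, because it is unfriendly to goroutines).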
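Besides process-wide settings, the choice between the two resolvers can also be made per resolver in code. The sketch below uses the PreferGo field of net.Resolver from the standard library; the helper name pureGoResolver is hypothetical. It only affects lookups made through this resolver value, so it is an illustration of the mechanism rather than a fix for filebeat, whose Kafka client does its own dialing.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// pureGoResolver returns a resolver that asks for Go's built-in DNS client
// instead of the libc getaddrinfo path.
func pureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The host name is the broker from the error log; replace it as needed.
	addrs, err := pureGoResolver().LookupHost(ctx, "sg.main2.kafka.metis.service")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}
```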
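The GODEBUG environment variable selects Go's default DNS resolution method, pure Go or cgo:
export GODEBUG=netdns=go # force pure Go resolver
export GODEBUG=netdns=cgo # force cgo resolver
To confirm the hypothesis and trace Go's name-resolution flow, I forced export GODEBUG=netdns=go+9 and the problem disappeared. With export GODEBUG=netdns=cgo+9 the problem came back. So with go1.11 the binary takes the cgo path.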
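The go+9 and cgo+9 values combine two things: a resolver mode and a numeric debug level, joined by a plus sign, which is the form the net package documentation describes (for example netdns=go+1). The sketch below shows how such a value decomposes. parseNetDNS is a hypothetical helper written for illustration, not the function the Go runtime uses.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNetDNS splits a netdns value such as "go+9" into the resolver mode
// ("go", "cgo", or "" when none is given) and the debug level (0 when none
// is given). A part that starts with a digit is treated as the debug level;
// any other part is treated as the mode.
func parseNetDNS(value string) (mode string, debug int) {
	for _, part := range strings.SplitN(value, "+", 2) {
		if part == "" {
			continue
		}
		if part[0] >= '0' && part[0] <= '9' {
			debug, _ = strconv.Atoi(part)
		} else {
			mode = part
		}
	}
	return mode, debug
}

func main() {
	for _, v := range []string{"go+9", "cgo+9", "go", "1"} {
		mode, debug := parseNetDNS(v)
		fmt.Printf("netdns=%s -> mode=%q debug=%d\n", v, mode, debug)
	}
}
```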
Then disable cgo when building filebeat:
CGO_ENABLED=0 go build --ldflags -w -o filebeat
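That fixes the problem once and for all.
I also added print statements at the point where Go calls into the C function (getaddrinfo). The arguments are identical in the working and failing cases, but the behavior inside the library differs from that on the older OS version, which points to a library compatibility problem.
Conclusion
- On alpine, it is best to build Go code with cgo disabled.
- In a k8s cluster, it is best to choose an image from the same distribution as the host OS.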