This article focuses on analyzing and resolving item 5.
sentinel://redis:26379,redis:26380?masterNames=mymaster&poolSize=100&poolName=xxx
nameserver 127.0.0.1
search aaa.bbb ostechnix.lan
(Here, aaa.bbb cannot be resolved.)
$ ping redis
PING redis.ostechnix.lan (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.026 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.057 ms
$ nslookup redis
Server: 127.0.0.1
Address: 127.0.0.1#53
Name: redis.ostechnix.lan
Address: 127.0.0.1
$ src/redis-cli -h redis -p 26379
redis:26379> sentinel sentinels mymaster
1) 1) "name"
2) "127.0.0.1:26380"
3) "ip"
4) "127.0.0.1"
5) "port"
6) "26380"
...
$ src/redis-cli -h redis -p 26380
1) 1) "name"
2) "127.0.0.1:26379"
3) "ip"
4) "127.0.0.1"
5) "port"
6) "26379"
The Java program uses redisson-3.8.2 to connect to Redis and fails with an error:
Exception in thread "main" org.redisson.client.RedisConnectionException: At least two sentinels should be defined in Redis configuration!
    at org.redisson.connection.SentinelConnectionManager.<init>(SentinelConnectionManager.java:159)
    at org.redisson.config.ConfigSupport.createConnectionManager(ConfigSupport.java:195)
    at org.redisson.Redisson.<init>(Redisson.java:122)
    at org.redisson.Redisson.create(Redisson.java:161)
...
Enable redisson's debug logging:
2018-11-21 18:11:53 [main] WARN o.r.c.SentinelConnectionManager - Can't connect to sentinel server. Unable to connect to: redis://redis:26379
...
2018-11-21 18:12:03 [main] WARN o.r.c.SentinelConnectionManager - Can't connect to sentinel server. Unable to connect to: redis://redis:26380
So why can redis-cli connect to the sentinel while the Java program fails? With the earlier redisson-2.5.1 the Java program worked fine.
Reading the debug log again reveals something odd:
2018-11-21 18:11:48 [main] DEBUG i.netty.resolver.dns.DnsQueryContext - [id: 0xda4ab0e7] WRITE: [49889: /127.0.0.1:53], DefaultDnsQuestion(redis.aaa.bbb ostechnix.lan. IN A)
2018-11-21 18:11:48 [main] DEBUG i.netty.resolver.dns.DnsQueryContext - [id: 0xda4ab0e7] WRITE: [34575: /127.0.0.1:53], DefaultDnsQuestion(redis.aaa.bbb ostechnix.lan. IN AAAA)
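To explain:
IN A: maps a host name to an IPv4 address
IN AAAA: maps a host name to an IPv6 address
The trailing dot in "ostechnix.lan." denotes the root; "lan." means lan is a top-level domain directly under the root
Now look at the name "redis.aaa.bbb ostechnix.lan.". It is odd: why is there a space in the middle of a domain name?
Going back to /etc/resolv.conf, we find: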
search aaa.bbb ostechnix.lan
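To explain:
search: entries are separated by spaces or tabs. When a name does not end with a dot, each entry is appended in turn to form a fully qualified domain name before the DNS query is sent.
Clearly, the search line was not split on whitespace when it was parsed, so the name in the DNS query is wrong.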
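The expansion rule above can be sketched as follows. This is a minimal illustration, not resolver code: the class SearchDomainExpansion and the method candidates are hypothetical names, and the sketch ignores the ndots option and the final attempt with the bare name that real resolvers make.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of netty or redisson.
final class SearchDomainExpansion {

    // Builds the fully qualified names that would be queried for `host`
    // when the given search list is applied in order.
    static List<String> candidates(String host, List<String> searchDomains) {
        List<String> names = new ArrayList<>();
        if (host.endsWith(".")) {
            // Already fully qualified: the search list is not applied.
            names.add(host);
            return names;
        }
        for (String domain : searchDomains) {
            names.add(host + "." + domain + ".");
        }
        return names;
    }

    public static void main(String[] args) {
        // Search line split on whitespace: two valid names.
        System.out.println(candidates("redis", Arrays.asList("aaa.bbb", "ostechnix.lan")));
        // Whole search line kept as one entry: one name containing a space.
        System.out.println(candidates("redis", Arrays.asList("aaa.bbb ostechnix.lan")));
    }
}
```

The second call reproduces the malformed name seen in the debug log, "redis.aaa.bbb ostechnix.lan.".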
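Reading the redisson source shows that RedisClient's resolveAddr method resolves the address. If /etc/resolv.conf configures several DNS servers, a DnsNameResolver is created for each of them (this part belongs to the netty-4.1.30.Final source).
Here is the relevant DnsNameResolver source: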
static {
    String[] searchDomains;
    try {
        List<String> list = PlatformDependent.isWindows()
                ? getSearchDomainsHack()
                : UnixResolverDnsServerAddressStreamProvider.parseEtcResolverSearchDomains();
        searchDomains = list.toArray(new String[0]);
    } catch (Exception ignore) {
        // Failed to get the system name search domain list.
        searchDomains = EmptyArrays.EMPTY_STRINGS;
    }
    DEFAULT_SEARCH_DOMAINS = searchDomains;
    ...
}
searchDomains is parsed by UnixResolverDnsServerAddressStreamProvider.parseEtcResolverSearchDomains().
Here is that method's source:
static List<String> parseEtcResolverSearchDomains(File etcResolvConf) throws IOException {
    String localDomain = null;
    List<String> searchDomains = new ArrayList<String>();
    FileReader fr = new FileReader(etcResolvConf);
    BufferedReader br = null;
    try {
        br = new BufferedReader(fr);
        String line;
        while ((line = br.readLine()) != null) {
            if (localDomain == null && line.startsWith(DOMAIN_ROW_LABEL)) {
                int i = indexOfNonWhiteSpace(line, DOMAIN_ROW_LABEL.length());
                if (i >= 0) {
                    localDomain = line.substring(i);
                }
            } else if (line.startsWith(SEARCH_ROW_LABEL)) {
                int i = indexOfNonWhiteSpace(line, SEARCH_ROW_LABEL.length());
                if (i >= 0) {
                    searchDomains.add(line.substring(i));
                }
            }
        }
    } finally {
        if (br == null) {
            fr.close();
        } else {
            br.close();
        }
    }
    ...
}
So netty has its own idea about the search directive: it assumes one entry per line, and therefore does not split each line on spaces or tabs:
searchDomains.add(line.substring(i));
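One fix is to copy the source of UnixResolverDnsServerAddressStreamProvider into the application, keep it under the same package path, and change parseEtcResolverSearchDomains so that it splits each search line into its entries.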
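The splitting step could look roughly like the sketch below. It is a standalone illustration rather than the patched netty class: ResolvConfSearchParser and parseSearchDomains are hypothetical names, the `domain` row is ignored, and entries from several search lines are accumulated, as in the netty code shown above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone parser for illustration; not the netty class itself.
final class ResolvConfSearchParser {

    // Collects every entry of every "search" line, splitting on spaces and tabs.
    static List<String> parseSearchDomains(Reader resolvConf) throws IOException {
        List<String> searchDomains = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(resolvConf)) {
            String line;
            while ((line = br.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.startsWith("search")) {
                    continue;
                }
                String rest = trimmed.substring("search".length());
                // The keyword must be followed by whitespace and at least one entry.
                if (rest.isEmpty() || !Character.isWhitespace(rest.charAt(0))) {
                    continue;
                }
                for (String domain : rest.trim().split("\\s+")) {
                    searchDomains.add(domain);
                }
            }
        }
        return searchDomains;
    }

    public static void main(String[] args) throws IOException {
        String conf = "nameserver 127.0.0.1\nsearch aaa.bbb ostechnix.lan\n";
        System.out.println(parseSearchDomains(new StringReader(conf)));
    }
}
```

With this approach a single-line search directive and a one-entry-per-line layout yield the same list of domains.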
Alternatively, edit /etc/resolv.conf so that each search entry is on its own line, for example:
nameserver 127.0.0.1
search aaa.bbb
search ostechnix.lan
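If you set up your own DNS server to reproduce the experiment above, it may still fail, reporting that it cannot connect to the sentinel.
In this experiment, after the change above, each DnsNameResolver gets two domains (aaa.bbb and ostechnix.lan).
Try changing the order of these two domains, for example: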
search ostechnix.lan aaa.bbb
or
search ostechnix.lan
search aaa.bbb
Run the experiment again and this time it passes.
Curious, isn't it?
Debug the resolve method of DnsResolveContext:
searchDomainPromise.addListener(new FutureListener<List<T>>() {
    private int searchDomainIdx = initialSearchDomainIdx;
    @Override
    public void operationComplete(Future<List<T>> future) {
        Throwable cause = future.cause();
        if (cause == null) {
            promise.trySuccess(future.getNow());
        } else {
            if (DnsNameResolver.isTransportOrTimeoutError(cause)) {
                promise.tryFailure(new SearchDomainUnknownHostException(cause, hostname));
            } else if (searchDomainIdx < searchDomains.length) {
                Promise<List<T>> newPromise = parent.executor().newPromise();
                newPromise.addListener(this);
                doSearchDomainQuery(hostname + '.' + searchDomains[searchDomainIdx++], newPromise);
            } else if (!startWithoutSearchDomain) {
                internalResolve(hostname, promise);
            } else {
                promise.tryFailure(new SearchDomainUnknownHostException(cause, hostname));
            }
        }
    }
});
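Incidentally, this method tries the search domains one at a time.
Pay particular attention to this line when debugging:
promise.tryFailure(new SearchDomainUnknownHostException(cause, hostname));
You will find that when the DNS query (redis.aaa.bbb) is sent to your own DNS server, it fails with a 5000 ms timeout.
You can try this on the command line: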
$ nslookup redis.aaa.bbb 127.0.0.1
Server: 127.0.0.1
Address: 127.0.0.1#53
** server can't find redis.aaa.bbb: SERVFAIL
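Ignore the error itself and measure how long the command takes: is it well past 5 s?
Combined with the code above: when a timeout occurs, the next domain is never tried, so with the order "[aaa.bbb, ostechnix.lan]" the program still fails.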
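The effect of the ordering can be reproduced with a small simulation. This is a simplified model of the listener shown earlier, not netty code: SearchDomainLoopSketch, Outcome and the fakeDns map are hypothetical, and the fallback to the bare host name (startWithoutSearchDomain) is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the search-domain loop; names are illustrative only.
final class SearchDomainLoopSketch {

    enum Outcome { RESOLVED, NOT_FOUND, TIMEOUT }

    // Returns the first name that resolves, or null if resolution fails.
    // A timeout aborts the whole lookup; only NOT_FOUND moves on to the next domain.
    static String resolve(String host, List<String> searchDomains, Map<String, Outcome> fakeDns) {
        for (String domain : searchDomains) {
            String name = host + "." + domain;
            Outcome outcome = fakeDns.getOrDefault(name, Outcome.NOT_FOUND);
            if (outcome == Outcome.RESOLVED) {
                return name;
            }
            if (outcome == Outcome.TIMEOUT) {
                return null; // mirrors the isTransportOrTimeoutError branch
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Outcome> fakeDns = new HashMap<>();
        fakeDns.put("redis.aaa.bbb", Outcome.TIMEOUT);
        fakeDns.put("redis.ostechnix.lan", Outcome.RESOLVED);

        // Timeout on the first domain: the second one is never tried.
        System.out.println(resolve("redis", Arrays.asList("aaa.bbb", "ostechnix.lan"), fakeDns));
        // Resolvable domain first: the lookup succeeds.
        System.out.println(resolve("redis", Arrays.asList("ostechnix.lan", "aaa.bbb"), fakeDns));
    }
}
```

In this model the first call yields null and the second yields redis.ostechnix.lan, which matches what the experiment shows when the two search domains are swapped.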
So where does this 5000 ms limit come from?
Debugging the source again shows that DnsNameResolver is constructed by DnsNameResolverBuilder:
public final class DnsNameResolverBuilder {
    ...
    private long queryTimeoutMillis = 5000;
    ...
}
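Copy the source into the application directory, raise this number so it exceeds the time your DNS server takes to answer, and the problem is solved.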