Troubleshooting peer connection timeouts on an Ethereum mainnet node

At 8 p.m. on April 1, 2020, Zabbix raised an alert: no block data had been synced on the Ethereum mainnet node for three minutes.

I logged in to the server right away and checked the node's sync status:

# docker logs -f public-eth --tail 10

INFO [04-01|20:17:37.735] Deep froze chain segment                 blocks=44   elapsed=345.845ms number=9695993      hash=3833c6…28c205
WARN [04-01|20:18:10.932] Synchronisation failed, dropping peer    peer=63cbc31e7027052b err=timeout
WARN [04-01|20:25:26.171] Synchronisation failed, dropping peer    peer=4d27a5ef8b885210 err=timeout
WARN [04-01|20:26:21.815] Synchronisation failed, dropping peer    peer=6a224bc2c8c3b02c err=timeout

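The log showed that the node was already about 50 blocks behind the latest block height, because syncing from the connected peers kept timing out.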
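
To see at a glance which failure reason dominates, the warnings can be tallied by their err= value. The following is a minimal sketch, assuming the log format shown in the excerpts here; the function name `tally_drop_reasons` is made up for illustration and is not part of geth.

```python
import re
from collections import Counter

# Matches geth warnings such as:
# WARN [04-01|20:18:10.932] Synchronisation failed, dropping peer    peer=63cbc31e7027052b err=timeout
DROP_RE = re.compile(
    r'^WARN \[(?P<ts>[^\]]+)\] Synchronisation failed, dropping peer\s+'
    r'peer=(?P<peer>[0-9a-f]+)\s+err=(?P<err>"[^"]*"|\S+)'
)


def tally_drop_reasons(lines):
    """Count 'Synchronisation failed, dropping peer' warnings by err= value."""
    reasons = Counter()
    for line in lines:
        match = DROP_RE.match(line.strip())
        if match:
            # Quoted reasons ("retrieved hash chain is invalid") lose their quotes.
            reasons[match.group("err").strip('"')] += 1
    return reasons
```

Feeding it the saved output of docker logs gives one count per reason. Other warnings, such as "Checkpoint challenge timed out", are deliberately ignored by this pattern.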

So I opened the geth console to look at the connected peers:

> admin.peers  // list the connected peers

{
    caps: ["eth/63", "eth/64", "eth/65", "les/2", "les/3", "shh/6"],
    enode: "enode://0e1806acd33408d618070c3e0a33e692af2a641493701a65f2f7c9d7f9de076a7d6087e4228f711428aa654155a41f2e4850491df2ab213703eb6ae13f10fa32@18.218.89.155:30303",
    id: "2bd5a38260608099ba57f4236645718e4fdbfb5df9566ae4059e2b38d4321dfa",
    name: "Geth/bluebear/v1.9.12-stable-b6f1c8dc/linux-amd64/go1.13.8",
    network: {
      inbound: false,
      localAddress: "172.17.0.2:36444",
      remoteAddress: "18.218.89.155:30303",
      static: false,
      trusted: false
    },
    protocols: {
      eth: {
        difficulty: 1.4790370433997981e+22,
        head: "0x4afbea3912db5ab6e0c5042498c2020dc0558f349b6505e4d5d23de087a0c0ae",
        version: 64
      }
    }
}
......

> net.peerCount   // number of connected peers
7

I tested connectivity with telnet against the IPs and ports returned for the peers, and checked the peer count. Neither revealed a problem.
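
The telnet test can be scripted when there are several peers to check. This is a minimal sketch, assuming the "host:port" form of the remoteAddress field from admin.peers (IPv4 or hostname); the helper names are made up for illustration.

```python
import socket


def split_host_port(address):
    """Split 'host:port' (as in admin.peers remoteAddress) into (host, port)."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def can_connect(address, timeout=3.0):
    """Return True if a TCP connection to 'host:port' succeeds within timeout."""
    host, port = split_host_port(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, can_connect("18.218.89.155:30303") checks the peer shown above. A successful connection only proves that the port is reachable; it says nothing about whether the peer can actually serve blocks.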

Suspecting a network problem on the server, I logged in to the standby Ethereum mainnet node:

# docker logs -f public-eth --tail 1000
INFO [03-29|09:19:19.278] Regenerated local transaction journal    transactions=4  accounts=1
WARN [03-29|09:25:23.724] Synchronisation failed, dropping peer    peer=4b9ad7c0ab94dc10 err="retrieved hash chain is invalid"
WARN [03-29|09:38:02.704] Checkpoint challenge timed out, dropping id=e8191234978143d6 conn=dyndial addr=3.1.27.148:30303      type=Geth/source/linux/go1.10.4
WARN [03-29|09:40:02.373] Synchronisation failed, dropping peer    peer=0be2160d6093d7a3 err=timeout
WARN [03-29|09:42:43.915] Checkpoint challenge timed out, dropping id=b7fe283b8834fc21 conn=dyndial addr=188.35.22.31:30307    type=Geth/source/linux/go1.11.2
WARN [03-29|09:48:31.330] Synchronisation failed, dropping peer    peer=209fadad4df29ee7 err=timeout
WARN [03-29|09:50:49.572] Synchronisation failed, dropping peer    peer=a46a49badbd2205b err=timeout
WARN [03-29|09:51:25.420] Synchronisation failed, dropping peer    peer=a41421cb9772370b err=timeout

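It turned out that the standby node had stopped syncing long before (because it is a standby node, no Zabbix alert had been configured for it).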

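On the standby node I used the same methods to check the connected peers and the peer count, and found nothing abnormal. I then queried the block height with the following commands and found that the node was frighteningly far behind: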

> eth.syncing
{
  currentBlock: 9739351,
  highestBlock: 9786347,
  knownStates: 399147124,
  pulledStates: 399147124,
  startingBlock: 9725124
}
> eth.blockNumber
9739410
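
The size of the gap follows directly from the eth.syncing fields. A minimal sketch, assuming the values are plain integers as printed in the console (over JSON-RPC they arrive as hex strings and would need converting first); `blocks_behind` is a made-up helper name.

```python
def blocks_behind(syncing):
    """Return how many blocks the node still has to import.

    `syncing` is the value of eth.syncing: a dict while a sync is in
    progress, or False once the node considers itself caught up.
    """
    if not syncing:
        return 0
    return syncing["highestBlock"] - syncing["currentBlock"]


# Values copied from the eth.syncing output of the standby node.
status = {
    "currentBlock": 9739351,
    "highestBlock": 9786347,
    "knownStates": 399147124,
    "pulledStates": 399147124,
    "startingBlock": 9725124,
}
print(blocks_behind(status))  # 46996
```

Note that eth.syncing returns false both when the node is fully synced and when it has no sync in progress, so a result of 0 here is not proof that the node is at the chain head.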

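Nothing else looked abnormal on the standby node, so I restarted the service (only because it was a standby node; in production, restarting or stopping a service must be done with great caution).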

# docker restart public-eth ;docker logs -f public-eth --tail 10
public-eth
......
INFO [04-01|20:34:16.546] Setting new local account                address=0x52df2CE99891c31314c9f2f97dE1eBf401806571
INFO [04-01|20:34:16.546] Loaded local transaction journal         transactions=11 dropped=0
INFO [04-01|20:34:16.546] Regenerated local transaction journal    transactions=11 accounts=1
WARN [04-01|20:34:16.546] Switch sync mode from fast sync to full sync 
INFO [04-01|20:34:16.743] New local node record                    seq=31 id=4b66feada6a11d9d ip=127.0.0.1 udp=30303 tcp=30303
INFO [04-01|20:34:16.744] Started P2P networking                   self=enode://f82c0b9f10906785fed6e1ee2f86c165da6dc43155f0799bb63e713d677786b2d4b56124389b4e32f7c268f79877a87eed9ffd47f9e25d52976108aee39810a9@127.0.0.1:30303
INFO [04-01|20:34:16.747] IPC endpoint opened                      url=/root/.ethereum/geth.ipc
INFO [04-01|20:34:16.748] HTTP endpoint opened                     url=http://0.0.0.0:8545      cors= vhosts=localhost
INFO [04-01|20:34:26.744] Block synchronisation started 
INFO [04-01|20:34:31.518] New local node record                    seq=32 id=4b66feada6a11d9d ip=47.244.12.0 udp=58724 tcp=30303
INFO [04-01|20:34:36.316] Importing heavy sidechain segment        blocks=2048 start=9722054 end=9724101
INFO [04-01|20:34:48.324] Imported new chain segment               blocks=1    txs=86 mgas=9.983 elapsed=12.006s mgasps=0.832 number=9722054 hash=695f2b…b44224 age=1w2d21h dirty=1.08MiB
INFO [04-01|20:34:56.746] Imported new chain segment               blocks=9    txs=1378 mgas=87.184 elapsed=8.421s  mgasps=10.353 number=9722063 hash=75ea98…258dcd age=1w2d21h dirty=12.45MiB
INFO [04-01|20:35:05.076] Imported new chain segment               blocks=17   txs=2583 mgas=157.750 elapsed=8.329s  mgasps=18.938 number=9722080 hash=c7ff1b…6e622b age=1w2d21h dirty=31.53MiB
INFO [04-01|20:35:13.222] Imported new chain segment               blocks=20   txs=1967 mgas=151.433 elapsed=8.146s  mgasps=18.590 number=9722100 hash=19fbd6…b3d86b age=1w2d21h dirty=49.69MiB

After the restart, the standby node began syncing data again and started to recover.

When the problem first appeared, I also searched Baidu for the message: Synchronisation failed, dropping peer

The answer I found online was:

If the log stays stuck at this point, geth has not connected to any valid peers. Running the following command in the console shows a peer count of 0:

> net.peerCount
0

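For this warning, simply wait. If there is no response for a long time, restart the node so that it looks for new peers. You can also add peers manually. The nodes provided by 星火计划 (Spark Plan) are listed below, and you can try adding them: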

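My guess was that the restart had refreshed the list of connected peers, and that the previously connected peers were the problem: they caused the timeouts and left the data unsynced.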

So I thought of the following fix.

When the peer timeout error appears, you can add the bootnodes listed in the Ethereum source code to help the node find usable peers.

The bootnodes are:

var MainnetBootnodes = []string{
	// Ethereum Foundation Go Bootnodes
	"enode://d860a01f9722d78051619d1e2351aba3f43f943f6f00718d1b9baa4101932a1f5011f16bb2b1bb35db20d6fe28fa0bf09636d26a87d31de9ec6203eeedb1f666@18.138.108.67:30303",   // bootnode-aws-ap-southeast-1-001
	"enode://22a8232c3abc76a16ae9d6c3b164f98775fe226f0917b0ca871128a74a8e9630b458460865bab457221f1d448dd9791d24c4e5d88786180ac185df813a68d4de@3.209.45.79:30303",     // bootnode-aws-us-east-1-001
	"enode://ca6de62fce278f96aea6ec5a2daadb877e51651247cb96ee310a318def462913b653963c155a0ef6c7d50048bba6e6cea881130857413d9f50a621546b590758@34.255.23.113:30303",   // bootnode-aws-eu-west-1-001
	"enode://279944d8dcd428dffaa7436f25ca0ca43ae19e7bcf94a8fb7d1641651f92d121e972ac2e8f381414b80cc8e5555811c2ec6e1a99bb009b3f53c4c69923e11bd8@35.158.244.151:30303",  // bootnode-aws-eu-central-1-001
	"enode://8499da03c47d637b20eee24eec3c356c9a2e6148d6fe25ca195c7949ab8ec2c03e3556126b0d7ed644675e78c4318b08691b7b57de10e5f0d40d05b09238fa0a@52.187.207.27:30303",   // bootnode-azure-australiaeast-001
	"enode://103858bdb88756c71f15e9b5e09b56dc1be52f0a5021d46301dbbfb7e130029cc9d0d6f73f693bc29b665770fff7da4d34f3c6379fe12721b5d7a0bcb5ca1fc1@191.234.162.198:30303", // bootnode-azure-brazilsouth-001
	"enode://715171f50508aba88aecd1250af392a45a330af91d7b90701c436b618c86aaa1589c9184561907bebbb56439b8f8787bc01f49a7c77276c58c1b09822d75e8e8@52.231.165.108:30303",  // bootnode-azure-koreasouth-001
	"enode://5d6d7cd20d6da4bb83a1d28cadb5d409b64edf314c0335df658c1a54e32c7c4a7ab7823d57c39b6a757556e68ff1df17c748b698544a55cb488b52479a92b60f@104.42.217.25:30303",   // bootnode-azure-westus-001
}

Connect to them like this (examples):

admin.addPeer("enode://d860a01f9722d78051619d1e2351aba3f43f943f6f00718d1b9baa4101932a1f5011f16bb2b1bb35db20d6fe28fa0bf09636d26a87d31de9ec6203eeedb1f666@18.138.108.67:30303")

admin.addPeer("enode://22a8232c3abc76a16ae9d6c3b164f98775fe226f0917b0ca871128a74a8e9630b458460865bab457221f1d448dd9791d24c4e5d88786180ac185df813a68d4de@3.209.45.79:30303")

admin.addPeer("enode://ca6de62fce278f96aea6ec5a2daadb877e51651247cb96ee310a318def462913b653963c155a0ef6c7d50048bba6e6cea881130857413d9f50a621546b590758@34.255.23.113:30303")

admin.addPeer("enode://5d6d7cd20d6da4bb83a1d28cadb5d409b64edf314c0335df658c1a54e32c7c4a7ab7823d57c39b6a757556e68ff1df17c748b698544a55cb488b52479a92b60f@104.42.217.25:30303")

admin.addPeer("enode://715171f50508aba88aecd1250af392a45a330af91d7b90701c436b618c86aaa1589c9184561907bebbb56439b8f8787bc01f49a7c77276c58c1b09822d75e8e8@52.231.165.108:30303")

admin.addPeer("enode://a0c97b58f2d3ea039cf09d8d9255c6d635f605f8702fafaafeda173ad0ae81c717c9f0ec155be615868c5eb027f8e28b2e9cab31ddddbb1b3e664c13eccb649a@159.69.56.17:30303?discport=1062")
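
Enode URLs are long and easy to truncate when pasted, so it can help to validate them before building the console command. This is a minimal sketch, assuming the usual enode form (a 128-hex-character node ID, then host and TCP port, with an optional discport query); `parse_enode` and `add_peer_command` are made-up helper names, not geth functions.

```python
from urllib.parse import urlsplit

HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_enode(url):
    """Split an enode URL into (node_id, host, tcp_port).

    Raises ValueError if the URL is malformed.
    """
    parts = urlsplit(url)
    if parts.scheme != "enode":
        raise ValueError("not an enode URL: %r" % url)
    node_id = parts.username or ""
    if len(node_id) != 128 or not set(node_id) <= HEX_DIGITS:
        raise ValueError("node ID must be 128 hex characters")
    if parts.hostname is None or parts.port is None:
        raise ValueError("missing host or port")
    return node_id, parts.hostname, parts.port


def add_peer_command(url):
    """Return the geth console command for a validated enode URL."""
    parse_enode(url)  # raises ValueError on a malformed URL
    return 'admin.addPeer("%s")' % url
```

Looping add_peer_command over the bootnode list produces one command per line to paste into the console, with the quotes and parentheses always balanced.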

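Later I found that the production node had recovered on its own and synced to the latest block, so I suspected the network also played some part.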

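But within three minutes the production node started reporting peer timeouts again and fell out of sync. When I checked the standby node at that point, it showed no abnormal messages.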

Because the production and standby servers for the Ethereum mainnet node are both in Hong Kong, a network problem could be ruled out.

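Looking at the production node's connected peers again, I noticed one running version 1.9.3. Geth 1.9.9 is the release for Ethereum's Muir Glacier hard fork, so a peer running anything older cannot have followed the chain past the fork block (9,200,000). It is a bad peer.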

> admin.peers

{
    caps: ["eth/63"],
    enode: "enode://945b152c088c887d06f02eafba2d88559fbfc460510ca98c4d919312cf2cf4fa6c4307fe02bff4cde36a236948448f24620c0f67fce3444e444c0dee302a81e6@209.250.230.142:30303",
    id: "3b95c0d3e14e5021e214c81a83c346eb997ec194fb1bd2e7a79b23d0d92a6198",
    name: "Geth/v1.9.3-stable-cfbb969d/linux-amd64/go1.11.5",
    network: {
      inbound: false,
      localAddress: "172.17.0.2:37444",
      remoteAddress: "209.250.230.142:30303",
      static: false,
      trusted: false
    },
    protocols: {
      eth: {
        difficulty: 1.3520836652877566e+22,
        head: "0xab865ff4c05af71a415426ce40bf7656194bcbea3ed0b0b2f0c64767b30f2982",
        version: 63
      }
    }
}

So I removed it with this command:

> admin.removePeer("enode://945b152c088c887d06f02eafba2d88559fbfc460510ca98c4d919312cf2cf4fa6c4307fe02bff4cde36a236948448f24620c0f67fce3444e444c0dee302a81e6@209.250.230.142:30303")
true

After the removal, the production node started syncing normally.
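
Spotting the outdated peer by eye gets tedious with many connections, so the version check can be scripted. This is a minimal sketch, assuming the admin.peers output has been loaded as a list of dicts with the name and enode fields shown above; the helper names are made up for illustration. It only judges Geth clients, because other clients use their own version numbering, and the name string is self-reported by the peer.

```python
import re

# First Geth release with Muir Glacier support, as stated in the text.
MUIR_GLACIER_GETH = (1, 9, 9)


def geth_version(name):
    """Extract (major, minor, patch) from a Geth client name, else None.

    Handles both 'Geth/v1.9.3-stable-.../linux-amd64/go1.11.5' and
    'Geth/bluebear/v1.9.12-stable-.../linux-amd64/go1.13.8'.
    """
    if not name.startswith("Geth/"):
        return None
    match = re.search(r"/v(\d+)\.(\d+)\.(\d+)", name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def outdated_peers(peers, minimum=MUIR_GLACIER_GETH):
    """Return the enode URLs of Geth peers older than `minimum`."""
    stale = []
    for peer in peers:
        version = geth_version(peer.get("name", ""))
        # Peers whose version cannot be parsed are left alone.
        if version is not None and version < minimum:
            stale.append(peer["enode"])
    return stale
```

Each URL returned can then be passed to admin.removePeer in the console. Comparing versions as integer tuples matters here: as strings, "1.9.12" would sort before "1.9.9".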

Conclusion: when a production Ethereum node stops syncing and logs an error like this:

Synchronisation failed, dropping peer    peer=6a224bc2c8c3b02c err=timeout

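one possible cause is that the node has connected to a peer running an outdated version (a release that predates the Ethereum hard fork). In that case, remove the offending peer with the following command: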

admin.removePeer("enode://945b152c088c887d06f02eafba2d88559fbfc460510ca98c4d919312cf2cf4fa6c4307fe02bff4cde36a236948448f24620c0f67fce3444e444c0dee302a81e6@209.250.230.142:30303")
  • Add a peer
admin.addPeer("enode://3af83ae28fc90838c334369ed2bf8071065062b851e5845e5eb07bd2efc5ba68f9d77865bea3ea09d3cc866bded716c258b0bca002696a69463fba7fdefb51df@128.230.208.74:30303")
true
  • Add a trusted peer
admin.addTrustedPeer("enode://3af83ae28fc90838c334369ed2bf8071065062b851e5845e5eb07bd2efc5ba68f9d77865bea3ea09d3cc866bded716c258b0bca002696a69463fba7fdefb51df@128.230.208.74:30303")
true
