TiDB Startup Problem Notes

Case 1

[Problem]

Error during TiDB cluster startup:

[FATAL] [main.go:111] ["run server failed"] [error="listen tcp 192.xxx.73.101:2380: bind: cannot assign requested address"]


[Cause Analysis]

Network problem: "cannot assign requested address" on bind means the address the process tries to listen on is not configured on this host.

[Solution]

1. Use the ping command to check whether the IP address is reachable.

2. Use the telnet command to test whether the port is reachable.

3. Avoid mixing internal and external IP addresses within a TiDB cluster.
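Steps 1 and 2 above can be scripted. A minimal sketch, assuming bash (for the /dev/tcp pseudo-device) and coreutils' timeout; the host and port below are placeholders, not taken from the log:

```shell
#!/usr/bin/env bash
# Minimal TCP reachability probe for a cluster address (host/port are examples).
check_port() {
  local host=$1 port=$2
  # bash's /dev/tcp pseudo-device opens a TCP connection; nc -z or telnet
  # perform the same probe interactively.
  if timeout 2 bash -c "echo > /dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed"
  fi
}

check_port 127.0.0.1 1   # port 1 is almost never listening
```

If ping succeeds but the probe reports closed, the problem is usually a firewall or a service bound to a different (internal vs. external) address.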

[Reference Cases]

PD fails to start on its port

https://asktug.com/t/pd/638

[Further Reading]

The ping command

https://blog.csdn.net/hebbely/article/details/54965989

The telnet command

https://blog.csdn.net/swazer_z/article/details/64442730

Case 2

[Problem]

Error during PD startup:

[PANIC] [server.go:446] ["failed to recover v3 backend from snapshot"] [error="failed to find database snapshot file (snap: snapshot file doesn't exist)"]

[Cause Analysis]

A server power failure caused data loss at the operating system level.

[Solution]

1. After a power failure the data directory may be remounted read-only; ask the operations team to restore the read-only files at the OS level.

2. If a PD node of the TiDB cluster cannot start, it is recommended to recover it with the pd-recover tool:

https://pingcap.com/docs-cn/stable/reference/tools/pd-recover/#pd-recover-%25E4%25BD%25BF%25E7%2594%25A8%25E6%2596%2587%25E6%25A1%25A3
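A sketch of a pd-recover invocation following the linked documentation; every value below is a placeholder (the real cluster ID comes from the PD or TiDB logs, and the alloc ID must exceed the largest already-allocated ID). The command is only echoed for review, not executed:

```shell
# Placeholders — substitute your own values before running.
PD_ENDPOINT="http://127.0.0.1:2379"   # a surviving or newly deployed PD endpoint
CLUSTER_ID="6590981544489428069"      # hypothetical; grep "cluster id" in pd.log
ALLOC_ID="100000000"                  # hypothetical; must exceed any allocated ID

# Build the recovery command and print it for review before running it.
CMD="pd-recover -endpoints ${PD_ENDPOINT} -cluster-id ${CLUSTER_ID} -alloc-id ${ALLOC_ID}"
echo "${CMD}"
```

After pd-recover reports success, restart the PD nodes so they rejoin with the recovered metadata.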

[Reference Cases]

After a power outage the TiDB cluster was restarted and PD failed on startup: 2 of the 3 PD nodes reported errors

https://asktug.com/t/tidb-pd-3-pd/1369

[Further Reading]

EXT4 filesystem study (part 5): power-loss corruption, mount failure on reboot, and repair. For reference only, not a standard procedure; a failed fsck may cause further data loss


https://blog.csdn.net/TSZ0000/article/details/84664865

Case 3

[Problem]

Error during TiKV startup:

ERRO tikv-server.rs:155: failed to create kv engine: "Corruption: Sst file size mismatch: /data/tidb/deploy/data/db/67704904.sst. Size recorded in manifest 325143, actual size 0"]

[Cause Analysis]

The server was rebooted before the data had been synced to disk.

[Solution]

Take the node offline and scale the cluster out again; see the scale-out/scale-in procedure:

https://pingcap.com/docs-cn/stable/how-to/scale/with-ansible/

The issue will be fixed in a newer release:

https://github.com/tikv/tikv/pull/4807
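The offline-then-rescale flow can be sketched as follows for a tidb-ansible deployment; the PD address, store ID, and node IP are placeholders, and the commands are echoed for review rather than executed:

```shell
# Hypothetical values — replace with your PD address and the damaged store's ID.
PD="http://127.0.0.1:2379"
STORE_ID=4

# 1. Take the damaged TiKV store offline via pd-ctl (regions migrate away first).
echo "pd-ctl -u ${PD} -d store delete ${STORE_ID}"
# 2. Wait until the store's state becomes Tombstone:
echo "pd-ctl -u ${PD} -d store ${STORE_ID}"
# 3. Re-add the node with tidb-ansible (edit inventory.ini first):
echo "ansible-playbook bootstrap.yml -l <new-node-ip>"
echo "ansible-playbook deploy.yml -l <new-node-ip>"
echo "ansible-playbook start.yml -l <new-node-ip>"
```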

[Reference Cases]

TiKV node fails to start

https://asktug.com/t/tikv/1375

[Further Reading]

RocksDB - MANIFEST

https://www.jianshu.com/p/d1b38ce0d966

Case 4

[Problem]

Error during TiKV startup:

2019/04/30 18:11:27.625 ERRO panic_hook.rs:104: thread 'raftstore-11' panicked '[region 125868]323807 to_commit 181879 is out of range [last_index 181878]' at "/home/jenkins/.cargo/git/checkouts/raft-rs-841f8a6db665c5c0/b10d74c/src/raft_log.rs:248" stack backtrace:

[Cause Analysis]

"to_commit out of range" means this peer is trying to commit a log entry that does not exist, indicating that the most recent raft log was lost, either through a deliberate operation or an abnormal event.

[Solution]

1. Locate the corrupted region with the tikv-ctl tool, specifying the db directory (the data directory of the damaged TiKV node).

2. Repair the data with tikv-ctl.

2.1 If the repair fails as follows:

set_region_tombstone: StringError("The peer is still in target peers")

then tombstoning the region with tikv-ctl requires examining the region peers on the damaged node and cleaning them up manually: remove the abnormal peer.

2.2 Run the tikv-ctl repair again.
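Steps 1 and 2 can be sketched with tikv-ctl in local mode (run on the stopped TiKV node). The data directory and PD address are placeholders; the region ID is taken from the panic message in this case, and the commands are echoed for review rather than executed:

```shell
# Placeholders — replace with the damaged node's data dir and your PD address.
DB="/data/tidb/deploy/data/db"
PD="http://127.0.0.1:2379"
REGION_ID=125868   # from the panic message above

# 1. Locate corrupted regions on the stopped TiKV node (local mode needs --db).
echo "tikv-ctl --db ${DB} bad-regions"
# 2. Mark the damaged region as tombstone so the node can start again.
echo "tikv-ctl --db ${DB} tombstone -r ${REGION_ID} -p ${PD}"
```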

[Reference Cases]

What causes the TiKV error "ERRO panic_hook.rs:104"?

https://asktug.com/t/tikv-erro-panic-hook-rs-104/165

After a TiKV node went down, it fails to restart with "[region 32] 33 to_commit 405937 is out of range [last_index 405933]"

https://asktug.com/t/tikv-region-32-33-to-commit-405937-is-out-of-range-last-index-405933/1922

[Further Reading]

Raft log replication

https://www.jianshu.com/p/b28e73eefa88

Case 5

[Problem]

Error during PD startup:

FAILED - RETRYING: wait until the PD health page is available (12 retries left)

[Cause Analysis]

Abnormal IP address configuration.

[Solution]

1. Check whether mixed internal/external IPs are making some addresses unreachable.

2. If the failure was caused by changing a PD node's IP address, handle the PD node by scaling in and out.
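To narrow down which address is unhealthy, the PD HTTP API can be probed directly. The endpoint paths below are assumptions based on PD's documented HTTP API, the address is a placeholder, and the commands are echoed for review rather than executed:

```shell
# Hypothetical PD address — replace with each PD node's client URL in turn.
PD="http://127.0.0.1:2379"

# Query the health endpoint that the ansible task polls; a node reachable
# only via the wrong (internal vs. external) IP will time out here.
echo "curl -s ${PD}/pd/health"
# Cross-check the member list for stale addresses after an IP change:
echo "curl -s ${PD}/pd/api/v1/members"
```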

[Reference Cases]

How to update the cluster after a node's IP address changes

https://asktug.com/t/ip/1106

The TiDB cluster fails to start

https://asktug.com/t/tidb/1563

[Further Reading]

TiDB Best Practices (2): PD Scheduling Best Practices

https://pingcap.com/blog-cn/best-practice-pd/

Case 6

[Problem]

TiDB fails to start, and tidb_stderr.log reports:

fatal error: runtime: out of memory

[Cause Analysis]

The system had been set to echo 2 > /proc/sys/vm/overcommit_memory, which disables memory overcommit.

[Solution]

Restore the default with echo 0 > /proc/sys/vm/overcommit_memory
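The policy can be inspected and reverted as follows; this is a sketch, and the sysctl line requires root, so it is only echoed here:

```shell
# vm.overcommit_memory: 0 = heuristic overcommit (default), 1 = always allow,
# 2 = strict accounting, which can make Go's runtime fail to reserve memory.
cat /proc/sys/vm/overcommit_memory 2>/dev/null || echo "not Linux"

# The fix (requires root) — echoed for review rather than executed:
echo "sysctl -w vm.overcommit_memory=0"
# Persist it across reboots by adding "vm.overcommit_memory = 0" to /etc/sysctl.conf.
```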

[Reference Cases]

TiDB cannot start after being taken offline automatically following a change to the memory overcommit policy

https://asktug.com/t/tidb/1716

[Further Reading]

The overcommit_memory setting on Linux

https://blog.csdn.net/houjixin/article/details/46412557

Case 7

[Problem]

Error during TiDB cluster startup:

Ansible FAILED! => playbook: start.yml; TASK: Check grafana API Key list; message: {"changed": false, "connection": "close", "content": "{\"message\":\"Invalid username or password\"}", "content_length": "42", "content_type": "application/json; charset=UTF-8", "date": "Wed, 25 Dec 2019 02:22:44 GMT", "json": {"message": "Invalid username or password"}, "msg": "Status code was 401 and not [200]: HTTP Error 401: Unauthorized", "redirected": false, "status": 401, "url": "http://192.168.179.112:3000/api/auth/keys"}

[Cause Analysis]

The Grafana password had been changed.

[Solution]

Update the username and password configured in inventory.ini to the new values.
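The relevant inventory.ini variables are shown below; the variable names follow tidb-ansible's inventory conventions, and the values are examples. They must match the credentials actually set in the Grafana UI:

```ini
# inventory.ini — keep these in sync with the password set in Grafana
grafana_admin_user = "admin"
grafana_admin_password = "new_password_here"   # example value
```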

[Reference Cases]

Error when starting the TiDB cluster

https://asktug.com/t/topic/2253

[Further Reading]

A comprehensive look at Grafana

https://www.jianshu.com/p/7e7e0d06709b

Case 8

[Problem]

During TiDB cluster startup, the TiDB log reports:

[error="[global:3]critical error write binlog failed, the last error no avaliable pump to write binlog"]

[Cause Analysis]

Caused by the Pump and Drainer components: Pump refuses binlog writes while it cannot notify a registered Drainer.

[Solution]

The Pump error was: fail to notify all living drainer: notify drainer. After starting the Drainer and then taking it offline cleanly, start.yml succeeded.
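The Pump/Drainer registration state can be inspected with binlogctl before rerunning start.yml. The PD address and the drainer node ID are placeholders, and the commands are echoed for review rather than executed:

```shell
# Hypothetical PD address — replace with your own.
PD_URLS="http://127.0.0.1:2379"

# Inspect Pump/Drainer state as registered in PD; a drainer stuck in a
# non-offline state is what blocks Pump here.
echo "binlogctl -pd-urls=${PD_URLS} -cmd pumps"
echo "binlogctl -pd-urls=${PD_URLS} -cmd drainers"
# After the drainer has been started and drained cleanly, take it offline:
echo "binlogctl -pd-urls=${PD_URLS} -cmd offline-drainer -node-id <drainer-node-id>"
```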

[Reference Cases]

The tidb-server process is already up, but the "wait until the TiDB port is up" task fails

https://asktug.com/t/topic/2606

[Further Reading]

TiDB Binlog Overview

https://pingcap.com/docs-cn/stable/reference/tidb-binlog/overview/
