Case 1
[Problem]
Error reported while starting the TiDB cluster:
[FATAL] [main.go:111] ["run server failed"] [error="listen tcp 192.xxx.73.101:2380: bind: cannot assign requested address"]
[Root Cause]
Network configuration problem: "cannot assign requested address" means the IP address PD is configured to listen on is not bound to any local network interface.
[Solution]
1. Use the ping command to check whether the IP is reachable.
2. Use the telnet command to test whether the port is reachable.
3. Avoid mixing internal and external network IPs within a TiDB cluster. (A check sketch follows this list.)
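A minimal connectivity check sketch, using the address and port from the error above (192.xxx.73.101 is the masked placeholder from the log; substitute the real PD host):
# on the PD host, confirm the configured IP is actually bound to a local interface
ip addr | grep "192.xxx.73.101"
# from another node, check basic reachability
ping -c 3 192.xxx.73.101
# check whether the PD peer port accepts connections
telnet 192.xxx.73.101 2380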
[Reference Cases]
PD port fails to start
https://asktug.com/t/pd/638
[Further Reading]
The ping command
https://blog.csdn.net/hebbely/article/details/54965989
The telnet command
https://blog.csdn.net/swazer_z/article/details/64442730
Case 2
[Problem]
Error reported while starting PD:
[PANIC] [server.go:446] ["failed to recover v3 backend from snapshot"]
[error="failed to find database snapshot file (snap: snapshot file doesn't exist)"]
[Root Cause]
The server lost power, causing data loss at the operating-system level.
[Solution]
1. After a power loss the file system may be remounted read-only; ask the operations team to recover the read-only files at the operating-system level.
2. If a PD node in the TiDB cluster cannot start, use the pd-recover tool to rebuild it (a usage sketch follows the link below).
https://pingcap.com/docs-cn/stable/reference/tools/pd-recover/#pd-recover-%E4%BD%BF%E7%94%A8%E6%96%87%E6%A1%A3
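A minimal pd-recover sketch, assuming at least one PD endpoint is reachable; the endpoint, log path, cluster ID, and alloc ID below are hypothetical placeholders that must be replaced with values from your own cluster (see the documentation above):
# the original cluster ID can usually be grepped from the PD log
grep "init cluster id" /data/deploy/log/pd.log
# alloc-id must be larger than any ID the old cluster already allocated; pick a safely large value
pd-recover -endpoints http://192.168.1.1:2379 -cluster-id 6747551640615446306 -alloc-id 100000000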
[Reference Cases]
After a power outage the TiDB cluster was restarted, and 2 of the 3 PD nodes reported errors on startup
https://asktug.com/t/tidb-pd-3-pd/1369
[Further Reading]
EXT4 file system study (5): data corrupted by power loss, mount failure after restart, and repair. For reference only, not a standard procedure; a failed fsck may cause further data loss.
https://blog.csdn.net/TSZ0000/article/details/84664865
Case 3
[Problem]
Error reported while starting TiKV:
ERRO tikv-server.rs:155: failed to create kv engine: "Corruption: Sst file size mismatch: /data/tidb/deploy/data/db/67704904.sst. Size recorded in manifest 325143, actual size 0"
[Root Cause]
The server was restarted before the data had been synced to disk.
[Solution]
Take the node offline, then scale out a new one; see the scale-in/scale-out procedure (a command sketch follows the links below).
https://pingcap.com/docs-cn/stable/how-to/scale/with-ansible/
This problem will be fixed in a later version:
https://github.com/tikv/tikv/pull/4807
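A sketch of taking the damaged store offline before redeploying, assuming a tidb-ansible deployment; the PD address, store ID, and host alias below are hypothetical placeholders:
# find the store ID of the damaged TiKV node
pd-ctl -u http://192.168.1.1:2379 -d store
# mark the store for removal; PD migrates its regions to other nodes
pd-ctl -u http://192.168.1.1:2379 -d store delete 5
# once the store state becomes Tombstone, stop the node and follow the scaling guide to redeploy
ansible-playbook stop.yml -l tikv_damaged_host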
[Reference Cases]
TiKV node fails to start
https://asktug.com/t/tikv/1375
[Further Reading]
RocksDB - MANIFEST
https://www.jianshu.com/p/d1b38ce0d966
Case 4
[Problem]
Error reported while starting TiKV:
ERRO panic_hook.rs:104: thread 'raftstore-11' panicked '[region 125868] 323807 to_commit 181879 is out of range [last_index 181878]' at "/home/jenkins/.cargo/git/checkouts/raft-rs-841f8a6db665c5c0/b10d74c/src/raft_log.rs:248" stack backtrace:
[Root Cause]
"to_commit out of range" means this peer is trying to commit a log entry that does not exist, indicating that recent raft logs were lost due to some deliberate operation or an abnormal event.
[Solution]
1. Use the tikv-ctl tool to locate the damaged regions, specifying the db directory (the data directory of the damaged TiKV node).
2. Repair the data with tikv-ctl.
2.1 If the repair fails with the following:
set_region_tombstone: StringError("The peer is still in target peers")
then running region tombstone with tikv-ctl requires checking the region peers on the damaged node; the abnormal peer must be removed manually.
2.2 Rerun the tikv-ctl repair afterwards. (A command sketch follows this list.)
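A tikv-ctl sketch for the steps above, assuming the damaged TiKV instance is stopped; the db path follows the layout from case 3, and the PD address and store ID are hypothetical placeholders:
# list regions whose data is damaged on this node
tikv-ctl --db /data/tidb/deploy/data/db bad-regions
# mark a damaged region as tombstone on this node (125868 is the region from the panic above)
tikv-ctl --db /data/tidb/deploy/data/db tombstone -r 125868 -p 192.168.1.1:2379
# if that fails with "The peer is still in target peers", remove the abnormal peer first, then retry
pd-ctl -u http://192.168.1.1:2379 -d operator add remove-peer 125868 <store_id>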
[Reference Cases]
What causes the TiKV error ERRO panic_hook.rs:104
https://asktug.com/t/tikv-erro-panic-hook-rs-104/165
After a TiKV node goes down, startup reports "[region 32] 33 to_commit 405937 is out of range [last_index 405933]"
https://asktug.com/t/tikv-region-32-33-to-commit-405937-is-out-of-range-last-index-405933/1922
[Further Reading]
Raft log replication
https://www.jianshu.com/p/b28e73eefa88
Case 5
[Problem]
Error reported while starting PD:
FAILED - RETRYING: wait until the PD health page is available (12 retries left).
[Root Cause]
Abnormal IP address configuration.
[Solution]
1. Check whether a mix of internal and external network IPs is making PD unreachable.
2. Check whether the PD IP address was changed; if so, handle it by scaling the PD node in and out. (A health-check sketch follows this list.)
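A sketch of reproducing the health check by hand, assuming PD's client port is the default 2379 and 192.168.1.1 is a hypothetical PD host:
# poll PD's health endpoint, which the ansible task is waiting on
curl http://192.168.1.1:2379/health
# list the PD members and their advertised URLs to spot a stale or mismatched IP
pd-ctl -u http://192.168.1.1:2379 -d member
pd-ctl -u http://192.168.1.1:2379 -d health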
[Reference Cases]
How to update the cluster after a node's IP changes
https://asktug.com/t/ip/1106
TiDB cluster fails to start
https://asktug.com/t/tidb/1563
[Further Reading]
TiDB Best Practices Series (2): PD Scheduling Policy Best Practices
https://pingcap.com/blog-cn/best-practice-pd/
Case 6
[Problem]
TiDB fails to start, and tidb_stderr.log reports:
fatal error: runtime: out of memory
[Root Cause]
overcommit_memory was set to 2 (echo 2 > /proc/sys/vm/overcommit_memory). In this strict accounting mode the kernel refuses the large virtual address space reservations the Go runtime makes at startup, so TiDB aborts with an out-of-memory error.
[Solution]
Set echo 0 > /proc/sys/vm/overcommit_memory. (A sketch follows.)
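A minimal sketch of checking and restoring the overcommit policy, and persisting it across reboots:
# check the current policy (0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting)
cat /proc/sys/vm/overcommit_memory
# restore the default heuristic policy
echo 0 > /proc/sys/vm/overcommit_memory
# persist the setting across reboots
echo "vm.overcommit_memory = 0" >> /etc/sysctl.conf
sysctl -p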
[Reference Cases]
TiDB cannot start after a change to the memory overcommit policy took it offline
https://asktug.com/t/tidb/1716
[Further Reading]
The overcommit_memory problem on Linux
https://blog.csdn.net/houjixin/article/details/46412557
Case 7
[Problem]
Error reported while starting the TiDB cluster:
Ansible FAILED! => playbook: start.yml; TASK: Check grafana API Key list; message: {"changed": false, "connection": "close", "content": "{\"message\":\"Invalid username or password\"}", "content_length": "42", "content_type": "application/json; charset=UTF-8", "date": "Wed, 25 Dec 2019 02:22:44 GMT", "json": {"message": "Invalid username or password"}, "msg": "Status code was 401 and not [200]: HTTP Error 401: Unauthorized", "redirected": false, "status": 401, "url": "http://192.168.179.112:3000/api/auth/keys"}
[Root Cause]
The Grafana password was changed.
[Solution]
Update the username and password configured in inventory.ini to the new values as well (a sketch follows).
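A sketch of the relevant inventory.ini settings, assuming a tidb-ansible deployment; grafana_admin_user and grafana_admin_password are the variable names from tidb-ansible's inventory.ini, and new_password is a placeholder:
# in tidb-ansible/inventory.ini, under [all:vars], keep these in sync with Grafana
grafana_admin_user = "admin"
grafana_admin_password = "new_password"
# then rerun the startup playbook
ansible-playbook start.yml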
[Reference Cases]
Error when starting the TiDB cluster
https://asktug.com/t/topic/2253
[Further Reading]
A Complete Breakdown of Grafana
https://www.jianshu.com/p/7e7e0d06709b
Case 8
[Problem]
While starting the TiDB cluster, the TiDB log reports:
[error="[global:3]critical error write binlog failed, the last error no avaliable pump to write binlog"]
[Root Cause]
Caused by Pump and Drainer state: a Drainer registered as alive was actually down, so Pump could not write binlog.
[Solution]
The pump error was: fail to notify all living drainer: notify drainer. After starting the drainer and then taking it offline properly, start.yml ran successfully. (A sketch follows.)
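A binlogctl sketch for inspecting and cleaning up Pump/Drainer state, assuming binlogctl from tidb-tools and a hypothetical PD address; the node ID is a placeholder taken from the drainers listing:
# inspect the pump and drainer states registered in PD
binlogctl -pd-urls=http://192.168.1.1:2379 -cmd pumps
binlogctl -pd-urls=http://192.168.1.1:2379 -cmd drainers
# take a dead drainer offline properly so pump stops trying to notify it
binlogctl -pd-urls=http://192.168.1.1:2379 -cmd offline-drainer -node-id <drainer-node-id>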
[Reference Cases]
The tidb service has started, but "wait until the TiDB port is up" fails
https://asktug.com/t/topic/2606
[Further Reading]
TiDB Binlog Overview
https://pingcap.com/docs-cn/stable/reference/tidb-binlog/overview/