Cassandra Data Repair Failures

Background

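To keep data consistent across Cassandra nodes, repair has to be run regularly. Once the data set grows large, however, repair is no longer a simple operation: it often runs into one problem or another and fails. This article collects some common errors and how to resolve them.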

The "Some repair failed" error

Running nodetool repair keyspace table may produce the following error:

java.lang.RuntimeException: Repair job has failed with the error message: [2020-08-28 16:27:23,499] Some repair failed
    at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
    at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

This message alone gives no indication of where the problem lies. The detailed error has to be looked up in logs/system.log.

Run cat logs/system.log | grep ERROR -A10 to inspect the log. If the error is a Validation failed error, for example:

... Validation failed in /10.10.10.45
    at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:64) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:182) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:493) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:162) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_171]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_171]

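This means the data on node 10.10.10.45 is abnormal and failed validation. In this case, run the scrub command on that node to discard the corrupted data. Then go back to the previous step and run repair again.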
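To find the affected nodes quickly, the log search can be narrowed to validation failures. The following is a minimal sketch, assuming the log lines contain `Validation failed in /<ip>` as in the sample trace above. The function name `failed_validation_nodes` and the keyspace and table names in the usage comment are hypothetical.

```shell
# Print the unique IPv4 addresses named in "Validation failed in /<ip>" lines.
# Assumption: the log format matches the sample trace shown above.
failed_validation_nodes() {
    # $1: path to system.log
    sed -n 's|.*Validation failed in /\([0-9][0-9.]*\).*|\1|p' "$1" | sort -u
}

# Usage, on the node where repair was started:
#   failed_validation_nodes logs/system.log
# Then, on each reported node (my_keyspace and my_table are placeholders):
#   nodetool scrub my_keyspace my_table
```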

If it is a different error, for example:

ERROR [GossipTasks:1] 2020-08-28 15:44:16,628 RepairSession.java:338 - [repair #f22193b0-e900-11ea-aee6-ef48888d996a] session completed with the following error
java.io.IOException: Endpoint /10.10.10.45 died
    at org.apache.cassandra.repair.RepairSession.convict(RepairSession.java:337) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:307) [apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:802) [apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:68) [apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:194) [apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) [apache-cassandra-3.11.2.jar:3.11.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_171]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_171]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_171]

This kind of error appears when a node is under heavy disk or network load. Running repair again is enough.
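Since this failure is transient, the rerun can be automated with a bounded retry. This is only a sketch: it assumes nodetool exits with a non-zero status when repair fails, and `retry_repair`, `MAX_ATTEMPTS`, and `RETRY_DELAY` are hypothetical names with illustrative defaults rather than tuned values.

```shell
# Retry "nodetool repair" a bounded number of times, pausing between attempts.
# Assumption: nodetool returns a non-zero exit status when the repair fails.
retry_repair() {
    # $1: keyspace, $2: table
    rr_max="${MAX_ATTEMPTS:-3}"
    rr_attempt=1
    while [ "$rr_attempt" -le "$rr_max" ]; do
        if nodetool repair "$1" "$2"; then
            echo "repair succeeded on attempt $rr_attempt"
            return 0
        fi
        echo "repair attempt $rr_attempt failed" >&2
        # Pause only if another attempt will follow.
        if [ "$rr_attempt" -lt "$rr_max" ]; then
            sleep "${RETRY_DELAY:-300}"
        fi
        rr_attempt=$((rr_attempt + 1))
    done
    return 1
}

# Usage (placeholder names):
#   MAX_ATTEMPTS=3 RETRY_DELAY=600 retry_repair my_keyspace my_table
```

A pause between attempts gives the loaded node time to recover. How long to wait depends on the cluster.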

Repair process hangs

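When running repair, the process may also appear to hang: the repair process still exists, but no node has any compaction tasks (repair and scrub trigger compaction tasks on the nodes involved, which can be checked with the compactionstats command).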
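Checking every node by hand is tedious, so the compactionstats check can be looped over the cluster. This sketch assumes each host accepts remote JMX connections from nodetool (via `-h`). The function name `compaction_overview` and the addresses in the usage comment are hypothetical.

```shell
# Print the compactionstats output of each host under a header line.
# Assumption: every host is reachable by nodetool over remote JMX.
compaction_overview() {
    for co_host in "$@"; do
        echo "== $co_host =="
        nodetool -h "$co_host" compactionstats
    done
}

# Usage (placeholder addresses):
#   compaction_overview 10.10.10.44 10.10.10.45 10.10.10.46
```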

In this case, try a different repair mode. For example, if a full repair with repair -full -pr hangs, try switching to incremental repair.

On 3.x it is best to use incremental repair. With no extra options, repair is incremental by default.
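The two modes differ only in the options passed to nodetool repair. A small wrapper, sketched here with the hypothetical name `run_repair`, makes the switch explicit. It relies on the 3.x default described above, where omitting `-full` gives an incremental repair.

```shell
# Run a repair in the requested mode.
# $1: mode ("full" or "incremental"), $2: keyspace, $3: table
run_repair() {
    case "$1" in
        full)        nodetool repair -full -pr "$2" "$3" ;;
        # No extra options: incremental by default on 3.x.
        incremental) nodetool repair "$2" "$3" ;;
        *)           echo "unknown mode: $1" >&2; return 2 ;;
    esac
}

# Usage (placeholder names):
#   run_repair incremental my_keyspace my_table
```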

Process summary

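  • Check the logs to find which node has the problem
  • Run scrub on that node to discard the corrupted data
  • Run repair again
  • Run listsnapshots to view the snapshots (scrub creates a snapshot)
  • Run clearsnapshot to remove the snapshots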
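The steps above can be sketched as a single function. It assumes the affected node accepts remote JMX connections from nodetool. The function name `scrub_and_repair` and all argument values are hypothetical. Snapshot removal is left as a manual step so the listsnapshots output can be reviewed first.

```shell
# Scrub the affected node, rerun repair, then list snapshots for cleanup.
# $1: affected host, $2: keyspace, $3: table
scrub_and_repair() {
    # Stop at the first failing step instead of continuing blindly.
    nodetool -h "$1" scrub "$2" "$3" || return 1
    nodetool repair "$2" "$3" || return 1
    nodetool -h "$1" listsnapshots
}

# Usage (placeholder names):
#   scrub_and_repair 10.10.10.45 my_keyspace my_table
# After reviewing the listing, remove a snapshot by its tag:
#   nodetool -h <host> clearsnapshot -t <snapshot_tag> <keyspace>
```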
