Mysql 使用 B+ tree 这种数据结构保存数据,Cassandra 使用 LSM tree。其结果就是 Cassandra 的删除操作也变成了一种写入操作,只不过写入的是一个 tomb, 而真正的删除,在 compaction 时才发生。
如果cassandra开启了 TRACING ON, 然后查询报错:
多半是 system_traces KEYSPACE 的配置有问题,导致跨 DC 写数据超时(因为开启tracing需要向 system_traces 中的 table 写入数据),需要:
alter keyspace system_traces xxx
关于 nodetool repair:
nodetool repair # 默认会repair 该node自己的token range 和 保存的 replicas 的 token range, 所以会牵扯到好几个node的相互之间数据对比
nodetool repair -pr # 只会 repair 该node 自己的token range, 所以,如果需要将某个keyspace的数据全部repair,需要在所有 hold 这个 keyspace的 node 上运行 nodetool repair -pr
除了手动执行 repair, cassandra 还会在query该数据的时候触发 repair, 即 read repair. 关于 Read Repair:
Two types of read repair: foreground and background.
Foreground here means blocking – we complete all operations before returning to the client. Background means non-blocking – we begin the background repair operation and then return to the client before it has completed.
关于配置 background Read Repair 的参数:
dclocal_read_repair_chance = 0.1 # 同一个dc中随机read_repair的概率是 10%
read_repair_chance = 0.0 # cluster 中随机 read_repaire的概率是 0
speculative_retry = xxx
• ALWAYS: The coordinator node always sends extra read requests to all replicas after every read of this table.
• NONE: The coordinator node never sends extra read requests after a read of this table.
• Xpercentile: The coordinator node sends extra read requests if the table’s latency is higher than normal. That is, extra reads are triggered after a percent of the normal read latency of a table elapses. For example, the coordinator sends out extra read requests after waiting for 48ms (80 percent of 60ms) if a table’s latency is 60 seconds on average.
• Nms: The coordinator node sends out extra read requests if it doesn’t receive any results from the target node in N milliseconds.
注意, 参数 dclocal_read_repair_chance 和 speculative_retry 同时使用好像有个bug,或导致跨DC 的read repair, check this link, 需要关闭其一,例如:
alter table test with dclocal_read_repair_chance=0.0;
关于如何查看 replication latency:
replication 发生各个node之间,通过查看CrossNodeLatency 指标可以看到 cassandra的复制延时:
注意, cassandra 的Latency metrics 是统计node争个存活期间的数据,其中的Count是单调递增的,只有在node重启时才会reset.
Count=117 表示从该node start开始,接收到177个Message,其中平均延时是 520712.29534117645 微秒(0.52 秒);
OneMinuteRate = 1.356... 表示每分钟平均收到 1.35 个Message;
99thPercentile : 5960319.812 表示 99%的消息在 5.96 秒内收到;
…
curl http://127.0.0.1:8778/org.apache.cassandra.metrics:type=Messaging,name=* | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2311 0 2311 0 0 2274 0 --:--:-- 0:00:01 --:--:-- 2274
{
"request": {
"mbean": "org.apache.cassandra.metrics:name=*,type=Messaging",
"type": "read"
},
"value": {
"org.apache.cassandra.metrics:name=CrossNodeLatency,type=Messaging": {
"Mean": 520712.29534117645,
"StdDev": 1200436.6643582033,
"75thPercentile": 268650.95,
"98thPercentile": 5960319.812,
"RateUnit": "events/second",
"95thPercentile": 4139110.981,
"99thPercentile": 5960319.812,
"Max": 5960319.812,
"Count": 117,
"FiveMinuteRate": 0.33695134016087785,
"50thPercentile": 186563.16,
"MeanRate": 2.661284598969726,
"Min": 2816.16,
"OneMinuteRate": 1.3564407341962146,
"DurationUnit": "microseconds",
"999thPercentile": 5960319.812,
"FifteenMinuteRate": 0.11664276131878955
}
},
"timestamp": 1586250903,
"status": 200
}
Cassandra使用总结:
1. cassandra 在跨区传输时,如果网络丢包严重,当写入量很大时,会导致产生很多的hints, 影响数据同步。可在 /var/lib/cassandar/hints 目录下看到大量文件。适当提高 hinted_handoff_throttle_in_kb,max_hints_delivery_threads 能改善同步性能,但是不能解决根本问题。最好是设置好replicates 的本地存储。避免跨网络不好的的replicates传输。查看日志能看到coordinater又大量的日志:
INFO [HintsDispatcher:73] 2020-05-25 03:26:03,615 HintsDispatchExecutor.java:289 - Finished hinted handoff of file 6766cc74-59a5-455f-90cf-2ef8b91df6ea-1590376417285-1.hints to endpoint /1.2.3.4: 6766cc74-59a5-455f-90cf-2ef8b91df6ea, partially
查看 metric 也能看到很多HintsTimeOut:
curl http://localhost:8778/jolokia/list
curl http://localhost:8778/jolokia/read/org.apache.cassandra.metrics:type=HintsService,name=HintsTimedOut | python -m json.tool
这时使用 nodetool repair
也无济于事,只能等很长时间后(根据数据量和网络情况,数分钟至数小时。。。),coordinater 将所有的hints传到该node,数据才能同步上。
2 . 在写入时序数据时,可设置compatction为 TimeWindowCompactionStrategy
,但是注意,如果 gc_grace_seconds 时间设置为0,会导致不能生成 hints(或者生成 hints 立即 expire了). check this link.
三个参数:
推荐设置: