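The main purpose of Cassandra's hinted handoff is to preserve writes while a node is down, or while the network between replica nodes briefly drops, so that the replicas can become consistent again once the nodes are back to normal.
1. Structure of the hints table
When a write cannot be delivered to a replica node, the data is saved in the hints system table.
The hints system table is a composite column family. Its row key is target_id, the host ID (a UUID) of the target node;
the composite column name consists of hint_id, a time-based UUID generated when the hint is created, and message_version, the messaging version (MessagingService.current_version) in use when the hint was serialized;
the only regular column is mutation, which stores the serialized RowMutation that needs to be delivered.
Each hint is therefore stored as a single column.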
The hint schema looks like this:

CREATE TABLE hints (
    target_id uuid,
    hint_id timeuuid,
    message_version int,
    mutation blob,
    PRIMARY KEY (target_id, hint_id, message_version)
) WITH COMPACT STORAGE;
When delivery to a replica fails, the hintFor method of RowMutation is called to convert the mutation into the format that is inserted into the hints table.
/**
 * Returns mutation representing a Hints to be sent to address
 * as soon as it becomes available. See HintedHandoffManager for more details.
 */
public static RowMutation hintFor(RowMutation mutation, UUID targetId) throws IOException
{
    RowMutation rm = new RowMutation(Table.SYSTEM_KS, UUIDType.instance.decompose(targetId));
    UUID hintId = UUIDGen.getTimeUUID();
    // determine the TTL for the RowMutation
    // this is set at the smallest GCGraceSeconds for any of the CFs in the RM
    // this ensures that deletes aren't "undone" by delivery of an old hint
    int ttl = Integer.MAX_VALUE;
    for (ColumnFamily cf : mutation.getColumnFamilies())
        ttl = Math.min(ttl, cf.metadata().getGcGraceSeconds());
    // serialize the hint with id and version as a composite column name
    QueryPath path = new QueryPath(SystemTable.HINTS_CF, null, HintedHandOffManager.comparator.decompose(hintId, MessagingService.current_version));
    rm.add(path, ByteBuffer.wrap(FBUtilities.serialize(mutation, serializer, MessagingService.current_version)), System.currentTimeMillis(), ttl);
    return rm;
}
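The TTL rule in hintFor can be looked at on its own. The following is only a minimal sketch, assuming the per-table gc_grace_seconds values are available as plain integers; the class and method names (HintTtlSketch, hintTtlSeconds) are made up for illustration and are not part of Cassandra.

```java
import java.util.List;

/** Sketch only: models the TTL choice made in hintFor with plain integers. */
final class HintTtlSketch {
    /**
     * Returns the smallest gc_grace_seconds among the tables touched by a mutation.
     * A hint that expires no later than the shortest grace period cannot be replayed
     * after a tombstone for the same data has already been purged.
     */
    static int hintTtlSeconds(List<Integer> gcGraceSecondsPerTable) {
        int ttl = Integer.MAX_VALUE;
        for (int gcGrace : gcGraceSecondsPerTable) {
            ttl = Math.min(ttl, gcGrace);
        }
        return ttl;
    }

    public static void main(String[] args) {
        // Two hypothetical tables: 10 days and 1 hour of grace.
        System.out.println(hintTtlSeconds(List.of(864_000, 3_600))); // prints 3600
    }
}
```

With one table at 10 days and another at 1 hour, the hint lives for 1 hour, so it can never outlive the shorter of the two grace periods.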
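2. How hints are generated
A NoSQL database such as Cassandra stores each piece of data on several replicas to keep it safe: if a node goes down or its data is damaged, the data can still be read and recovered from the other nodes, so the running workload is not affected and the data can eventually be restored.
When a write to a replica fails, two conditions decide whether a hint is recorded:
(1) Whether hinted handoff is enabled, which is controlled by hinted_handoff_enabled in cassandra.yaml (enabled by default).
(2) Whether the node's downtime, as reported by the Gossiper, exceeds the maximum hint window. The window is set by max_hint_window_in_ms in cassandra.yaml; the shipped file sets it to 3 hours, but if the option is missing the value falls back to 1 hour.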
public static boolean shouldHint(InetAddress ep)
{
    if (!DatabaseDescriptor.hintedHandoffEnabled())
    {
        HintedHandOffManager.instance.metrics.incrPastWindow(ep);
        return false;
    }
    boolean hintWindowExpired = Gossiper.instance.getEndpointDowntime(ep) > DatabaseDescriptor.getMaxHintWindow();
    if (hintWindowExpired)
    {
        HintedHandOffManager.instance.metrics.incrPastWindow(ep);
        logger.trace("not hinting {} which has been down {}ms", ep, Gossiper.instance.getEndpointDowntime(ep));
    }
    return !hintWindowExpired;
}
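Stripped of metrics and logging, the decision above reduces to two checks. The sketch below is a stand-alone model, not Cassandra code: the class HintWindowSketch and its parameters are hypothetical, and the downtime is passed in directly instead of being read from the Gossiper.

```java
/** Sketch only: the two conditions from shouldHint, without metrics or logging. */
final class HintWindowSketch {
    static final long THREE_HOURS_MS = 3L * 60 * 60 * 1000; // value shipped in cassandra.yaml
    static final long ONE_HOUR_MS = 60L * 60 * 1000;        // fallback when the option is absent

    /**
     * A hint is written only if hinted handoff is enabled and the node has not
     * been down for longer than the hint window.
     */
    static boolean shouldHint(boolean hintedHandoffEnabled, long downtimeMs, long maxHintWindowMs) {
        if (!hintedHandoffEnabled) {
            return false;
        }
        boolean hintWindowExpired = downtimeMs > maxHintWindowMs;
        return !hintWindowExpired;
    }

    public static void main(String[] args) {
        long twoHoursMs = 2L * 60 * 60 * 1000;
        System.out.println(shouldHint(true, twoHoursMs, THREE_HOURS_MS)); // prints true
        System.out.println(shouldHint(true, twoHoursMs, ONE_HOUR_MS));    // prints false
    }
}
```

A node that has been down for two hours still receives hints under the 3-hour window from the shipped configuration file, but not under the 1-hour fallback.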
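Once it is decided that a hint should be written, the hint is handed to a HintRunnable task, which does the actual recording. At this point the system also tracks the total number of hint tasks in progress and the number in progress for each target node.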
public static Future submitHint(final RowMutation mutation,
                                final InetAddress target,
                                final AbstractWriteResponseHandler responseHandler,
                                final ConsistencyLevel consistencyLevel)
{
    // local write that time out should be handled by LocalMutationRunnable
    assert !target.equals(FBUtilities.getBroadcastAddress()) : target;
    HintRunnable runnable = new HintRunnable(target)
    {
        public void runMayThrow() throws IOException
        {
            logger.debug("Adding hint for {}", target);
            writeHintForMutation(mutation, target);
            // Notify the handler only for CL == ANY
            if (responseHandler != null && consistencyLevel == ConsistencyLevel.ANY)
                responseHandler.response(null);
        }
    };
    return submitHint(runnable);
}
private static Future submitHint(HintRunnable runnable)
{
    totalHintsInProgress.incrementAndGet();
    hintsInProgress.get(runnable.target).incrementAndGet();
    return (Future) StageManager.getStage(Stage.MUTATION).submit(runnable);
}
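These counters exist to keep a healthy node from running out of memory because it is handling too many hints, while still protecting writes to nodes that are online. The limit on in-flight hints is 1024 * FBUtilities.getAvailableProcessors(); once it is exceeded, a write whose destination already has hints in progress and would need another hint is rejected with an OverloadedException.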
// avoid OOMing due to excess hints. we need to do this check even for "live" nodes, since we can
// still generate hints for those if it's overloaded or simply dead but not yet known-to-be-dead.
// The idea is that if we have over maxHintsInProgress hints in flight, this is probably due to
// a small number of nodes causing problems, so we should avoid shutting down writes completely to
// healthy nodes. Any node with no hintsInProgress is considered healthy.
if (totalHintsInProgress.get() > maxHintsInProgress
    && (hintsInProgress.get(destination).get() > 0 && shouldHint(destination)))
{
    throw new OverloadedException("Too many in flight hints: " + totalHintsInProgress.get());
}
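To see why this check spares healthy nodes, the two counters can be modelled with standard-library types. This is a rough sketch under stated assumptions: HintBackpressureSketch and its methods are hypothetical names, targets are plain strings, and the result of shouldHint is passed in as a boolean.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: a global and a per-target counter of hints in flight. */
final class HintBackpressureSketch {
    private final int maxHintsInProgress;
    private final AtomicInteger totalHintsInProgress = new AtomicInteger();
    private final ConcurrentHashMap<String, AtomicInteger> hintsInProgress = new ConcurrentHashMap<>();

    HintBackpressureSketch(int maxHintsInProgress) {
        this.maxHintsInProgress = maxHintsInProgress;
    }

    /** Called when a hint task for the target is submitted. */
    void hintStarted(String target) {
        totalHintsInProgress.incrementAndGet();
        hintsInProgress.computeIfAbsent(target, t -> new AtomicInteger()).incrementAndGet();
    }

    /** Called when a hint task for the target completes. */
    void hintFinished(String target) {
        AtomicInteger perTarget = hintsInProgress.get(target);
        if (perTarget != null && perTarget.get() > 0) {
            perTarget.decrementAndGet();
            totalHintsInProgress.decrementAndGet();
        }
    }

    /**
     * A write is rejected only when the global limit is exceeded AND the destination
     * already has hints in flight AND a new hint would be written for it.
     * Destinations with no hints in flight count as healthy and are never rejected.
     */
    boolean shouldReject(String destination, boolean wouldHint) {
        AtomicInteger perTarget = hintsInProgress.get(destination);
        int inFlightForTarget = perTarget == null ? 0 : perTarget.get();
        return totalHintsInProgress.get() > maxHintsInProgress
                && inFlightForTarget > 0
                && wouldHint;
    }
}
```

In this model, a single slow target can push the total over the limit, yet writes to every other destination keep flowing because their per-target counter is still zero.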
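3. How hints are delivered
Hint delivery is triggered in two ways:
(1) When a node starts and joins the cluster, it schedules a background task that processes hints every 10 minutes.
(2) When a node's state changes to live, delivery of that node's hints is also requested.
Delivery first queries the hints table (a risk: this reads out all of the data, so if the hints table holds too much data it can cause an OutOfMemoryError) and then processes each node's hints in turn.
Each row key is then mapped to the node's IP address, and a delivery thread from the HintedHandoff thread pool sends the data. The size of that pool is set by max_hints_delivery_threads in cassandra.yaml (max_hints_delivery_threads: 2).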
Runnable runnable = new Runnable()
{
    public void run()
    {
        scheduleAllDeliveries();
        metrics.log();
    }
};
StorageService.optionalTasks.scheduleWithFixedDelay(runnable, 10, 10, TimeUnit.MINUTES);
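The scheduling call above uses the standard scheduleWithFixedDelay semantics: the delay is measured from the end of one run to the start of the next, so a slow delivery round never overlaps with the following one. Below is a small stand-alone sketch of the same pattern with a plain ScheduledExecutorService; the class PeriodicDeliverySketch and the millisecond delays are assumptions chosen so the example finishes quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch only: runs a task repeatedly with a fixed delay between runs. */
final class PeriodicDeliverySketch {
    /**
     * Schedules the task with the same initial delay and period, then reports
     * whether it ran at least `times` times before the timeout elapsed.
     */
    static boolean runsAtLeast(int times, long delayMillis, long timeoutMillis, Runnable task) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch latch = new CountDownLatch(times);
        try {
            executor.scheduleWithFixedDelay(() -> {
                task.run();        // stands in for scheduleAllDeliveries()
                latch.countDown();
            }, delayMillis, delayMillis, TimeUnit.MILLISECONDS);
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow(); // stop the background thread so the JVM can exit
        }
    }

    public static void main(String[] args) {
        boolean ok = runsAtLeast(3, 10, 5_000, () -> System.out.println("delivery round"));
        System.out.println("ran three rounds: " + ok);
    }
}
```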
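For a single node, nothing is sent if hinted handoff has been paused. Delivery can be paused with the nodetool pausehandoff command and resumed with resumehandoff.
Once the target node's schema version is known to match the local one and the node is alive, hint delivery begins:
(1) Hints are read in pages of up to 128 records, each page forming one batch to send.
(2) The mutation column of each record is deserialized into the corresponding RowMutation and wrapped in a message.
(3) After a successful send, a column tombstone is written to delete the corresponding record.
(4) After all hints for the given node have been sent, the hints system table is force-flushed and all of its SSTables go through a full (major) compaction.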
private void deliverHintsToEndpointInternal(InetAddress endpoint) throws IOException, DigestMismatchException, InvalidRequestException, InterruptedException
{
    ColumnFamilyStore hintStore = Table.open(Table.SYSTEM_KS).getColumnFamilyStore(SystemTable.HINTS_CF);
    if (hintStore.isEmpty())
        return; // nothing to do, don't confuse users by logging a no-op handoff
    // check if hints delivery has been paused
    if (hintedHandOffPaused)
    {
        logger.debug("Hints delivery process is paused, aborting");
        return;
    }
    logger.debug("Checking remote({}) schema before delivering hints", endpoint);
    try
    {
        waitForSchemaAgreement(endpoint);
    }
    catch (TimeoutException e)
    {
        return;
    }
    if (!FailureDetector.instance.isAlive(endpoint))
    {
        logger.debug("Endpoint {} died before hint delivery, aborting", endpoint);
        return;
    }
    // 1. Get the key of the endpoint we need to handoff
    // 2. For each column, deserialize the mutation and send it to the endpoint
    // 3. Delete the subcolumn if the write was successful
    // 4. Force a flush
    // 5. Do major compaction to clean up all deletes etc.
    // find the hints for the node using its token.
    UUID hostId = Gossiper.instance.getHostId(endpoint);
    logger.info("Started hinted handoff for host: {} with IP: {}", hostId, endpoint);
    final ByteBuffer hostIdBytes = ByteBuffer.wrap(UUIDGen.decompose(hostId));
    DecoratedKey epkey = StorageService.getPartitioner().decorateKey(hostIdBytes);
    final AtomicInteger rowsReplayed = new AtomicInteger(0);
    ByteBuffer startColumn = ByteBufferUtil.EMPTY_BYTE_BUFFER;
    int pageSize = PAGE_SIZE;
    // read less columns (mutations) per page if they are very large
    if (hintStore.getMeanColumns() > 0)
    {
        int averageColumnSize = (int) (hintStore.getMeanRowSize() / hintStore.getMeanColumns());
        pageSize = Math.min(PAGE_SIZE, DatabaseDescriptor.getInMemoryCompactionLimit() / averageColumnSize);
        pageSize = Math.max(2, pageSize); // page size of 1 does not allow actual paging b/c of >= behavior on startColumn
        logger.debug("average hinted-row column size is {}; using pageSize of {}", averageColumnSize, pageSize);
    }
    // rate limit is in bytes per second. Uses Double.MAX_VALUE if disabled (set to 0 in cassandra.yaml).
    int throttleInKB = DatabaseDescriptor.getHintedHandoffThrottleInKB();
    RateLimiter rateLimiter = RateLimiter.create(throttleInKB == 0 ? Double.MAX_VALUE : throttleInKB * 1024);
    while (true)
    {
        // check if hints delivery has been paused during the process
        if (hintedHandOffPaused)
        {
            logger.debug("Hints delivery process is paused, aborting");
            break;
        }
        QueryFilter filter = QueryFilter.getSliceFilter(epkey, new QueryPath(SystemTable.HINTS_CF), startColumn, ByteBufferUtil.EMPTY_BYTE_BUFFER, false, pageSize);
        ColumnFamily hintsPage = ColumnFamilyStore.removeDeleted(hintStore.getColumnFamily(filter), (int)(System.currentTimeMillis() / 1000));
        if (pagingFinished(hintsPage, startColumn))
        {
            if (ByteBufferUtil.EMPTY_BYTE_BUFFER.equals(startColumn))
            {
                // we've started from the beginning and could not find anything (only maybe tombstones)
                break;
            }
            else
            {
                // restart query from the first column until we read an empty row;
                // that will tell us everything was delivered successfully with no timeouts
                startColumn = ByteBufferUtil.EMPTY_BYTE_BUFFER;
                continue;
            }
        }
        for (final IColumn hint : hintsPage.getSortedColumns())
        {
            // Skip tombstones:
            // if we iterate quickly enough, it's possible that we could request a new page in the same millisecond
            // in which the local deletion timestamp was generated on the last column in the old page, in which
            // case the hint will have no columns (since it's deleted) but will still be included in the resultset
            // since (even with gcgs=0) it's still a "relevant" tombstone.
            if (!hint.isLive())
                continue;
            if (hintedHandOffPaused)
            {
                logger.debug("Hints delivery process is paused, aborting");
                break;
            }
            startColumn = hint.name();
            ByteBuffer[] components = comparator.split(hint.name());
            int version = Int32Type.instance.compose(components[1]);
            DataInputStream in = new DataInputStream(ByteBufferUtil.inputStream(hint.value()));
            RowMutation rm;
            try
            {
                rm = RowMutation.serializer.deserialize(in, version);
            }
            catch (UnknownColumnFamilyException e)
            {
                logger.debug("Skipping delivery of hint for deleted columnfamily", e);
                deleteHint(hostIdBytes, hint.name(), hint.maxTimestamp());
                continue;
            }
            MessageOut message = rm.createMessage();
            rateLimiter.acquire(message.serializedSize(MessagingService.current_version));
            WrappedRunnable callback = new WrappedRunnable()
            {
                public void runMayThrow() throws IOException
                {
                    rowsReplayed.incrementAndGet();
                    deleteHint(hostIdBytes, hint.name(), hint.maxTimestamp());
                }
            };
            IAsyncCallback responseHandler = new WriteResponseHandler(endpoint, WriteType.UNLOGGED_BATCH, callback);
            MessagingService.instance().sendRR(message, endpoint, responseHandler);
        }
        // check if node is still alive and we should continue delivery process
        if (!FailureDetector.instance.isAlive(endpoint))
        {
            logger.debug("Endpoint {} died during hint delivery, aborting", endpoint);
            return;
        }
    }
    try
    {
        compact().get();
    }
    catch (Exception e)
    {
        throw new RuntimeException(e);
    }
    logger.info(String.format("Finished hinted handoff of %s rows to endpoint %s", rowsReplayed, endpoint));
    if (hintedHandOffPaused)
    {
        logger.info("Hints delivery process is paused, not delivering further hints");
    }
}
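The paging part of the method above is easier to follow in isolation. The sketch below models one target's hints as an in-memory TreeMap and is only an approximation under several assumptions: the names (HintPagingSketch, pageSize, deliverOnePass) are invented, sending is synchronous, and only a single forward pass is made, whereas the real method sends asynchronously, deletes in a callback, and restarts from the first column until it reads an empty row. The guard against a zero average column size is also an addition that does not appear in the excerpt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Sketch only: paged, delete-on-ack replay of one target's hints. */
final class HintPagingSketch {
    static final int PAGE_SIZE = 128;

    /** Shrinks the page when hints are large; never below 2 because the start column is inclusive. */
    static int pageSize(long meanRowSize, long meanColumns, int memoryLimitBytes) {
        if (meanColumns <= 0) {
            return PAGE_SIZE;
        }
        int averageColumnSize = (int) (meanRowSize / meanColumns);
        // Math.max(1, ...) avoids a division by zero; this guard is an assumption of the sketch.
        int size = Math.min(PAGE_SIZE, memoryLimitBytes / Math.max(1, averageColumnSize));
        return Math.max(2, size);
    }

    /**
     * Walks the hints once, page by page. `send` returns true when the target
     * acknowledged the write; acknowledged hints are removed, the others stay.
     * Returns the number of hints replayed.
     */
    static int deliverOnePass(TreeMap<Long, String> hints, int pageSize, Predicate<String> send) {
        int replayed = 0;
        Long start = null; // null plays the role of the empty start column
        while (true) {
            // Inclusive slice beginning at `start`, copied so the map can be modified afterwards.
            List<Map.Entry<Long, String>> page = new ArrayList<>();
            Map<Long, String> slice = (start == null) ? hints : hints.tailMap(start, true);
            for (Map.Entry<Long, String> e : slice.entrySet()) {
                if (page.size() == pageSize) {
                    break;
                }
                page.add(Map.entry(e.getKey(), e.getValue()));
            }
            // Finished when nothing is left or only the already-attempted start column remains.
            boolean onlyStart = page.size() == 1 && page.get(0).getKey().equals(start);
            if (page.isEmpty() || onlyStart) {
                return replayed;
            }
            for (Map.Entry<Long, String> hint : page) {
                if (hint.getKey().equals(start)) {
                    continue; // attempted on the previous page and not acknowledged
                }
                start = hint.getKey();
                if (send.test(hint.getValue())) {
                    hints.remove(hint.getKey()); // stands in for deleteHint(...)
                    replayed++;
                }
            }
        }
    }
}
```

Because each slice includes the start column, a page size of 1 could return nothing but the hint that was just attempted, which is why the page size is clamped to at least 2.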
4. Cleaning up hint data
After a hint has been delivered, a corresponding column tombstone is written; the hints table is then flushed and compacted.