With the basic principle understood, the code becomes easy to read. Here is the complete code of EncodeInternal:
template<bool look_for_target_matches>
void VCDiffEngine::EncodeInternal(const char* target_data,
                                  size_t target_size,
                                  OutputStringInterface* diff,
                                  CodeTableWriterInterface* coder) const {
  if (!hashed_dictionary_) {
    VCD_DFATAL << "Internal error: VCDiffEngine::Encode() "
                  "called before VCDiffEngine::Init()" << VCD_ENDL;
    return;
  }
  if (target_size == 0) {
    return;  // Do nothing for empty target
  }
  // Special case for really small input
  if (target_size < static_cast<size_t>(BlockHash::kBlockSize)) {
    AddUnmatchedRemainder(target_data, target_size, coder);
    FinishEncoding(target_size, diff, coder);
    return;
  }
  RollingHash<BlockHash::kBlockSize> hasher;
  BlockHash* target_hash = NULL;
  if (look_for_target_matches) {
    // Check matches against previously encoded target data
    // in this same target window, as well as against the dictionary
    target_hash = BlockHash::CreateTargetHash(target_data,
                                              target_size,
                                              dictionary_size());
    if (!target_hash) {
      VCD_DFATAL << "Instantiation of target hash failed" << VCD_ENDL;
      return;
    }
  }
  const char* const target_end = target_data + target_size;
  const char* const start_of_last_block = target_end - BlockHash::kBlockSize;
  // Offset of next bytes in string to ADD if NOT copied (i.e., not found in
  // dictionary)
  const char* next_encode = target_data;
  // candidate_pos points to the start of the kBlockSize-byte block that may
  // begin a match with the dictionary or previously encoded target data.
  const char* candidate_pos = target_data;
  uint32_t hash_value = hasher.Hash(candidate_pos);
  while (1) {
    const size_t bytes_encoded =
        EncodeCopyForBestMatch<look_for_target_matches>(
            hash_value,
            candidate_pos,
            next_encode,
            (target_end - next_encode),
            target_hash,
            coder);
    if (bytes_encoded > 0) {
      next_encode += bytes_encoded;  // Advance past COPYed data
      candidate_pos = next_encode;
      if (candidate_pos > start_of_last_block) {
        break;  // Reached end of target data
      }
      // candidate_pos has jumped ahead by bytes_encoded bytes, so UpdateHash
      // can't be used to calculate the hash value at its new position.
      hash_value = hasher.Hash(candidate_pos);
      if (look_for_target_matches) {
        // Update the target hash for the ADDed and COPYed data
        target_hash->AddAllBlocksThroughIndex(
            static_cast<int>(next_encode - target_data));
      }
    } else {
      // No match, or match is too small to be worth a COPY instruction.
      // Move to the next position in the target data.
      if ((candidate_pos + 1) > start_of_last_block) {
        break;  // Reached end of target data
      }
      if (look_for_target_matches) {
        target_hash->AddOneIndexHash(
            static_cast<int>(candidate_pos - target_data),
            hash_value);
      }
      hash_value = hasher.UpdateHash(hash_value,
                                     candidate_pos[0],
                                     candidate_pos[BlockHash::kBlockSize]);
      ++candidate_pos;
    }
  }
  AddUnmatchedRemainder(next_encode, target_end - next_encode, coder);
  FinishEncoding(target_size, diff, coder);
  delete target_hash;
}
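The first few lines were analyzed in the previous post. The block-matching logic starts at line 33, where target_end is declared. A few variables are set up first. next_encode marks the start of the data that has not been encoded yet. candidate_pos is the left edge of the current window; as described earlier, if the 16 bytes at the window position do not match any block, the window slides forward. target_end is the end of the data passed into this call, and start_of_last_block is the last position at which a full 16-byte window still fits. Once the right edge of the window reaches target_end, this round of encoding is over.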
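To make the role of start_of_last_block concrete, here is a minimal sketch that counts how many positions the window can occupy. CountWindowPositions is a hypothetical helper written for this post, not a function from open-vcdiff, and the block size is passed in as a parameter rather than read from BlockHash::kBlockSize.

```cpp
#include <cstddef>

// Hypothetical helper (not part of open-vcdiff): number of distinct starting
// offsets a block_size-byte window can take inside target_size bytes.
std::size_t CountWindowPositions(std::size_t target_size,
                                 std::size_t block_size) {
  if (block_size == 0 || target_size < block_size) {
    // Too small for even one window; EncodeInternal handles this case by
    // emitting the whole target as unmatched data.
    return 0;
  }
  // start_of_last_block sits at offset target_size - block_size, so the
  // valid offsets are 0 .. target_size - block_size, inclusive.
  return target_size - block_size + 1;
}
```

For a 40-byte target and 16-byte blocks the window can start at offsets 0 through 24, which is why the loop stops as soon as candidate_pos moves past start_of_last_block.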
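Next comes a loop. EncodeCopyForBestMatch does what was described earlier: using the 16 bytes of the current window it finds a matching block, then extends the match in both directions as far as it can until it has the longest match, and finally emits a COPY, possibly preceded by an ADD. Its return value is the number of target bytes it encoded, that is, the length of the ADD (if any) plus the length of the COPY. It returns 0 when no match is worth encoding.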
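The if (bytes_encoded > 0) branch means that a match was found and encoded successfully. next_encode advances by bytes_encoded, and candidate_pos is set to the new next_encode. The code then checks whether the whole target has been processed; if not, it computes the hash of the new window from scratch, because UpdateHash can only handle a one-byte shift. The following if (look_for_target_matches) means: if matches should also be searched for inside the target data itself, then the data encoded so far needs a block hash as well. That block hash is target_hash.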
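The difference between Hash and UpdateHash can be illustrated with a toy polynomial rolling hash. This is only a sketch: ToyHash, ToyUpdateHash, kWindow, and kMultiplier are names and constants chosen for this example, and the arithmetic simply wraps around at 32 bits. It is not the hash function that open-vcdiff's RollingHash actually uses.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative constants, assumed for this sketch only.
constexpr std::size_t kWindow = 16;
constexpr std::uint32_t kMultiplier = 257;

// Hash of the kWindow bytes starting at p, computed from scratch.
std::uint32_t ToyHash(const char* p) {
  std::uint32_t h = 0;
  for (std::size_t i = 0; i < kWindow; ++i) {
    h = h * kMultiplier + static_cast<unsigned char>(p[i]);
  }
  return h;
}

// Slide the window one byte to the right: drop old_first, append new_last.
std::uint32_t ToyUpdateHash(std::uint32_t h, char old_first, char new_last) {
  // kMultiplier^(kWindow - 1) is the weight of the byte leaving the window.
  // A real implementation would precompute this; it is recomputed here to
  // keep the sketch short.
  std::uint32_t top_weight = 1;
  for (std::size_t i = 1; i < kWindow; ++i) {
    top_weight *= kMultiplier;
  }
  h -= top_weight * static_cast<unsigned char>(old_first);
  return h * kMultiplier + static_cast<unsigned char>(new_last);
}
```

The update touches two bytes instead of sixteen, but it is only valid for a shift of exactly one byte. After a COPY the window jumps by bytes_encoded, so the hash has to be rebuilt with a full Hash call.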
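The else branch means that no match was found (or the match was too short to be worth a COPY). If self-matching is enabled, the hash of the current window is first offered to target_hash; then the hash is updated incrementally and candidate_pos moves forward by one byte. The target_hash call is a little misleading: a block is not actually added for every one-byte move. This becomes clear from the code of AddOneIndexHash; note the comment and the if check: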
// This function will be called to add blocks incrementally to the target hash
// as the encoding position advances through the target data.  It will be
// called for every kBlockSize-byte block in the target data, regardless
// of whether the block is aligned evenly on a block boundary.  The
// BlockHash will only store hash entries for the evenly-aligned blocks.
//
void AddOneIndexHash(int index, uint32_t hash_value) {
  if (index == NextIndexToAdd()) {
    AddBlock(hash_value);
  }
}
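The effect of that if check can be reproduced with a small stand-in class. ToyTargetHash below is hypothetical, and it assumes that NextIndexToAdd() returns the number of blocks stored so far multiplied by the block size; the source only shows the comparison, so treat that rule as an assumption of this sketch.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for the target BlockHash (not the open-vcdiff class).
class ToyTargetHash {
 public:
  static constexpr int kBlockSize = 16;

  // Assumed rule: the next block to store starts right after the last one.
  int NextIndexToAdd() const {
    return static_cast<int>(stored_indices_.size()) * kBlockSize;
  }

  // Called for every window position; stores only block-aligned ones.
  void AddOneIndexHash(int index, std::uint32_t hash_value) {
    if (index == NextIndexToAdd()) {
      stored_indices_.push_back(index);
      stored_hashes_.push_back(hash_value);
    }
  }

  const std::vector<int>& stored_indices() const { return stored_indices_; }

 private:
  std::vector<int> stored_indices_;
  std::vector<std::uint32_t> stored_hashes_;
};
```

Calling AddOneIndexHash for indices 0 through 39 stores entries only for 0, 16, and 32. The call is cheap for every other index, which is why the encoder can afford to make it on each one-byte step.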
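After the loop ends, whatever is left between next_encode and target_end, including the final bytes that cannot fill a whole block, is encoded directly as a VCDiff ADD by AddUnmatchedRemainder. FinishEncoding then writes the result to the output buffer.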
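The overall shape of the loop, including the final remainder, can be sketched with a heavily simplified encoder. ToyEncode and ToyInstruction are hypothetical names. The sketch looks up exact block strings in a map instead of hash values, extends matches only to the right, ignores the minimum match size, and matches against the dictionary only, so it shows the control flow rather than the real algorithm.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// One encoded instruction: ADD carries literal bytes, COPY refers to the
// dictionary by offset and size.
struct ToyInstruction {
  bool is_copy;
  std::string add_data;     // Used when is_copy is false
  std::size_t copy_offset;  // Used when is_copy is true
  std::size_t copy_size;    // Used when is_copy is true
};

std::vector<ToyInstruction> ToyEncode(const std::string& dict,
                                      const std::string& target,
                                      std::size_t block_size) {
  std::vector<ToyInstruction> out;
  if (block_size == 0) return out;
  // Index only the block-aligned blocks of the dictionary.
  std::unordered_map<std::string, std::size_t> blocks;
  for (std::size_t off = 0; off + block_size <= dict.size();
       off += block_size) {
    blocks.emplace(dict.substr(off, block_size), off);
  }
  std::size_t next_encode = 0;    // First byte not yet covered by ADD/COPY
  std::size_t candidate_pos = 0;  // Left edge of the sliding window
  while (candidate_pos + block_size <= target.size()) {
    auto it = blocks.find(target.substr(candidate_pos, block_size));
    if (it == blocks.end()) {
      ++candidate_pos;  // No match: slide the window by one byte
      continue;
    }
    // Extend the match to the right while both strings agree.
    std::size_t size = block_size;
    while (it->second + size < dict.size() &&
           candidate_pos + size < target.size() &&
           dict[it->second + size] == target[candidate_pos + size]) {
      ++size;
    }
    if (candidate_pos > next_encode) {  // Literal bytes before the match
      out.push_back({false,
                     target.substr(next_encode, candidate_pos - next_encode),
                     0, 0});
    }
    out.push_back({true, std::string(), it->second, size});
    next_encode = candidate_pos + size;
    candidate_pos = next_encode;
  }
  if (next_encode < target.size()) {  // Unmatched remainder: final ADD
    out.push_back({false, target.substr(next_encode), 0, 0});
  }
  return out;
}
```

With the dictionary "ABCDEFGH", the target "xxABCDEFyy", and a block size of 4, the sketch yields ADD "xx", COPY (offset 0, size 6), and a trailing ADD "yy" for the bytes the loop could not cover.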
Now let's look at EncodeCopyForBestMatch, which is called at the top of the loop:
template<bool look_for_target_matches>
inline size_t VCDiffEngine::EncodeCopyForBestMatch(
    uint32_t hash_value,
    const char* target_candidate_start,
    const char* unencoded_target_start,
    size_t unencoded_target_size,
    const BlockHash* target_hash,
    CodeTableWriterInterface* coder) const {
  // When FindBestMatch() comes up with a match for a candidate block,
  // it will populate best_match with the size, source offset,
  // and target offset of the match.
  BlockHash::Match best_match;

  // First look for a match in the dictionary.
  hashed_dictionary_->FindBestMatch(hash_value,
                                    target_candidate_start,
                                    unencoded_target_start,
                                    unencoded_target_size,
                                    &best_match);
  // If target matching is enabled, then see if there is a better match
  // within the target data that has been encoded so far.
  if (look_for_target_matches) {
    target_hash->FindBestMatch(hash_value,
                               target_candidate_start,
                               unencoded_target_start,
                               unencoded_target_size,
                               &best_match);
  }
  if (!ShouldGenerateCopyInstructionForMatchOfSize(best_match.size())) {
    return 0;
  }
  if (best_match.target_offset() > 0) {
    // Create an ADD instruction to encode all target bytes
    // from the end of the last COPY match, if any, up to
    // the beginning of this COPY match.
    coder->Add(unencoded_target_start, best_match.target_offset());
  }
  coder->Copy(best_match.source_offset(), best_match.size());
  return best_match.target_offset() +  // ADD size
         best_match.size();            // + COPY size
}
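The return value is easy to misread as the match length alone, so here is a minimal sketch of the byte accounting. ToyMatch and ToyBytesEncoded are hypothetical, and min_copy_size stands in for whatever threshold ShouldGenerateCopyInstructionForMatchOfSize applies; the listing does not show that threshold, so it is a parameter here.

```cpp
#include <cstddef>

// Hypothetical summary of a match, mirroring the two accessors used above.
struct ToyMatch {
  std::size_t size;           // Length of the matched run (COPY size)
  std::size_t target_offset;  // Match start, relative to the first
                              // unencoded target byte (ADD size)
};

// Returns how far next_encode should advance, or 0 if no COPY is emitted.
std::size_t ToyBytesEncoded(const ToyMatch& m, std::size_t min_copy_size) {
  if (m.size < min_copy_size) {
    return 0;  // Match too short: nothing is encoded
  }
  return m.target_offset + m.size;  // ADD bytes + COPY bytes
}
```

A match of 40 bytes that starts 5 bytes past the first unencoded byte therefore advances next_encode by 45, because the 5 skipped bytes are emitted as an ADD just before the COPY.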
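EncodeCopyForBestMatch itself carries little logic: it calls FindBestMatch to find the best match and then emits the ADD and the COPY. The heavy lifting is left to FindBestMatch, so let's look at the code of that function: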
void BlockHash::FindBestMatch(uint32_t hash_value,
                              const char* target_candidate_start,
                              const char* target_start,
                              size_t target_size,
                              Match* best_match) const {
  int match_counter = 0;
  for (int block_number = FirstMatchingBlockInline(hash_value,
                                                   target_candidate_start);
       (block_number >= 0) && !TooManyMatches(&match_counter);
       block_number = NextMatchingBlock(block_number, target_candidate_start)) {
    int source_match_offset = block_number * kBlockSize;
    const int source_match_end = source_match_offset + kBlockSize;

    int target_match_offset =
        static_cast<int>(target_candidate_start - target_start);
    const int target_match_end = target_match_offset + kBlockSize;

    size_t match_size = kBlockSize;
    {
      // Extend match start towards beginning of unencoded data
      const int limit_bytes_to_left = std::min(source_match_offset,
                                               target_match_offset);
      const int matching_bytes_to_left =
          MatchingBytesToLeft(source_data_ + source_match_offset,
                              target_start + target_match_offset,
                              limit_bytes_to_left);
      source_match_offset -= matching_bytes_to_left;
      target_match_offset -= matching_bytes_to_left;
      match_size += matching_bytes_to_left;
    }
    {
      // Extend match end towards end of unencoded data
      const size_t source_bytes_to_right = source_size_ - source_match_end;
      const size_t target_bytes_to_right = target_size - target_match_end;
      const size_t limit_bytes_to_right = std::min(source_bytes_to_right,
                                                   target_bytes_to_right);
      match_size +=
          MatchingBytesToRight(source_data_ + source_match_end,
                               target_start + target_match_end,
                               static_cast<int>(limit_bytes_to_right));
    }
    // Update in/out parameter if the best match found was better
    // than any match already stored in *best_match.
    best_match->ReplaceIfBetterMatch(match_size,
                                     source_match_offset + starting_offset_,
                                     target_match_offset);
  }
}
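The two braces in the loop body extend one matching block to the left and to the right. Here is a minimal sketch of that step on std::string inputs. ToyExtent and ToyExtendMatch are hypothetical names, and the sketch assumes the caller has already verified that the block_size bytes at the two offsets are equal and lie inside both strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Result of extending one block match in both directions.
struct ToyExtent {
  std::size_t source_offset;
  std::size_t target_offset;
  std::size_t size;
};

// Assumed precondition: source[source_offset, +block_size) equals
// target[target_offset, +block_size), and both ranges are in bounds.
ToyExtent ToyExtendMatch(const std::string& source, std::size_t source_offset,
                         const std::string& target, std::size_t target_offset,
                         std::size_t block_size) {
  // Extend to the left, bounded by whichever string runs out first.
  const std::size_t left_limit = std::min(source_offset, target_offset);
  std::size_t left = 0;
  while (left < left_limit &&
         source[source_offset - left - 1] == target[target_offset - left - 1]) {
    ++left;
  }
  // Extend to the right from the end of the matching block.
  const std::size_t source_end = source_offset + block_size;
  const std::size_t target_end = target_offset + block_size;
  const std::size_t right_limit =
      std::min(source.size() - source_end, target.size() - target_end);
  std::size_t right = 0;
  while (right < right_limit &&
         source[source_end + right] == target[target_end + right]) {
    ++right;
  }
  return {source_offset - left, target_offset - left,
          left + block_size + right};
}
```

In FindBestMatch the target offsets are relative to target_start, which is the first unencoded byte, so the leftward extension can never reach into data that has already been encoded.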
To be continued in the next post.