open-vcdiff (official homepage) is an open-source project from Google that provides an implementation of VCDIFF (RFC 3284). VCDIFF is a delta-compression format: the encoder takes a target file and, using a dictionary file as a reference, produces a delta file; the decoder then reconstructs the target from the dictionary and the delta.
Because only the delta is transmitted, VCDIFF is usually more efficient than general-purpose compression. Building on VCDIFF, Google proposed an HTTP/1.1-compatible content-encoding standard called SDCH, and open-vcdiff, the subject of this article, was created precisely for SDCH. Note that, to better suit HTTP's streaming nature, the content encoding used by SDCH (i.e., by open-vcdiff) deviates slightly from RFC 3284: it adopts the interleaved format and adds an Adler-32 checksum, among other changes.
This article focuses on open-vcdiff's streaming encoding, i.e., how the VCDiffStreamingEncoder class works. If you have no prior exposure to VCDIFF, I suggest spending five minutes on the example in section "3. Delta Instructions" of RFC 3284 to learn what the ADD, COPY, and RUN instructions mean; otherwise the rest of this article may be hard to follow.
Now to the code. This article analyzes version 0.8.3. First, a moment of admiration: opening the source, the name of the legendary Jeff Dean is right there among the authors:
/* rolling_hash.h */
// Copyright 2007, 2008 Google Inc.
// Authors: Jeff Dean, Sanjay Ghemawat, Lincoln Smith
Back to the topic. The official manual gives the following usage for streaming encoding:
// The client should use these routines as follows:
HashedDictionary hd(dictionary, dictionary_size);
if (!hd.Init()) {
  HandleError();
  return;
}
string output_string;
VCDiffStreamingEncoder v(hd, false, false);
if (!v.StartEncoding(&output_string)) {
  HandleError();
  return;  // No need to call FinishEncoding()
}
Process(output_string.data(), output_string.size());
output_string.clear();
while (get data_buf) {
  if (!v.EncodeChunk(data_buf, data_len, &output_string)) {
    HandleError();
    return;  // No need to call FinishEncoding()
  }
  // The encoding is appended to output_string at each call,
  // so clear output_string once its contents have been processed.
  Process(output_string.data(), output_string.size());
  output_string.clear();
}
if (!v.FinishEncoding(&output_string)) {
  HandleError();
  return;
}
Process(output_string.data(), output_string.size());
output_string.clear();
As you can see, the whole encoding process exposes only two classes to the user: HashedDictionary and VCDiffStreamingEncoder. The former is constructed with the dictionary as its argument, after which Init() must be called. The latter is initialized with the former, and the streaming encoding is then driven by three member functions: StartEncoding, EncodeChunk, and FinishEncoding.
To explain how it works internally, here is a UML diagram first:
HashedDictionary aggregates a VCDiffEngine object. HashedDictionary's constructor calls VCDiffEngine's constructor, passing the dictionary pointer and its size; VCDiffEngine allocates fresh memory to hold a copy. The member variable hashed_dictionary_ is left null for now; note that its type is BlockHash. The code:
VCDiffEngine::VCDiffEngine(const char* dictionary, size_t dictionary_size)
    // If dictionary_size == 0, then dictionary could be NULL.  Guard against
    // using a NULL value.
    : dictionary_((dictionary_size > 0) ? new char[dictionary_size] : ""),
      dictionary_size_(dictionary_size),
      hashed_dictionary_(NULL) {
  if (dictionary_size > 0) {
    memcpy(const_cast<char*>(dictionary_), dictionary, dictionary_size);
  }
}
Next, client code calls HashedDictionary's Init() method. As shown in the UML diagram, the call is forwarded to VCDiffEngine's Init() method, which in turn calls the static method BlockHash::CreateDictionaryHash to build a BlockHash object and assign it to hashed_dictionary_. The code:
bool VCDiffEngine::Init() {
  if (hashed_dictionary_) {
    VCD_DFATAL << "Init() called twice for same VCDiffEngine object"
               << VCD_ENDL;
    return false;
  }
  hashed_dictionary_ = BlockHash::CreateDictionaryHash(dictionary_,
                                                       dictionary_size());
  if (!hashed_dictionary_) {
    VCD_DFATAL << "Creation of dictionary hash failed" << VCD_ENDL;
    return false;
  }
  RollingHash<BlockHash::kBlockSize>::Init();
  return true;
}
Pseudocode for BlockHash::CreateDictionaryHash can be found in the code comments of the earlier UML diagram. It is again new followed by Init(); the new step merely sets up a pointer to the dictionary, so let's focus on the Init method. We know that VCDIFF's delta encoding relies heavily on string matching between the target content and the dictionary content. To speed up that matching, the dictionary must be preprocessed and stored in a suitable data structure.
This preprocessing is done by BlockHash::Init(). The function divides the dictionary into 16-byte blocks, computes a hash value for each block, and inserts it into a hash table, resolving collisions by chaining. The hash table is implemented with three array member variables. hash_table_ is the main array: indexed by hash value, it stores a block index, namely the index of the first block with that hash value. The two auxiliary arrays, next_block_table_ and last_block_table_, store, respectively, the index of the next block with the same hash value and the index of the last such block.
An example: block[index a] and block[index b] both hash to h, and they are inserted into the table in that order. When block[index a] is added, the logic is:
hash_table_[h] = a;        // a heads the chain of blocks with hash value h
last_block_table_[a] = a;  // the tail of the chain headed by a is still a
When block[index b] is added, a is already in the table, i.e., hash_table_[h] is already occupied by a as the chain head, so the logic becomes:
next_block_table_[a] = b;  // the node after a is b
last_block_table_[a] = b;  // the chain headed by a now ends at b
Now let's look at the code of BlockHash::Init, which CreateDictionaryHash calls:
bool BlockHash::Init(bool populate_hash_table) {
  if (!hash_table_.empty() ||
      !next_block_table_.empty() ||
      !last_block_table_.empty()) {
    VCD_DFATAL << "Init() called twice for same BlockHash object"
               << VCD_ENDL;
    return false;
  }
  const size_t table_size = CalcTableSize(source_size_);
  if (table_size == 0) {
    VCD_DFATAL << "Error finding table size for source size " << source_size_
               << VCD_ENDL;
    return false;
  }
  // Since table_size is a power of 2, (table_size - 1) is a bit mask
  // containing all the bits below table_size.
  hash_table_mask_ = static_cast<uint32_t>(table_size - 1);
  hash_table_.resize(table_size, -1);
  next_block_table_.resize(GetNumberOfBlocks(), -1);
  last_block_table_.resize(GetNumberOfBlocks(), -1);
  if (populate_hash_table) {
    AddAllBlocks();
  }
  return true;
}
With the groundwork laid earlier, this is easy to explain. CalcTableSize() computes the size of the hash table; note that the number of elements the table must hold is the block count, i.e., source_size_/16. Balancing collision avoidance against memory consumption, open-vcdiff first computes min_size = source_size_/sizeof(int) + 1, which essentially divides by 4, then searches upward from min_size and stops at the first power of two, which becomes the table size. As for hash_table_mask_: remember that the hash value serves as the index into the hash_table_ array? In fact it is not used as the index directly but is first ANDed with hash_table_mask_; see the function GetHashTableIndex for details. GetNumberOfBlocks() does exactly what its name says: it returns source_size_/16. AddAllBlocks() is the step that actually hashes each block and inserts it into the table, calling AddAllBlocksThroughIndex along the way. Here is the code for both:
void BlockHash::AddAllBlocks() {
  AddAllBlocksThroughIndex(static_cast<int>(source_size_));
}

void BlockHash::AddAllBlocksThroughIndex(int end_index) {
  if (end_index > static_cast<int>(source_size_)) {
    VCD_DFATAL << "BlockHash::AddAllBlocksThroughIndex() called"
                  " with index " << end_index
               << " higher than end index " << source_size_ << VCD_ENDL;
    return;
  }
  const int last_index_added = last_block_added_ * kBlockSize;
  if (end_index <= last_index_added) {
    VCD_DFATAL << "BlockHash::AddAllBlocksThroughIndex() called"
                  " with index " << end_index
               << " <= last index added ( " << last_index_added << ")"
               << VCD_ENDL;
    return;
  }
  int end_limit = end_index;
  // Don't allow reading any indices at or past source_size_.
  // The Hash function extends (kBlockSize - 1) bytes past the index,
  // so leave a margin of that size.
  int last_legal_hash_index = static_cast<int>(source_size() - kBlockSize);
  if (end_limit > last_legal_hash_index) {
    end_limit = last_legal_hash_index + 1;
  }
  const char* block_ptr = source_data() + NextIndexToAdd();
  const char* const end_ptr = source_data() + end_limit;
  while (block_ptr < end_ptr) {
    AddBlock(RollingHash<kBlockSize>::Hash(block_ptr));
    block_ptr += kBlockSize;
  }
}
AddAllBlocks simply calls AddAllBlocksThroughIndex; note that the argument is the dictionary's length. AddAllBlocksThroughIndex adds to the hash table, in order, every complete block starting from (last_block_added_ + 1) * kBlockSize up to the index given by its argument; trailing data that does not fill a whole block is simply ignored. At the end of the function, a loop calls AddBlock to insert each block into the hash table, with RollingHash<kBlockSize>::Hash invoked in the argument list to compute the block's hash value. Let's look at these two functions in turn, starting with AddBlock:
void BlockHash::AddBlock(uint32_t hash_value) {
  if (hash_table_.empty()) {
    VCD_DFATAL << "BlockHash::AddBlock() called before BlockHash::Init()"
               << VCD_ENDL;
    return;
  }
  // The initial value of last_block_added_ is -1.
  int block_number = last_block_added_ + 1;
  const int total_blocks =
      static_cast<int>(source_size_ / kBlockSize);  // round down
  if (block_number >= total_blocks) {
    VCD_DFATAL << "BlockHash::AddBlock() called"
                  " with block number " << block_number
               << " that is past last block " << (total_blocks - 1)
               << VCD_ENDL;
    return;
  }
  if (next_block_table_[block_number] != -1) {
    VCD_DFATAL << "Internal error in BlockHash::AddBlock(): "
                  "block number = " << block_number
               << ", next block should be -1 but is "
               << next_block_table_[block_number] << VCD_ENDL;
    return;
  }
  const uint32_t hash_table_index = GetHashTableIndex(hash_value);
  const int first_matching_block = hash_table_[hash_table_index];
  if (first_matching_block < 0) {
    // This is the first entry with this hash value
    hash_table_[hash_table_index] = block_number;
    last_block_table_[block_number] = block_number;
  } else {
    // Add this entry at the end of the chain of matching blocks
    const int last_matching_block = last_block_table_[first_matching_block];
    if (next_block_table_[last_matching_block] != -1) {
      VCD_DFATAL << "Internal error in BlockHash::AddBlock(): "
                    "first matching block = " << first_matching_block
                 << ", last matching block = " << last_matching_block
                 << ", next block should be -1 but is "
                 << next_block_table_[last_matching_block] << VCD_ENDL;
      return;
    }
    next_block_table_[last_matching_block] = block_number;
    last_block_table_[first_matching_block] = block_number;
  }
  last_block_added_ = block_number;
}
This function is exactly the insert-then-chain process described earlier; I won't walk through it further, since the code is fairly clear. Next, let's look at the hash computation function, RollingHash<kBlockSize>::Hash:
// Compute a hash of the window "ptr[0, window_size - 1]".
static uint32_t Hash(const char* ptr) {
  uint32_t h = RollingHashUtil::HashFirstTwoBytes(ptr);
  for (int i = 2; i < window_size; ++i) {
    h = RollingHashUtil::HashStep(h, ptr[i]);
  }
  return h;
}
Here window_size is a template parameter, which in our case is the block size, 16. This is just an incremental rolling-hash computation. Let's look at the two functions involved, HashFirstTwoBytes and HashStep, which in turn involve a few constants and a modulo operation. The code:
// Multiplier for incremental hashing.  The compiler should be smart enough to
// convert (val * kMult) into ((val << 8) + val).
static const uint32_t kMult = 257;

// All hashes are returned modulo "kBase".  Current implementation requires
// kBase <= 2^32/kMult to avoid overflow.  Also, kBase must be a power of two
// so that we can compute modulus efficiently.
static const uint32_t kBase = (1 << 23);

// Returns operand % kBase, assuming that kBase is a power of two.
static inline uint32_t ModBase(uint32_t operand) {
  return operand & (kBase - 1);
}

static inline uint32_t HashFirstTwoBytes(const char* ptr) {
  return (static_cast<unsigned char>(ptr[0]) * kMult) +
         static_cast<unsigned char>(ptr[1]);
}

static inline uint32_t HashStep(uint32_t partial_hash,
                                unsigned char next_byte) {
  return ModBase((partial_hash * kMult) + next_byte);
}
That covers the initialization flow and data-structure design behind HashedDictionary; the next post will continue with the actual encoding process.