接上篇,本篇开始关注StreamingEncoder的实际工作过程,首先是构造函数,代码:
VCDiffStreamingEncoder::VCDiffStreamingEncoder( const HashedDictionary* dictionary, VCDiffFormatExtensionFlags format_extensions, bool look_for_target_matches) : impl_(new VCDiffStreamingEncoderImpl(dictionary, format_extensions, look_for_target_matches)) { }
以上代码,第一个参数是之前构造好的HashedDictionary;第二个参数是表示二进制输出格式的选项,正如前文描述,open-vcdiff为了sdch应用,对rfc3284进行了部分扩充,扩充的部分可以在这个参数里体现;第三个布尔型参数,表示在编码过程中寻找字符串匹配时,是否在target内容自身中寻找匹配的block(如果sdch配合gzip一起使用,第三个参数可以传递false,并不影响最终的压缩效率)。结下来,代码调用VCDiffStreamingEncoderImpl的构造函数初始化指针变量impl_,构造函数的代码如下:
inline VCDiffStreamingEncoderImpl::VCDiffStreamingEncoderImpl( const HashedDictionary* dictionary, VCDiffFormatExtensionFlags format_extensions, bool look_for_target_matches) : engine_(dictionary->engine()), format_extensions_(format_extensions), look_for_target_matches_(look_for_target_matches), encode_chunk_allowed_(false) { if (format_extensions & VCD_FORMAT_JSON) { coder_.reset(new JSONCodeTableWriter()); } else { // This implementation of the encoder uses the default // code table. A VCDiffCodeTableWriter could also be constructed // using a custom code table. coder_.reset(new VCDiffCodeTableWriter( (format_extensions & VCD_FORMAT_INTERLEAVED) != 0)); } }
上述代码初始化通过构造函数初始化列表,初始化一系列内部变量,注意engine_变量的赋值,这个类同样聚集了类VCDiffEngine,参考之前的UML图。在函数正文中对coder_进行reset,coder_是采用std::auto_ptr包装的成员变量,其定义如下:
std::auto_ptr<CodeTableWriterInterface> coder_;
而VCDiffCodeTableWriter是一个底层的二进制输出接口,将编码逻辑的语义输出,转化为二进制表示,以下是部分代码摘录。可以看到,其具体的功能包括添加编码头,编码尾,编码ADD,COPY,RUN指令,增加checksum等,这部分主要涉及二进制数据的拼接,我们不再深入研究。
class VCDiffCodeTableWriter : public CodeTableWriterInterface { public: virtual void WriteHeader(OutputStringInterface* out, VCDiffFormatExtensionFlags format_extensions); // Encode an ADD opcode with the "size" bytes starting at data virtual void Add(const char* data, size_t size); // Encode a COPY opcode with args "offset" (into dictionary) and "size" bytes. virtual void Copy(int32_t offset, size_t size); // Encode a RUN opcode for "size" copies of the value "byte". virtual void Run(size_t size, unsigned char byte); virtual void AddChecksum(VCDChecksum checksum) { add_checksum_ = true; checksum_ = checksum; } virtual void FinishEncoding(OutputStringInterface* /*out*/) {} };
接下来,看下一三个实际编码操作的函数VCDiffStreamingEncoder::StartEncoding,EncodeChunk,以及FinishEncoding。首先是StartEncoding,这个函数主要拼一份VCDiff的头出来。这个函数会调用EncodeChunkToInterface,然后是impl_->StartEncoding,代码一并贴出来:
template<class OutputType> bool StartEncoding(OutputType* output) { OutputString<OutputType> output_string(output); return StartEncodingToInterface(&output_string); } bool VCDiffStreamingEncoder::StartEncodingToInterface( OutputStringInterface* out) { return impl_->StartEncoding(out); } inline bool VCDiffStreamingEncoderImpl::StartEncoding( OutputStringInterface* out) { if (!coder_->Init(engine_->dictionary_size())) { VCD_DFATAL << "Internal error: " "Initialization of code table writer failed" << VCD_ENDL; return false; } coder_->WriteHeader(out, format_extensions_); encode_chunk_allowed_ = true; return true; }
以上代码的逻辑比较简单,层层调用后,最终采用coder_->WriteHeader()函数,这个前文介绍过,就是底层的二进制拼接,不再深究。
然后是EncodeChunk函数,终于开始编码了,函数的2个参数分别是待编码数据的指针与长度。同样的风格,首先调用EncodeChunkToInterface,然后是impl_->EncodeChunk(),代码如下:
template<class OutputType> bool EncodeChunk(const char* data, size_t len, OutputType* output) { OutputString<OutputType> output_string(output); return EncodeChunkToInterface(data, len, &output_string); } bool VCDiffStreamingEncoder::EncodeChunkToInterface( const char* data, size_t len, OutputStringInterface* out) { return impl_->EncodeChunk(data, len, out); } inline bool VCDiffStreamingEncoderImpl::EncodeChunk( const char* data, size_t len, OutputStringInterface* out) { if (!encode_chunk_allowed_) { VCD_ERROR << "EncodeChunk called before StartEncoding" << VCD_ENDL; return false; } if ((format_extensions_ & VCD_FORMAT_CHECKSUM) != 0) { coder_->AddChecksum(ComputeAdler32(data, len)); } engine_->Encode(data, len, look_for_target_matches_, out, coder_.get()); return true; }
在上面的impl_->EncodeChunk中,会check是否设置了VCD_FORMAT_CHECKSUM,如果设置了,需要在二进制中补充checksum,这个特性是open-vcdiff为sdch做的扩展。最后会把数据交给engine_->Encode进行处理,还记得之前的UML图吗,这个engine_与HashedDictionary里聚集的是同一个。我们来看一下VCDiffEngine::Encode的代码实现:
void VCDiffEngine::Encode(const char* target_data, size_t target_size, bool look_for_target_matches, OutputStringInterface* diff, CodeTableWriterInterface* coder) const { if (look_for_target_matches) { EncodeInternal<true>(target_data, target_size, diff, coder); } else { EncodeInternal<false>(target_data, target_size, diff, coder); } }
其中look_for_target_matches即是VCDiffStreamingEncoder构造函数的第三个参数赋的值,其作用之前介绍过了,一般我们设为fasle。我们继续看EncoderInternal函数,先贴部分代码:
template<bool look_for_target_matches> void VCDiffEngine::EncodeInternal(const char* target_data, size_t target_size, OutputStringInterface* diff, CodeTableWriterInterface* coder) const { if (!hashed_dictionary_) { VCD_DFATAL << "Internal error: VCDiffEngine::Encode() " "called before VCDiffEngine::Init()" << VCD_ENDL; return; } if (target_size == 0) { return; // Do nothing for empty target } // Special case for really small input if (target_size < static_cast<size_t>(BlockHash::kBlockSize)) { AddUnmatchedRemainder(target_data, target_size, coder); FinishEncoding(target_size, diff, coder); return; } RollingHash<BlockHash::kBlockSize> hasher; BlockHash* target_hash = NULL; if (look_for_target_matches) { // Check matches against previously encoded target data // in this same target window, as well as against the dictionary target_hash = BlockHash::CreateTargetHash(target_data, target_size, dictionary_size()); if (!target_hash) { VCD_DFATAL << "Instantiation of target hash failed" << VCD_ENDL; return; } } ......
首先如果target数据的长度,不够一个block,直接AddUnmatchedRemainder,我们看一下这个函数的代码:
inline void VCDiffEngine::AddUnmatchedRemainder( const char* unencoded_target_start, size_t unencoded_target_size, CodeTableWriterInterface* coder) const { if (unencoded_target_size > 0) { coder->Add(unencoded_target_start, unencoded_target_size); } }
这里直接采用VCDIFF的ADD指令进行了编码,从增量压缩的角度,相当于没有进行压缩,直接原文拷贝。所以说,进行流式压缩时,要避免这种每次传递给编码器数据过短的情况。然后FinishEncoding,把编码的数据写入到输出接口,代码如下:
inline void VCDiffEngine::FinishEncoding( size_t target_size, OutputStringInterface* diff, CodeTableWriterInterface* coder) const { if (target_size != static_cast<size_t>(coder->target_length())) { VCD_DFATAL << "Internal error in VCDiffEngine::Encode: " "original target size (" << target_size << ") does not match number of bytes processed (" << coder->target_length() << ")" << VCD_ENDL; } coder->Output(diff); }
接着,我们回到VCDiffEngine::EncodeInternal的代码继续分析,其中if(look_for_target_matchs)里的部分,会对参数传递来的target数据像之前的dict数据一样,分block建立哈希表,供后续匹配用,我们先忽略这段逻辑,看下面的字符串匹配编码过程。
介绍下面的代码前,首先阐述一下原理。对于target数据,采用block size(16bytes)为窗口,从起始字节开始采用之前对dict分段hash相同的算法计算该窗口内数据的哈希值。采用该哈希值检索dict哈希表,看该16个bytes是否在dict中能找到一个完全相同的block。如果没有找到,窗口向后移动一个byte,重新计算哈希值,继续在哈希表中寻找。
如果找到一个完全相同的block,则check该block在dict中的上下文,以及当前窗口覆盖的target数据的上下文,即把匹配的16byte数据分别向前、向后扩展,匹配尽可能多的字符。找到这段尽可能长的match bytes后,并不会直接编码为COPY,因为当前窗口覆盖的16个bytes,或许可以匹配到另外一个block,从这个block进行上下文扩展匹配,或许可以做到更长的match bytes。因此,这里会继续向后寻找可以匹配的block,代码实现上,沿着同哈希那个拉链表向后寻找即可。如果能找到新的匹配的block,同样检查它的上下文,进行扩展。重复这个过程,直至找到最长的一段match,编码为COPY。再窗口移动过程中被skip的byte,编码为ADD。
未完,请参考下一篇。