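Protobuf Encoding Principles

Goal

This article describes how protobuf encodes data, including varint encoding, and walks through the source code that lets protobuf combine compact output with high performance. The protobuf version used here is 3.4.0.

Serialization and deserialization are common operations in protobuf, whether the data is stored or sent over the network. Protobuf provides a set of methods for both, for example: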
byte[] toByteArray();

Define the proto:

syntax = "proto3";

package com.simple;

option java_package = "com.simple";
option java_outer_classname = "PersonMsg";

message Person {
    int32 age = 1;
}

Build a Person object, set age to 400, and serialize it:

    @Test
    public void testSerialize() throws IOException {
        Person.Builder builder = Person.newBuilder();
        builder.setAge(400);
        Person person = builder.build();

        byte[] byteArray = person.toByteArray();

        // Write the serialized bytes to a file so they can be inspected in a hex viewer.
        try (FileOutputStream outstream = new FileOutputStream(new File("Person.txt"))) {
            outstream.write(byteArray);
        }
    }

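Open Person.txt in a hex viewer: 08 90 03
Compared with XML or JSON, the serialized protobuf data is very compact: the whole message takes only three bytes (one for the key and two for the value). How does protobuf achieve this?

The next section introduces varint, the encoding scheme used for small integers.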

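Varint Encoding

Because of I/O cost during transmission, we want the data to be as small as possible.
Varint is a scheme for encoding integers with a variable number of bytes: the smaller the value, the fewer bytes it takes. A 32-bit value takes 1 to 5 bytes after encoding, and a 64-bit value takes up to 10.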

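Encoding rules:

1. The most significant bit of each byte is a continuation flag. If it is 1, more bytes follow; if it is 0, this is the last byte. The low 7 bits of each byte carry the value itself.
2. Protobuf writes the 7-bit groups in little-endian order, least significant group first (little-endian explanation: link)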

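Using the example above, 400 in binary is 00000001 10010000.

Steps:

1. Split the value into 7-bit groups, dropping the unused leading bits:
0000011 0010000
2. Because protobuf uses little-endian order, write the lowest group first. Set the high bit of the last byte to 0 and the high bit of every other byte to 1:
10010000 00000011

The age field in the opening example was set to 400, so its varint encoding is 10010000 00000011, that is, 0x90 0x03.
The corresponding protobuf source is: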

    final void bufferUInt32NoTag(int value) {
      if (HAS_UNSAFE_ARRAY_OPERATIONS) {
        final long originalPos = position;
        while (true) {
          if ((value & ~0x7F) == 0) {
            // Last byte: the remaining bits fit in 7 bits, so the high bit stays 0.
            UnsafeUtil.putByte(buffer, position++, (byte) value);
            break;
          } else {
            // Emit the low 7 bits with the high bit set to 1, then shift them out.
            UnsafeUtil.putByte(buffer, position++, (byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
          }
        }
        int delta = (int) (position - originalPos);
        totalBytesWritten += delta;
      } else {
        while (true) {
          if ((value & ~0x7F) == 0) {
            buffer[position++] = (byte) value;
            totalBytesWritten++;
            return;
          } else {
            buffer[position++] = (byte) ((value & 0x7F) | 0x80);
            totalBytesWritten++;
            value >>>= 7;
          }
        }
      }
    }
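
To check the worked example by hand, here is a minimal standalone sketch of the same loop together with a matching decoder. The class and method names (VarintSketch, encodeVarint32, decodeVarint32, toHex) are made up for illustration and are not part of the protobuf API; the decoder assumes a well-formed varint of at most 5 bytes starting at index 0.

```java
import java.io.ByteArrayOutputStream;

final class VarintSketch {
    // Encodes value as an unsigned 32-bit varint: 7 payload bits per byte, lowest group first.
    static byte[] encodeVarint32(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more bytes follow: set the continuation bit
            value >>>= 7;
        }
        out.write(value); // last byte: continuation bit is 0
        return out.toByteArray();
    }

    // Decodes a varint of at most 5 bytes that starts at index 0.
    static int decodeVarint32(byte[] bytes) {
        int result = 0;
        for (int i = 0; i < bytes.length && i < 5; i++) {
            result |= (bytes[i] & 0x7F) << (7 * i); // place each 7-bit group at its position
            if ((bytes[i] & 0x80) == 0) {
                return result; // continuation bit clear: this was the last byte
            }
        }
        throw new IllegalArgumentException("malformed or truncated varint");
    }

    // Formats bytes as space-separated uppercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeVarint32(400);
        System.out.println(toHex(encoded));          // prints "90 03"
        System.out.println(decodeVarint32(encoded)); // prints "400"
    }
}
```

The decoder simply reverses the two steps listed earlier: it strips the continuation bit from each byte and reassembles the 7-bit groups starting from the lowest one.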

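Drawback of varint:

A negative number takes 10 bytes (its most significant bit is 1, so it is treated as a very large unsigned integer).
For example:
In the example above, set age to -1 and inspect Person.txt (shown in hexadecimal):

08 FF FF FF FF FF FF FF FF FF 01

Apart from the one byte taken by key=0x08, value=-1 takes 10 bytes.
The source code (CodedOutputStream):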

   /**
     * This method does not perform bounds checking on the array. Checking array bounds is the
     * responsibility of the caller.
     */
    final void bufferInt32NoTag(final int value) {
      if (value >= 0) {
        bufferUInt32NoTag(value);
      } else {
        // Must sign-extend.
        bufferUInt64NoTag(value);
      }
    }

    final void bufferUInt64NoTag(long value) { // note that the parameter is a long
      if (HAS_UNSAFE_ARRAY_OPERATIONS) {
        final long originalPos = position;
        while (true) {
          if ((value & ~0x7FL) == 0) {
            UnsafeUtil.putByte(buffer, position++, (byte) value);
            break;
          } else {
            UnsafeUtil.putByte(buffer, position++, (byte) (((int) value & 0x7F) | 0x80));
            value >>>= 7;
          }
        }
        int delta = (int) (position - originalPos);
        totalBytesWritten += delta;
         ........

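The source shows that a negative int32 is sign-extended to a long before it is varint encoded, which is why it takes 10 bytes.
The fix is to declare fields that may hold negative numbers as sint32/sint64. These types first apply ZigZag encoding, which maps positive numbers, negative numbers, and 0 to unsigned numbers, and then apply varint encoding.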
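
As a rough check of the 10-byte figure, the sketch below counts how many varint bytes a value needs once it has been widened to a long, which is what the int32 path above does for negative values. NegativeVarintSketch, varintSize, and int32Size are hypothetical names used only for this illustration.

```java
final class NegativeVarintSketch {
    // Number of bytes a 64-bit value occupies as a varint (1 to 10).
    static int varintSize(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            size++;
            value >>>= 7; // unsigned shift, so the loop always terminates
        }
        return size;
    }

    // Size of an int32 value on the wire: the cast sign-extends negative values to 64 bits.
    static int int32Size(int value) {
        return varintSize((long) value);
    }

    public static void main(String[] args) {
        System.out.println(int32Size(400));               // prints "2"
        System.out.println(int32Size(-1));                // prints "10"
        System.out.println(int32Size(Integer.MIN_VALUE)); // prints "10"
    }
}
```

Every negative int32 sets all of the upper 32 bits after sign extension, so the encoder needs ten 7-bit groups to cover 64 bits no matter how small the absolute value is.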

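ZigZag Encoding

ZigZag is an encoding scheme that maps signed numbers to unsigned numbers so that values with a small absolute value get a small encoding. For example, 0, -1, 1, -2, 2 map to the unsigned numbers 0, 1, 2, 3, 4.

Original value	Encoded value
0	0
-1	1
1	2
-2	3
2	4

The corresponding source code (CodedOutputStream):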

  /**
   * Encode a ZigZag-encoded 32-bit value.  ZigZag encodes signed integers
   * into values that can be efficiently encoded with varint.  (Otherwise,
   * negative values must be sign-extended to 64 bits to be varint encoded,
   * thus always taking 10 bytes on the wire.)
   *
   * @param n A signed 32-bit integer.
   * @return An unsigned 32-bit integer, stored in a signed int because
   *         Java has no explicit unsigned support.
   */
  public static int encodeZigZag32(final int n) {
    // Note:  the right-shift must be arithmetic
    return (n << 1) ^ (n >> 31);
  }

  /**
   * Encode a ZigZag-encoded 64-bit value.  ZigZag encodes signed integers
   * into values that can be efficiently encoded with varint.  (Otherwise,
   * negative values must be sign-extended to 64 bits to be varint encoded,
   * thus always taking 10 bytes on the wire.)
   *
   * @param n A signed 64-bit integer.
   * @return An unsigned 64-bit integer, stored in a signed int because
   *         Java has no explicit unsigned support.
   */
  public static long encodeZigZag64(final long n) {
    // Note:  the right-shift must be arithmetic
    return (n << 1) ^ (n >> 63);
  }

Decoding:

/**
   * Decode a ZigZag-encoded 32-bit value. ZigZag encodes signed integers into values that can be
   * efficiently encoded with varint. (Otherwise, negative values must be sign-extended to 64 bits
   * to be varint encoded, thus always taking 10 bytes on the wire.)
   *
   * @param n An unsigned 32-bit integer, stored in a signed int because Java has no explicit
   *     unsigned support.
   * @return A signed 32-bit integer.
   */
  public static int decodeZigZag32(final int n) {
    return (n >>> 1) ^ -(n & 1);
  }

  /**
   * Decode a ZigZag-encoded 64-bit value. ZigZag encodes signed integers into values that can be
   * efficiently encoded with varint. (Otherwise, negative values must be sign-extended to 64 bits
   * to be varint encoded, thus always taking 10 bytes on the wire.)
   *
   * @param n An unsigned 64-bit integer, stored in a signed int because Java has no explicit
   *     unsigned support.
   * @return A signed 64-bit integer.
   */
  public static long decodeZigZag64(final long n) {
    return (n >>> 1) ^ -(n & 1);
  }
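
The following sketch copies the two 32-bit formulas into a standalone class (ZigZagSketch is a made-up name) and prints the mapping for a few values, so the table can be checked and the round trip confirmed. Integer.toUnsignedString is used only to display the encoded int as an unsigned number.

```java
final class ZigZagSketch {
    // Same expression as encodeZigZag32 above: the arithmetic shift yields 0 or all ones.
    static int encodeZigZag32(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Same expression as decodeZigZag32 above.
    static int decodeZigZag32(int n) {
        return (n >>> 1) ^ -(n & 1);
    }

    public static void main(String[] args) {
        int[] samples = {0, -1, 1, -2, 2, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int n : samples) {
            int encoded = encodeZigZag32(n);
            System.out.println(n + " -> " + Integer.toUnsignedString(encoded)
                    + " -> " + decodeZigZag32(encoded));
        }
    }
}
```

Non-negative n maps to 2n and negative n maps to -2n - 1, so the sign ends up in the lowest bit and small magnitudes stay small after encoding.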

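Now change the proto file from the example above so that age is declared as sint32, set age to -1 in the test case, and inspect Person.txt:

08 01

Apart from key=08, -1 is stored in a single byte. Compared with the 10 bytes needed earlier, this saves a lot of space.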
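
The bytes above can be reproduced without the protobuf library. The sketch below is an assumption-laden illustration, not protobuf's own code path: Sint32FieldSketch and its methods are hypothetical names, and the key is built as the field number shifted left by 3 with wire type 0.

```java
import java.io.ByteArrayOutputStream;

final class Sint32FieldSketch {
    static int encodeZigZag32(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Appends value as an unsigned 32-bit varint.
    static void writeVarint32(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Serializes one sint32 field: the key, then the ZigZag-mapped value as a varint.
    static byte[] encodeSint32Field(int fieldNumber, int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint32(out, fieldNumber << 3); // wire type 0 (varint) in the low 3 bits
        writeVarint32(out, encodeZigZag32(value));
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encodeSint32Field(1, -1)) {
            System.out.printf("%02X ", b & 0xFF); // prints "08 01 "
        }
        System.out.println();
    }
}
```

With the same helper, age = -200 becomes 08 8F 03, because ZigZag maps -200 to 399, which then needs two varint bytes.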

Data Format

Protobuf uses a [key, value] data format. The key is (field_number << 3) | wire_type, where field_number is the field number assigned in the proto file.
The wire_type values in protobuf are:

Type	Meaning	Used For
0	Varint	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages, packed repeated fields
3	Start Group	groups (deprecated)
4	End Group	groups (deprecated)
5	32-bit	fixed32, sfixed32, float

The key is computed as follows:

  /** Makes a tag value given a field number and wire type. */
  static int makeTag(final int fieldNumber, final int wireType) {
    return (fieldNumber << TAG_TYPE_BITS) | wireType;
  }

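TAG_TYPE_BITS is 3, so the low 3 bits hold the wire_type and the higher bits hold the field_number.
Look again at the age=400 example: age is declared as int32, so wire_type = 0 and key = (1 << 3 | 0) = 0x08, and value = 0x90 0x03. Serialization simply concatenates the key and the value, so the encoded bytes are 0x08 0x90 0x03.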
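
Reading the bytes back applies the same layout in reverse. The sketch below splits a key into field number and wire type and then decodes the varint value. TagSketch, fieldNumber, wireType, and parseSingleVarintField are hypothetical helpers written for this article; the parser assumes the key fits in one byte (field numbers 1 to 15) and that the message holds exactly one varint field whose value fits in 32 bits.

```java
final class TagSketch {
    static final int TAG_TYPE_BITS = 3;
    static final int TAG_TYPE_MASK = (1 << TAG_TYPE_BITS) - 1;

    static int makeTag(int fieldNumber, int wireType) {
        return (fieldNumber << TAG_TYPE_BITS) | wireType;
    }

    static int fieldNumber(int tag) {
        return tag >>> TAG_TYPE_BITS; // upper bits
    }

    static int wireType(int tag) {
        return tag & TAG_TYPE_MASK; // low 3 bits
    }

    // Returns {fieldNumber, wireType, value} for a message holding one varint field.
    static int[] parseSingleVarintField(byte[] message) {
        int tag = message[0] & 0xFF; // assumption: single-byte key
        int value = 0;
        int shift = 0;
        for (int i = 1; i < message.length && shift < 32; i++) {
            value |= (message[i] & 0x7F) << shift;
            if ((message[i] & 0x80) == 0) {
                break; // continuation bit clear: last byte of the value
            }
            shift += 7;
        }
        return new int[] {fieldNumber(tag), wireType(tag), value};
    }

    public static void main(String[] args) {
        System.out.println(String.format("0x%02X", makeTag(1, 0))); // prints "0x08"
        int[] f = parseSingleVarintField(new byte[] {0x08, (byte) 0x90, 0x03});
        // prints "field=1 wireType=0 value=400"
        System.out.println("field=" + f[0] + " wireType=" + f[1] + " value=" + f[2]);
    }
}
```

The key is itself written as a varint, so a field number of 16 or higher makes it longer than one byte; a real parser would decode the key with the same varint loop instead of reading a single byte.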