Protobuf Encoding Principles

Goal

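This article describes how protobuf encodes data, including varint encoding, and walks through the source code that lets protobuf balance compact output with high performance. The protobuf version used here is 3.4.0.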

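Serialization and deserialization are common operations in protobuf, whether the data is stored or sent over the network. Protobuf provides a set of methods for both, for example: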
byte[] toByteArray();

Define the proto file:

syntax = "proto3";

package com.simple;

option java_package = "com.simple";
option java_outer_classname = "PersonMsg";

message Person {
    int32 age = 1;
}

Build a Person object, set age to 400, and serialize it:

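    // Requires java.io.File, java.io.FileOutputStream, java.io.IOException,
    // com.simple.PersonMsg.Person, and the @Test annotation of your JUnit version.
    @Test
    public void testSerialize() throws IOException {
        Person.Builder builder = Person.newBuilder();
        builder.setAge(400);
        Person person = builder.build();

        byte[] byteArray = person.toByteArray();

        // try-with-resources closes the stream even if write() throws
        try (FileOutputStream outstream = new FileOutputStream(new File("Person.txt"))) {
            outstream.write(byteArray);
        }
    }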

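Open Person.txt in a hex viewer: 08 90 03
Compared with XML or JSON, the serialized protobuf data is very compact: the whole message is three bytes, and the value 400 itself takes only two. How does protobuf achieve this?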


The sections below start with varint, the encoding scheme that keeps small integers small.

Varint Encoding

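To reduce I/O during transmission, we want the data to be as small as possible.
Varint is a variable-length encoding for integers: the smaller the value, the fewer bytes it uses. A 32-bit value encodes to 1 to 5 bytes, and a 64-bit value to 1 to 10 bytes.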

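Encoding rules:

  • 1. The most significant bit of each byte is a continuation flag: 1 means more bytes follow, 0 marks the last byte. The remaining 7 bits of each byte carry the payload, taken from the number's binary representation.
  • 2. Protobuf writes the 7-bit groups in little-endian order, that is, the least significant group first.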

In the example above, 400 in binary is 00000001 10010000.

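Steps:

  • 1. Split the value into 7-bit groups, starting from the low end, and drop the unused high bits:
    0000011 0010000
  • 2. Because protobuf uses little-endian order, write the lowest group first. Set the high bit of the last byte to 0 and the high bit of every other byte to 1:
    10010000 00000011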

The opening example set age to 400, so the varint encoding is 10010000 00000011, that is, 0x90 0x03.
The corresponding protobuf source:

 final void bufferUInt32NoTag(int value) {
      if (HAS_UNSAFE_ARRAY_OPERATIONS) {
        final long originalPos = position;
        while (true) {
          if ((value & ~0x7F) == 0) {
            // Last group: the high bit stays 0
            UnsafeUtil.putByte(buffer, position++, (byte) value);
            break;
          } else {
            // Take the low 7 bits and set the high bit to 1
            UnsafeUtil.putByte(buffer, position++, (byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
          }
        }
        int delta = (int) (position - originalPos);
        totalBytesWritten += delta;
      } else {
        while (true) {
          if ((value & ~0x7F) == 0) {
            buffer[position++] = (byte) value;
            totalBytesWritten++;
            return;
          } else {
            buffer[position++] = (byte) ((value & 0x7F) | 0x80);
            totalBytesWritten++;
            value >>>= 7;
          }
        }
      }
    }
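
The loop above can be tried out without the protobuf runtime. The following is a minimal sketch, not the library's code: the class VarintSketch and its methods encodeVarint32, decodeVarint32, and toHex are hypothetical names introduced here, and a ByteArrayOutputStream stands in for protobuf's internal buffer.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class for illustration; not part of protobuf.
final class VarintSketch {
    /** Encodes value as an unsigned 32-bit varint, lowest 7-bit group first. */
    static byte[] encodeVarint32(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more groups follow: set the high bit
            value >>>= 7;
        }
        out.write(value); // last group: the high bit is 0
        return out.toByteArray();
    }

    /** Decodes a 32-bit varint (at most 5 bytes) that starts at index 0. */
    static int decodeVarint32(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            if (shift > 28) {
                throw new IllegalArgumentException("varint longer than 5 bytes");
            }
            result |= (b & 0x7F) << shift; // place this group above the previous ones
            if ((b & 0x80) == 0) {
                return result; // continuation bit clear: this was the last byte
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated varint");
    }

    /** Formats bytes as space-separated uppercase hex. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeVarint32(400);
        System.out.println(toHex(encoded));          // 90 03
        System.out.println(decodeVarint32(encoded)); // 400
    }
}
```

Encoding 400 yields the same two bytes, 0x90 0x03, that appear after the key in Person.txt.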

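Drawback of varint:

A negative int32 or int64 takes 10 bytes, because its sign bit is 1 and it is therefore treated as a very large unsigned integer.
For example, set age to -1 in the example above and inspect Person.txt in hexadecimal: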

08 FF FF FF FF FF FF FF FF FF 01

Apart from the one-byte key 0x08, the value -1 takes 10 bytes.
The source code (CodedOutputStream) shows why:

   /**
     * This method does not perform bounds checking on the array. Checking array bounds is the
     * responsibility of the caller.
     */
    final void bufferInt32NoTag(final int value) {
      if (value >= 0) {
        bufferUInt32NoTag(value);
      } else {
        // Must sign-extend.
        bufferUInt64NoTag(value);
      }
    }
    final void bufferUInt64NoTag(long value) { // note: the parameter is a long
      if (HAS_UNSAFE_ARRAY_OPERATIONS) {
        final long originalPos = position;
        while (true) {
          if ((value & ~0x7FL) == 0) {
            UnsafeUtil.putByte(buffer, position++, (byte) value);
            break;
          } else {
            UnsafeUtil.putByte(buffer, position++, (byte) (((int) value & 0x7F) | 0x80));
            value >>>= 7;
          }
        }
        int delta = (int) (position - originalPos);
        totalBytesWritten += delta;
         ........

As the source shows, a negative int is sign-extended to a long before varint encoding, which is why it takes 10 bytes.
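
To make the 10-byte cost concrete, here is a small sketch that counts varint bytes after the same int-to-long widening. SignExtensionSketch, varint64Size, and int32Size are hypothetical names used only for illustration; the counting loop follows the same shift-by-7 pattern as bufferUInt64NoTag but is not the library's code.

```java
// Hypothetical helper class for illustration; not part of protobuf.
final class SignExtensionSketch {
    /** Number of bytes a 64-bit value occupies as a varint. */
    static int varint64Size(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            size++;
            value >>>= 7; // unsigned shift, so the loop terminates for negative input
        }
        return size;
    }

    /** Size of an int32 field value: the int is widened to long, which sign-extends it. */
    static int int32Size(int value) {
        return varint64Size(value);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString((long) -1)); // ffffffffffffffff
        System.out.println(int32Size(-1));               // 10
        System.out.println(int32Size(400));              // 2
    }
}
```

After widening, -1 has all 64 bits set. At 7 payload bits per byte, 64 bits need 10 bytes: nine full groups (0xFF each) plus one final bit (0x01).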
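The fix is to declare fields that often hold negative values as sint32/sint64. These types first apply ZigZag encoding, which maps positive numbers, negative numbers, and 0 to unsigned numbers, and then apply varint encoding.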


ZigZag Encoding

ZigZag maps signed integers to unsigned integers so that values with a small magnitude get small codes. For example, 0, -1, 1, -2, 2 map to 0, 1, 2, 3, 4.

Original  Encoded
0         0
-1        1
1         2
-2        3
2         4

The corresponding source (CodedOutputStream):

  /**
   * Encode a ZigZag-encoded 32-bit value.  ZigZag encodes signed integers
   * into values that can be efficiently encoded with varint.  (Otherwise,
   * negative values must be sign-extended to 64 bits to be varint encoded,
   * thus always taking 10 bytes on the wire.)
   *
   * @param n A signed 32-bit integer.
   * @return An unsigned 32-bit integer, stored in a signed int because
   *         Java has no explicit unsigned support.
   */
  public static int encodeZigZag32(final int n) {
    // Note:  the right-shift must be arithmetic
    return (n << 1) ^ (n >> 31);
  }

  /**
   * Encode a ZigZag-encoded 64-bit value.  ZigZag encodes signed integers
   * into values that can be efficiently encoded with varint.  (Otherwise,
   * negative values must be sign-extended to 64 bits to be varint encoded,
   * thus always taking 10 bytes on the wire.)
   *
   * @param n A signed 64-bit integer.
   * @return An unsigned 64-bit integer, stored in a signed int because
   *         Java has no explicit unsigned support.
   */
  public static long encodeZigZag64(final long n) {
    // Note:  the right-shift must be arithmetic
    return (n << 1) ^ (n >> 63);
  }

Decoding (CodedInputStream):

/**
   * Decode a ZigZag-encoded 32-bit value. ZigZag encodes signed integers into values that can be
   * efficiently encoded with varint. (Otherwise, negative values must be sign-extended to 64 bits
   * to be varint encoded, thus always taking 10 bytes on the wire.)
   *
   * @param n An unsigned 32-bit integer, stored in a signed int because Java has no explicit
   *     unsigned support.
   * @return A signed 32-bit integer.
   */
  public static int decodeZigZag32(final int n) {
    return (n >>> 1) ^ -(n & 1);
  }

  /**
   * Decode a ZigZag-encoded 64-bit value. ZigZag encodes signed integers into values that can be
   * efficiently encoded with varint. (Otherwise, negative values must be sign-extended to 64 bits
   * to be varint encoded, thus always taking 10 bytes on the wire.)
   *
   * @param n An unsigned 64-bit integer, stored in a signed int because Java has no explicit
   *     unsigned support.
   * @return A signed 64-bit integer.
   */
  public static long decodeZigZag64(final long n) {
    return (n >>> 1) ^ -(n & 1);
  }
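
The two 32-bit formulas can be checked with a quick round trip. This sketch copies the expressions from encodeZigZag32 and decodeZigZag32 into a hypothetical class ZigZagSketch so that it runs without the protobuf jar; the sample values are chosen here for illustration.

```java
// Hypothetical helper class for illustration; the formulas match those quoted above.
final class ZigZagSketch {
    static int encode32(int n) {
        // n >> 31 is an arithmetic shift: 0 for non-negative n, all ones for negative n
        return (n << 1) ^ (n >> 31);
    }

    static int decode32(int n) {
        // -(n & 1) is 0 for even codes (non-negative values), all ones for odd codes
        return (n >>> 1) ^ -(n & 1);
    }

    public static void main(String[] args) {
        int[] samples = {0, -1, 1, -2, 2, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int n : samples) {
            int z = encode32(n);
            // Print the code as unsigned, since Java ints are signed
            System.out.printf("%d -> %s -> %d%n", n, Integer.toUnsignedString(z), decode32(z));
        }
    }
}
```

Non-negative values land on even codes and negative values on odd codes, so a small magnitude always gives a small unsigned number and therefore a short varint.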

Change the proto file above so that age is declared as sint32, set age to -1 in the test, and inspect Person.txt:

08 01

Apart from the key 0x08, -1 now takes a single byte. Compared with the 10 bytes above, this saves considerable space.
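
The bytes 08 01 can be reproduced by chaining the pieces by hand: key, then ZigZag, then varint. The sketch below assumes a single sint32 field; Sint32FieldSketch, writeVarint32, and encodeSint32Field are hypothetical names and do not reflect how the generated Person class is implemented.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class for illustration; not part of protobuf.
final class Sint32FieldSketch {
    /** Appends value to out as an unsigned 32-bit varint. */
    static void writeVarint32(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes one sint32 field as key followed by ZigZag-then-varint value. */
    static byte[] encodeSint32Field(int fieldNumber, int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int wireType = 0;                                  // varint
        writeVarint32(out, (fieldNumber << 3) | wireType); // key
        writeVarint32(out, (value << 1) ^ (value >> 31));  // ZigZag, then varint
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encodeSint32Field(1, -1)) {
            System.out.printf("%02X ", b); // prints: 08 01
        }
        System.out.println();
    }
}
```

The trade-off is that non-negative values cost slightly more under sint32: 400 becomes ZigZag code 800, which still fits in two varint bytes, but values from 64 to 127 need two bytes instead of one.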


Data Format

Protobuf stores each field as a [key, value] pair. The key is (field_number << 3) | wire_type, where field_number is the number assigned to the field in the proto file.
The possible wire_type values are:

Type  Meaning           Used For
0     Varint            int32, int64, uint32, uint64, sint32, sint64, bool, enum
1     64-bit            fixed64, sfixed64, double
2     Length-delimited  string, bytes, embedded messages, packed repeated fields
3     Start group       groups (deprecated)
4     End group         groups (deprecated)
5     32-bit            fixed32, sfixed32, float

How the key is computed:

  /** Makes a tag value given a field number and wire type. */
  static int makeTag(final int fieldNumber, final int wireType) {
    return (fieldNumber << TAG_TYPE_BITS) | wireType;
  }

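TAG_TYPE_BITS is 3, so the low 3 bits hold wire_type and the higher bits hold field_number.
Back to the age=400 example: age is declared as int32, so wire_type = 0 and key = (1 << 3) | 0 = 0x08, while value = 0x90 0x03. Serialization simply concatenates key and value, so the encoded message is 0x08 0x90 0x03.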
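
Reading the message back is the reverse: split the key into field number and wire type, then decode the varint. The sketch below is a simplified reader under two stated assumptions, namely that the message holds exactly one varint field and that the key fits in one byte (field number below 16). TagSketch and its methods are hypothetical names, not protobuf's parser.

```java
// Hypothetical helper class for illustration; not part of protobuf.
final class TagSketch {
    static final int TAG_TYPE_BITS = 3;
    static final int TAG_TYPE_MASK = (1 << TAG_TYPE_BITS) - 1;

    static int makeTag(int fieldNumber, int wireType) {
        return (fieldNumber << TAG_TYPE_BITS) | wireType;
    }

    static int fieldNumberOf(int tag) {
        return tag >>> TAG_TYPE_BITS; // high bits
    }

    static int wireTypeOf(int tag) {
        return tag & TAG_TYPE_MASK;   // low 3 bits
    }

    /** Returns {fieldNumber, wireType, value} for a message holding one varint field. */
    static int[] parseSingleVarintField(byte[] message) {
        int pos = 0;
        int tag = message[pos++] & 0xFF; // assumption: the key fits in one byte
        int value = 0;
        int shift = 0;
        while (true) {
            int b = message[pos++] & 0xFF;
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // continuation bit clear: last byte of the varint
            }
            shift += 7;
        }
        return new int[] {fieldNumberOf(tag), wireTypeOf(tag), value};
    }

    public static void main(String[] args) {
        byte[] message = {0x08, (byte) 0x90, 0x03};
        int[] parsed = parseSingleVarintField(message);
        System.out.printf("field=%d wireType=%d value=%d%n", parsed[0], parsed[1], parsed[2]);
        // prints: field=1 wireType=0 value=400
    }
}
```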
