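Goal
This article describes how protobuf encodes data, including varint encoding, and walks through the source code that lets protobuf combine compact output with high performance. The protobuf version used here is 3.4.0.
Serialization and deserialization are common operations in protobuf, whether the data is stored or sent over the network. protobuf provides a set of methods for both, for example: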
byte[] toByteArray();
Define the proto:
syntax ="proto3";
package com.simple;
option java_package="com.simple";
option java_outer_classname="PersonMsg";
message Person{
int32 age= 1;
}
Construct a Person object, set age to 400, and serialize it:
@Test
public void testSerilize() throws IOException {
Person.Builder builder = Person.newBuilder();
builder.setAge(400);
Person person =builder.build();
byte[] byteArray = person.toByteArray();
FileOutputStream outstream = new FileOutputStream(new File("Person.txt"));
outstream.write(byteArray);
outstream.close();
}
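Open Person.txt in a hex viewer: 08 90 03
Compared with XML or JSON, the serialized protobuf data is very compact: three bytes in total, of which the value takes only two. How does protobuf manage this?
The next section introduces varint, the encoding scheme that keeps small integers short.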
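Varint encoding
To reduce I/O during transmission, we want the data to be as compact as possible.
varint is a scheme for encoding integers with a variable number of bytes: the smaller the value, the fewer bytes it uses. A 32-bit value encodes to 1 to 5 bytes, and a 64-bit value to at most 10.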
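Encoding rules:
- 1. The most significant bit of each byte is a continuation flag: 1 means more bytes follow, 0 means this is the last byte. The remaining 7 bits of each byte carry bits of the number.
- 2. protobuf writes the 7-bit groups in little-endian order, that is, the least significant group comes first.
In the example above, 400 in binary is 00000001 10010000.
Steps:
- 1. Regroup the bits into 7-bit groups starting from the low end (the unused high bits are dropped):
0000011 0010000
- 2. Because protobuf uses little-endian order, write the lowest group first. Set the high bit of the last byte to 0 and the high bit of every other byte to 1:
10010000 00000011
In the opening example age was set to 400, so the varint encoding is 10010000 00000011, i.e. 0x90 0x03.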
The corresponding protobuf source:
final void bufferUInt32NoTag(int value) {
if (HAS_UNSAFE_ARRAY_OPERATIONS) {
final long originalPos = position;
while (true) {
if ((value & ~0x7F) == 0) {
//Last group: write it with the high bit left as 0
UnsafeUtil.putByte(buffer, position++, (byte) value);
break;
} else {
//Take the low 7 bits and set the high bit to 1
UnsafeUtil.putByte(buffer, position++, (byte) ((value & 0x7F) | 0x80));
value >>>= 7;
}
}
int delta = (int) (position - originalPos);
totalBytesWritten += delta;
} else {
while (true) {
if ((value & ~0x7F) == 0) {
buffer[position++] = (byte) value;
totalBytesWritten++;
return;
} else {
buffer[position++] = (byte) ((value & 0x7F) | 0x80);
totalBytesWritten++;
value >>>= 7;
}
}
}
}
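The loop above can be tried outside the library. The following is a minimal standalone sketch, not protobuf's implementation: the class `VarintSketch` and its two methods are names made up for illustration, and the decoder is an assumption about the reverse process that the source excerpt does not show.

```java
import java.io.ByteArrayOutputStream;

final class VarintSketch {
    /** Encodes value as an unsigned 32-bit varint, least significant 7-bit group first. */
    static byte[] encodeVarint32(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            // More groups follow: emit the low 7 bits with the high bit set to 1.
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        // Last group: high bit stays 0.
        out.write(value);
        return out.toByteArray();
    }

    /** Decodes a varint that starts at index 0 of bytes. */
    static int decodeVarint32(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // continuation flag is 0: this was the last byte
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] encoded = encodeVarint32(400);
        StringBuilder hex = new StringBuilder();
        for (byte b : encoded) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());   // 90 03
        System.out.println(decodeVarint32(encoded)); // 400
    }
}
```

The sketch writes into a `ByteArrayOutputStream` for simplicity, whereas the library code writes into a preallocated buffer and tracks the position itself.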
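Drawback of varint:
Negative numbers take 10 bytes, because their most significant bit is 1 and they are treated as very large unsigned integers.
For example:
In the example above, set age to -1 and look at Person.txt in hex: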
08 FF FF FF FF FF FF FF FF FF 01
Apart from the one byte taken by key=0x08, value=-1 takes 10 bytes.
Look at the source (CodedOutputStream):
/**
* This method does not perform bounds checking on the array. Checking array bounds is the
* responsibility of the caller.
*/
final void bufferInt32NoTag(final int value) {
if (value >= 0) {
bufferUInt32NoTag(value);
} else {
// Must sign-extend.
bufferUInt64NoTag(value);
}
}
final void bufferUInt64NoTag(long value) { // Note: the parameter is a long
if (HAS_UNSAFE_ARRAY_OPERATIONS) {
final long originalPos = position;
while (true) {
if ((value & ~0x7FL) == 0) {
UnsafeUtil.putByte(buffer, position++, (byte) value);
break;
} else {
UnsafeUtil.putByte(buffer, position++, (byte) (((int) value & 0x7F) | 0x80));
value >>>= 7;
}
}
int delta = (int) (position - originalPos);
totalBytesWritten += delta;
........
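The source shows that a negative int32 is first widened to a long (sign-extended to 64 bits) and then varint-encoded, which is why it takes 10 bytes.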
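To see where the 10 comes from, the byte count can be computed directly. This is a small sketch under the same rules as the source above; `NegativeVarintSize`, `varint64Size` and `int32Size` are illustrative names, not library methods.

```java
final class NegativeVarintSize {
    /** Counts the bytes a 64-bit value occupies as a varint (7 payload bits per byte). */
    static int varint64Size(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            size++;
            value >>>= 7;
        }
        return size;
    }

    /** Follows the int32 rule above: the int is widened to long, which sign-extends negatives. */
    static int int32Size(int value) {
        return varint64Size(value); // implicit int-to-long widening keeps the sign bits
    }

    public static void main(String[] args) {
        System.out.println(int32Size(400));               // 2
        System.out.println(int32Size(-1));                // 10
        System.out.println(int32Size(Integer.MIN_VALUE)); // 10
    }
}
```

After sign extension all 64 bits are significant, and 64 bits split into 7-bit groups need ten groups, so every negative int32 costs the same 10 bytes regardless of its magnitude.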
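The fix is to declare fields that may hold negative values as sint32/sint64. These types first apply ZigZag encoding, which maps positive numbers, negative numbers, and 0 onto unsigned numbers, and then apply varint encoding to the result.
ZigZag encoding
ZigZag is an encoding that maps signed integers onto unsigned integers. For example, 0, -1, 1, -2, 2 map to the unsigned values 0, 1, 2, 3, 4.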
Original value | Encoded value |
---|---|
0 | 0 |
-1 | 1 |
1 | 2 |
-2 | 3 |
2 | 4 |
The corresponding source (CodedOutputStream):
/**
* Encode a ZigZag-encoded 32-bit value. ZigZag encodes signed integers
* into values that can be efficiently encoded with varint. (Otherwise,
* negative values must be sign-extended to 64 bits to be varint encoded,
* thus always taking 10 bytes on the wire.)
*
* @param n A signed 32-bit integer.
* @return An unsigned 32-bit integer, stored in a signed int because
* Java has no explicit unsigned support.
*/
public static int encodeZigZag32(final int n) {
// Note: the right-shift must be arithmetic
return (n << 1) ^ (n >> 31);
}
/**
* Encode a ZigZag-encoded 64-bit value. ZigZag encodes signed integers
* into values that can be efficiently encoded with varint. (Otherwise,
* negative values must be sign-extended to 64 bits to be varint encoded,
* thus always taking 10 bytes on the wire.)
*
* @param n A signed 64-bit integer.
* @return An unsigned 64-bit integer, stored in a signed int because
* Java has no explicit unsigned support.
*/
public static long encodeZigZag64(final long n) {
// Note: the right-shift must be arithmetic
return (n << 1) ^ (n >> 63);
}
The corresponding decoding methods:
/**
* Decode a ZigZag-encoded 32-bit value. ZigZag encodes signed integers into values that can be
* efficiently encoded with varint. (Otherwise, negative values must be sign-extended to 64 bits
* to be varint encoded, thus always taking 10 bytes on the wire.)
*
* @param n An unsigned 32-bit integer, stored in a signed int because Java has no explicit
* unsigned support.
* @return A signed 32-bit integer.
*/
public static int decodeZigZag32(final int n) {
return (n >>> 1) ^ -(n & 1);
}
/**
* Decode a ZigZag-encoded 64-bit value. ZigZag encodes signed integers into values that can be
* efficiently encoded with varint. (Otherwise, negative values must be sign-extended to 64 bits
* to be varint encoded, thus always taking 10 bytes on the wire.)
*
* @param n An unsigned 64-bit integer, stored in a signed int because Java has no explicit
* unsigned support.
* @return A signed 64-bit integer.
*/
public static long decodeZigZag64(final long n) {
return (n >>> 1) ^ -(n & 1);
}
Now change the proto file from the example above so that age is declared as sint32, set age to -1 in the test case, and look at Person.txt:
08 01
Apart from key=08, the value -1 takes a single byte. Compared with the 10 bytes needed before, this saves a lot of space.
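The mapping and the size effect can be checked with a short standalone sketch. The two ZigZag formulas are copied from the source shown above; the class `ZigZagSketch` and the helper `varint32Size` are names chosen for illustration.

```java
final class ZigZagSketch {
    /** Same formula as the encoder above; the right shift must be arithmetic. */
    static int encodeZigZag32(int n) {
        return (n << 1) ^ (n >> 31);
    }

    /** Same formula as the decoder above; the right shift must be logical. */
    static int decodeZigZag32(int n) {
        return (n >>> 1) ^ -(n & 1);
    }

    /** Number of bytes an unsigned 32-bit value occupies as a varint. */
    static int varint32Size(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            size++;
            value >>>= 7;
        }
        return size;
    }

    public static void main(String[] args) {
        int[] samples = {0, -1, 1, -2, 2, -64, 64};
        for (int n : samples) {
            int z = encodeZigZag32(n);
            // Prints the original value, its ZigZag form, the varint size, and the decoded value.
            System.out.println(n + " -> " + z + ", " + varint32Size(z)
                    + " byte(s), decoded back to " + decodeZigZag32(z));
        }
    }
}
```

Values from -64 to 63 map to 0 through 127 and therefore fit in one varint byte, which is why small negative numbers become cheap with sint32.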
Data format
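protobuf uses a [key, value] data format. The key is the field's field_number shifted left by 3 bits and OR-ed with its wire_type; field_number is the tag number assigned to the field in the proto file.
The wire_type values in protobuf are: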
Type | Meaning | Used For |
---|---|---|
0 | Varint | int32, int64, uint32, uint64, sint32, sint64, bool, enum |
1 | 64-bit | fixed64, sfixed64, double |
2 | Length-delimited | string, bytes, embedded messages, packed repeated fields |
3 | Start group | groups (deprecated) |
4 | End group | groups (deprecated) |
5 | 32-bit | fixed32, sfixed32, float |
How the key is computed:
/** Makes a tag value given a field number and wire type. */
static int makeTag(final int fieldNumber, final int wireType) {
return (fieldNumber << TAG_TYPE_BITS) | wireType;
}
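TAG_TYPE_BITS is 3, so the low 3 bits of the key hold wire_type and the higher bits hold field_number.
Going back to the age=400 example: age is declared as int32, so wire_type = 0 and key = (1 << 3 | 0) = 0x08, while value = 0x90 0x03. Serialization simply concatenates the key and the value, so the encoded bytes are 0x08 0x90 0x03.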
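Reading the three bytes back makes the layout concrete. The sketch below is an illustration only: the class `TagSketch`, its helpers, and the simplification that the key fits in a single byte (true for field numbers up to 15) are assumptions of this example, not the library's parsing code.

```java
final class TagSketch {
    static final int TAG_TYPE_BITS = 3;
    static final int TAG_TYPE_MASK = (1 << TAG_TYPE_BITS) - 1;

    /** Same formula as makeTag above. */
    static int makeTag(int fieldNumber, int wireType) {
        return (fieldNumber << TAG_TYPE_BITS) | wireType;
    }

    /** The field number sits above the low 3 bits. */
    static int fieldNumberOf(int tag) {
        return tag >>> TAG_TYPE_BITS;
    }

    /** The wire type is the low 3 bits. */
    static int wireTypeOf(int tag) {
        return tag & TAG_TYPE_MASK;
    }

    /** Reads the varint that starts at offset (at most 5 bytes for a 32-bit value). */
    static int readVarint32(byte[] data, int offset) {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            byte b = data[offset++];
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("varint longer than 5 bytes");
    }

    public static void main(String[] args) {
        byte[] message = {0x08, (byte) 0x90, 0x03};
        int tag = message[0] & 0xFF; // assumes a single-byte key
        int value = readVarint32(message, 1);
        System.out.println("field=" + fieldNumberOf(tag)
                + " wireType=" + wireTypeOf(tag)
                + " value=" + value); // field=1 wireType=0 value=400
    }
}
```

Because the key is itself written as a varint, field numbers 1 through 15 keep the key to one byte, which is worth remembering when numbering frequently used fields.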