Like many databases, CockroachDB (CRDB) encodes SQL data into key-value (KV) pairs. The format evolves over time, with an eye toward backward compatibility. This document describes format version 3 in detail except for how CRDB encodes primitive values (pkg/util/encoding/encoding.go).
The Cockroach Labs blog post SQL in CockroachDB: Mapping Table Data to Key-Value Storage covers format version 1, which predates column families, interleaving, and composite encoding. Format version 2 introduced column families, covered in Implementing Column Families in CockroachDB. See also the column families RFC and the interleaving RFC.
This document was originally written by David Eisenstat.
SQL tables consist of a rectangular array of data and some metadata. The metadata include a unique table ID; a nonempty list of primary key columns, each with an ascending/descending designation; and some information about each column. Each column has a numeric ID that is unique within the table, a SQL type, and a column family ID. A column family is a maximal subset of columns with the same column family ID. For more details, see pkg/sql/sqlbase/structured.proto.
Each row of a table gives rise to one or more KV pairs, one per column family as needed (see subsection NULL below). CRDB stores primary key data in KV keys and other data in KV values so that it can use the KV layer to prevent duplicate primary keys. For encoding, see pkg/sql/rowwriter.go. For decoding, see pkg/sql/sqlbase/multirowfetcher.go.
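KV keys consist of several fields:
1. The table ID
2. The ID of the primary index
3. The primary key of the row, one field per primary key column in list order
4. The column family ID
5. When the previous field is nonzero (non-sentinel), its length in bytes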
CRDB encodes these fields individually and concatenates the resulting bytes. The decoder can determine the field boundaries because the field encoding is prefix-free.
Encoded fields start with a byte that indicates the type of the field. For primary key fields, this type has a one-to-many relationship with the SQL datum type. The SQL types STRING and BYTES, for example, share an encoding. The relationship will become many-to-many when CRDB introduces a new DECIMAL encoding, since the old decoder will be retained for backward compatibility.
The format of the remaining bytes depends on the field type. The details (in pkg/util/encoding/encoding.go) are irrelevant here except that, for primary key fields, these bytes have the following order property. Consider a particular primary key column and let enc be the mathematical function that maps SQL data in that column to bytes.
- If the column has an ascending designation, then for data x and y, enc(x) ≤ enc(y) if and only if x ≤ y.
- If the column has a descending designation, then for data x and y, enc(x) ≤ enc(y) if and only if x ≥ y.
In conjunction with prefix freedom, the order property ensures that the SQL layer and the KV layer sort primary keys the same way.
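The order property can be illustrated with a toy encoding for 64-bit signed integers. This is only a sketch: it is not CRDB's actual encoding (that lives in pkg/util/encoding/encoding.go), and the names enc_asc and enc_desc as well as the fixed-width format are assumptions made for illustration.

```python
import struct

def enc_asc(x: int) -> bytes:
    """Toy ascending encoding (hypothetical): big-endian with the sign bit flipped."""
    # Adding 2**63 maps the int64 range onto 0..2**64-1 while preserving order,
    # and big-endian bytes compare the same way the numbers do.
    return struct.pack(">Q", x + (1 << 63))

def enc_desc(x: int) -> bytes:
    """Toy descending encoding (hypothetical): invert every byte of enc_asc."""
    return bytes(b ^ 0xFF for b in enc_asc(x))

data = [3, -7, 0, 42, -1]
# Sorting by encoded bytes agrees with sorting the data itself.
by_asc = sorted(data, key=enc_asc)
by_desc = sorted(data, key=enc_desc)
print(by_asc)
print(by_desc)
```

Because every encoding here has the same width, no encoding is a prefix of another, so the toy format is also prefix-free.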
For more details on primary key encoding, see EncodeTableKey (pkg/sql/sqlbase/table.go). See also EncDatum (pkg/sql/sqlbase/encoded_datum.go).
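KV values consist of:
1. A four-byte checksum
2. A one-byte value type (see the enumeration ValueType in pkg/roachpb/data.proto)
3. Data from where the row intersects the column family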
The value type defaults to TUPLE, which indicates the following encoding. (For other values, see subsection Single-column column families below.) For each column in the column family sorted by column ID, encode the column ID difference and the datum encoding type (unrelated to the value type!) jointly, followed by the datum itself. The column ID difference is the column ID minus the previous column ID if this column is not the first, else the column ID. The joint encoding is commonly one byte, which displays conveniently in hexadecimal as the column ID difference followed by the datum encoding type.
The Go function that performs the joint encoding is encodeValueTag (pkg/util/encoding/encoding.go), which emits an unsigned integer with a variable-length encoding. The low four bits of the integer contain the datum encoding type. The rest contain the column ID difference. As an alternative for datum encoding types greater than 14, encodeValueTag sets the low four bits to SentinelType (15) and emits the actual datum encoding type next.
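The joint encoding can be sketched as follows. This is not the Go implementation: the helper names are hypothetical, and the LEB128-style uvarint merely stands in for whatever variable-length format CRDB really uses (an assumption). The only bytes confirmed by this document are the one-byte tags in the dumps, such as 0x35 (column ID difference 3, Decimal) and 0x26 (column ID difference 2, Bytes).

```python
SENTINEL_TYPE = 15  # value of SentinelType as given in the text

def uvarint(n: int) -> bytes:
    """LEB128-style varint. ASSUMPTION: a stand-in for CRDB's real format."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_value_tag(col_id_delta: int, datum_type: int) -> bytes:
    """Pack the column ID difference and the datum encoding type together."""
    if datum_type >= SENTINEL_TYPE:
        # Types above 14 do not fit in four bits: emit the sentinel, then the type.
        return uvarint((col_id_delta << 4) | SENTINEL_TYPE) + uvarint(datum_type)
    return uvarint((col_id_delta << 4) | datum_type)

def column_ids(deltas):
    """Recover absolute column IDs from the stored differences."""
    ids, prev = [], 0
    for delta in deltas:
        prev += delta
        ids.append(prev)
    return ids

print(encode_value_tag(3, 5).hex())  # tag for a Decimal in column 3
print(column_ids([2, 1]))            # differences 2, 1 give columns 2 and 3
```

Datum encoding types 5 (Decimal) and 6 (Bytes) are read off the annotated dumps in this document rather than taken from the source code.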
Note: Values for sequences are a special case: the sequence value is encoded as if the sequence were a one-row, one-column table, with the key structured in the usual way (beginning with /Table/). However, the value is a bare int64; it doesn't use the encoding specified here. This is because it is incremented using the KV Increment operation so that the increment can be done in one roundtrip, not a read followed by a write as would be required by a normal SQL UPDATE.
An alternative design would be to teach the KV Inc operation to understand SQL value encoding so that the sequence could be encoded consistently with tables, but that would break the KV/SQL abstraction barrier.
The column family with ID 0 is special because it contains the primary key columns. The KV pairs arising from this column family are called sentinel KV pairs. CRDB emits sentinel KV pairs regardless of whether the KV value has other data, to guarantee that primary keys appear in at least one KV pair. (Even if there are other column families, their KV pairs may be suppressed; see subsection NULL below.)
Before column families (i.e., in format version 1), non-sentinel KV keys had a column ID where the column family ID is now. Non-sentinel KV values contained exactly one datum, whose encoding was indicated by the one-byte value type (see MarshalColumnValue in pkg/sql/sqlbase/table.go). Unlike the TUPLE encoding, this encoding did not need to be prefix-free, which was a boon for strings.
On upgrading to format version 2 or higher, CRDB puts each existing column in a column family whose ID is the same as the column ID. This allows backward-compatible encoding and decoding. The encoder uses the old format for single-column column families when the ID of that column equals the DefaultColumnID of the column family (pkg/sql/sqlbase/structured.proto).
SQL NULL has no explicit encoding in tables (primary indexes). Instead, CRDB encodes each row as if the columns where that row is null did not exist. If all of the columns in a column family are null, then the corresponding KV pair is suppressed. The motivation for this design is that adding a column does not require existing data to be re-encoded.
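The rule for which KV pairs a row produces can be sketched as below. The function name and the dictionary layout are hypothetical, and only the printable form of the key is modeled, not its bytes. The trailing 1 on non-sentinel keys assumes the column family ID encodes in one byte, as it does for small IDs in the dumps in this document.

```python
def primary_kv_keys(table_id, index_id, pk, row, families):
    """Printable KV keys for one row (hypothetical helper).

    families maps a column family ID to its non-primary-key column names.
    """
    prefix = f"/Table/{table_id}/{index_id}/{pk}"
    keys = []
    for fam_id in sorted(families):
        has_data = any(row.get(col) is not None for col in families[fam_id])
        if fam_id == 0:
            # The sentinel KV pair is always emitted, even with no other data.
            keys.append(f"{prefix}/0")
        elif has_data:
            # Other families are suppressed when all of their columns are NULL.
            keys.append(f"{prefix}/{fam_id}/1")
    return keys

# FAMILY f0 (id, balance) has ID 0; FAMILY f1 (owner) has ID 1.
FAMILIES = {0: ["balance"], 1: ["owner"]}
ROWS = [
    (1, {"owner": "Alice", "balance": "10000.50"}),
    (3, {"owner": "Carol", "balance": None}),
    (5, {"owner": None, "balance": None}),
]
for pk, row in ROWS:
    print(primary_kv_keys(51, 1, pk, row, FAMILIES))
```

Row 5 yields only its sentinel key, which is why a row whose non-key columns are all NULL remains visible in the KV layer.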
The commands below create a table and insert some data. An annotated KV dump follows.
CREATE TABLE accounts (
  id INT PRIMARY KEY,
  owner STRING,
  balance DECIMAL,
  FAMILY f0 (id, balance),
  FAMILY f1 (owner)
);
INSERT INTO accounts VALUES
  (1, 'Alice', 10000.50),
  (2, 'Bob', 25000.00),
  (3, 'Carol', NULL),
  (4, NULL, 9400.10),
  (5, NULL, NULL);
Here is the relevant output from cockroach debug rocksdb scan --value_hex, with annotations.
/Table/51/1/1/0/1489427290.811792567,0 : 0xB244BD870A3505348D0F4272
       ^- ^ ^ ^                            ^-------^-^^^-----------
       |  | | |                            |       | |||
       Table ID (accounts)                 Checksum| |||
          | | |                                    | |||
          Index ID                                 Value type (TUPLE)
            | |                                      |||
            Primary key (id = 1)                     Column ID difference
              |                                       ||
              Column family ID (f0)                   Datum encoding type (Decimal)
                                                       |
                                                       Datum encoding (10000.50)
/Table/51/1/1/1/1/1489427290.811792567,0 : 0x30C8FBD403416C696365
       ^- ^ ^ ^ ^                            ^-------^-^---------
       |  | | | |                            |       | |
       Table ID (accounts)                   Checksum| |
          | | | |                                    | |
          Index ID                                   Value type (BYTES)
            | | |                                      |
            Primary key (id = 1)                       Datum encoding ('Alice')
              | |
              Column family ID (f1)
                |
                Column family ID encoding length
/Table/51/1/2/0/1489427290.811792567,0 : 0x2C8E35730A3505348D2625A0
            ^                                          ^-----------
            2                                          25000.00
/Table/51/1/2/1/1/1489427290.811792567,0 : 0xE911770C03426F62
            ^                                          ^-----
            2                                          'Bob'
/Table/51/1/3/0/1489427290.811792567,0 : 0xCF8B38950A
            ^
            3
/Table/51/1/3/1/1/1489427290.811792567,0 : 0x538EE3D6034361726F6C
            ^                                          ^---------
            3                                          'Carol'
/Table/51/1/4/0/1489427290.811792567,0 : 0x247286F30A3505348C0E57EA
            ^                                          ^-----------
            4                                          9400.10
/Table/51/1/5/0/1489427290.811792567,0 : 0xCB0644270A
            ^
            5
There exist decimal numbers and collated strings that are equal but not identical, e.g., 1.0 and 1.000. This is problematic because in primary keys, 1.0 and 1.000 must have the same encoding, which precludes lossless decoding. Worse, the encoding of collated strings in primary keys is defined by the Unicode Collation Algorithm, which may not even have an efficient partial inverse.
When collated strings and (soon) decimals appear in primary keys, they have composite encoding. For collated strings, this means encoding data as both a key and value, with the latter appearing in the sentinel KV value (naturally, since the column belongs to the column family with ID 0).
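The "equal but not identical" problem, and the idea behind composite encoding, can be seen with Python's decimal module. This is only an analogy: the composite function is hypothetical and returns strings, whereas CRDB stores bytes in its own formats.

```python
from decimal import Decimal

a, b = Decimal("1.0"), Decimal("1.000")
# The two values compare equal, yet they are written differently.
print(a == b, str(a), str(b))

def composite(d: Decimal):
    """Toy composite encoding (hypothetical): one part for the key, one for the value."""
    key_part = str(d.normalize())  # identical for all equal decimals
    value_part = str(d)            # keeps trailing zeros for lossless decoding
    return key_part, value_part

print(composite(a))
print(composite(b))
```

The key part alone cannot distinguish 1.0 from 1.000, which is why the original datum is stored again in the value.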
Example schema and data:
CREATE TABLE owners (
  owner STRING COLLATE en PRIMARY KEY
);
INSERT INTO owners VALUES
  ('Bob' COLLATE en),
  ('Ted' COLLATE en);
Example dump:
/Table/51/1/"\x16\x05\x17q\x16\x05\x00\x00\x00 \x00 \x00 \x00\x00\b\x02\x02"/0/1489502864.477790157,0 : 0xDC5FDAE10A1603426F62
            ^---------------------------------------------------------------                                          ^-------
            Collation key for 'Bob'                                                                                   'Bob'
/Table/51/1/"\x18\x16\x16L\x161\x00\x00\x00 \x00 \x00 \x00\x00\b\x02\x02"/0/1489502864.477790157,0 : 0x8B30B9290A1603546564
            ^------------------------------------------------------------                                          ^-------
            Collation key for 'Ted'                                                                                'Ted'
To unify the handling of SQL tables and indexes, CRDB stores the authoritative table data in what is termed the primary index. SQL indexes are secondary indexes. All indexes have an ID that is unique within their table.
The user-specified metadata for secondary indexes include a nonempty list of indexed columns, each with an ascending/descending designation, and a disjoint list of stored columns. The first list determines how the index is sorted, and columns from both lists can be read directly from the index.
Users also specify whether a secondary index should be unique. Unique secondary indexes constrain the table data not to have two rows where, for each indexed column, the data therein are non-null and equal.
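The main encoding function for secondary indexes is EncodeSecondaryIndex in pkg/sql/sqlbase/table.go. Each row gives rise to one KV pair per secondary index, whose KV key has fields mirroring the primary index encoding:
1. The table ID
2. The index ID
3. Data from where the row intersects the indexed columns
4. If the index is non-unique or the row has a NULL in an indexed column, data from where the row intersects the non-indexed primary key (implicit) columns
5. If the index is non-unique or the row has a NULL in an indexed column, and the index uses the old format for stored columns, data from where the row intersects the stored columns
6. The column family ID 0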
Unique indexes relegate the data in extra columns to KV values so that the KV layer detects constraint violations. The special case for an indexed NULL arises from the fact that NULL does not equal itself, hence rows with an indexed NULL cannot be involved in a violation. They need a unique KV key nonetheless, as do rows in non-unique indexes, which is achieved by including the non-indexed primary key data. For the sake of simplicity, data in stored columns are also included.
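The rule for when implicit (non-indexed primary key) columns enter the KV key can be sketched as follows. The function is hypothetical, models only the printable key, and assumes the new format for stored columns, in which stored data never appears in the key.

```python
def fmt(v):
    """Render a datum the way the dumps in this document print it."""
    if v is None:
        return "NULL"
    return f'"{v}"' if isinstance(v, str) else str(v)

def secondary_index_key(table_id, index_id, unique, indexed, implicit):
    """Printable secondary index key (hypothetical helper, new STORING format)."""
    fields = list(indexed)
    has_null = any(v is None for v in indexed)
    if not unique or has_null:
        # Nothing else makes this key unique, so append the primary key data.
        fields += list(implicit)
    parts = [str(table_id), str(index_id)] + [fmt(v) for v in fields] + ["0"]
    return "/Table/" + "/".join(parts)

print(secondary_index_key(51, 2, True, ["Alice"], [1]))   # unique, no NULL
print(secondary_index_key(51, 2, True, [None], [4]))      # unique, indexed NULL
print(secondary_index_key(51, 3, False, ["Bob"], [2]))    # non-unique
```

In the unique, non-NULL case the key holds only the indexed data, so a second row with the same owner would collide in the KV layer, which is how the constraint violation is detected.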
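KV values for secondary indexes have value type BYTES and consist of:
1. If the index is unique, data from where the row intersects the non-indexed primary key (implicit) columns, encoded as in the KV key
2. If the index is unique, and the index uses the old format for stored columns, data from where the row intersects the stored columns, encoded as in the KV key
3. If needed, TUPLE-encoded bytes for non-null composite and stored column data (new format)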
All of these fields are optional, so the BYTES value may be empty. Note that, in a unique index, rows with a NULL in an indexed column have their implicit column data stored in both the KV key and the KV value. (Ditto for stored column data in the old format.)
Example schema and data:
CREATE TABLE accounts (
  id INT PRIMARY KEY,
  owner STRING,
  balance DECIMAL,
  UNIQUE INDEX i2 (owner) STORING (balance),
  INDEX i3 (owner) STORING (balance)
);
INSERT INTO accounts VALUES
  (1, 'Alice', 10000.50),
  (2, 'Bob', 25000.00),
  (3, 'Carol', NULL),
  (4, NULL, 9400.10),
  (5, NULL, NULL);
Index ID 1 is the primary index.
/Table/51/1/1/0/1489504989.617188491,0 : 0x4AAC12300A2605416C6963651505348D0F4272
/Table/51/1/2/0/1489504989.617188491,0 : 0x148941AD0A2603426F621505348D2625A0
/Table/51/1/3/0/1489504989.617188491,0 : 0xB1D0B5390A26054361726F6C
/Table/51/1/4/0/1489504989.617188491,0 : 0x247286F30A3505348C0E57EA
/Table/51/1/5/0/1489504989.617188491,0 : 0xCB0644270A
Old STORING format
Index ID 2 is the unique secondary index i2.
/Table/51/2/NULL/4/9400.1/0/1489504989.617188491,0 : 0x01CF9BB0038C2BBD011400
            ^--- ^ ^-----                                      ^-^-^---------
  Indexed column | Stored column                           BYTES 4 9400.1
                 Implicit column
/Table/51/2/NULL/5/NULL/0/1489504989.617188491,0 : 0xE86B1271038D00
            ^--- ^ ^---                                      ^-^-^-
  Indexed column | Stored column                         BYTES 5 NULL
                 Implicit column
/Table/51/2/"Alice"/0/1489504989.617188491,0 : 0x285AC6F303892C0301016400
            ^------                                      ^-^-^-----------
            Indexed column                           BYTES 1 10000.5
/Table/51/2/"Bob"/0/1489504989.617188491,0 : 0x23514F1F038A2C056400
            ^----                                      ^-^-^-------
            Indexed column                         BYTES 2 2.5E+4
/Table/51/2/"Carol"/0/1489504989.617188491,0 : 0xE98BFEE6038B00
            ^------                                      ^-^-^-
            Indexed column                           BYTES 3 NULL
Index ID 3 is the non-unique secondary index i3.
/Table/51/3/NULL/4/9400.1/0/1489504989.617188491,0 : 0xEEFAED0403
            ^--- ^ ^-----                                      ^-
  Indexed column | Stored column                           BYTES
                 Implicit column
/Table/51/3/NULL/5/NULL/0/1489504989.617188491,0 : 0xBE090D2003
            ^--- ^ ^---                                      ^-
  Indexed column | Stored column                         BYTES
                 Implicit column
/Table/51/3/"Alice"/1/10000.5/0/1489504989.617188491,0 : 0x7B4964C303
            ^------ ^ ^------                                      ^-
     Indexed column | Stored column                            BYTES
                    Implicit column
/Table/51/3/"Bob"/2/2.5E+4/0/1489504989.617188491,0 : 0xDF24708303
            ^---- ^ ^-----                                      ^-
   Indexed column | Stored column                           BYTES
                  Implicit column
/Table/51/3/"Carol"/3/NULL/0/1489504989.617188491,0 : 0x96CA34AD03
            ^------ ^ ^---                                      ^-
     Indexed column | Stored column                         BYTES
                    Implicit column
New STORING format
Index ID 2 is the unique secondary index i2.
/Table/51/2/NULL/4/0/1492010940.897101344,0 : 0x7F2009CC038C3505348C0E57EA
            ^--- ^                                      ^-^-^-------------
  Indexed column Implicit column                    BYTES 4 9400.10
/Table/51/2/NULL/5/0/1492010940.897101344,0 : 0x48047B1A038D
            ^--- ^                                      ^-^-
  Indexed column Implicit column                    BYTES 5
/Table/51/2/"Alice"/0/1492010940.897101344,0 : 0x24090BCE03893505348D0F4272
            ^------                                      ^-^-^-------------
            Indexed column                           BYTES 1 10000.50
/Table/51/2/"Bob"/0/1492010940.897101344,0 : 0x54353EB9038A3505348D2625A0
            ^----                                      ^-^-^-------------
            Indexed column                         BYTES 2 25000.00
/Table/51/2/"Carol"/0/1492010940.897101344,0 : 0xE731A320038B
            ^------                                      ^-^-
            Indexed column                           BYTES 3
Index ID 3 is the non-unique secondary index i3.
/Table/51/3/NULL/4/0/1492010940.897101344,0 : 0x17C357B0033505348C0E57EA
            ^--- ^                                      ^-^-------------
  Indexed column Implicit column                    BYTES 9400.10
/Table/51/3/NULL/5/0/1492010940.897101344,0 : 0x844708BC03
            ^--- ^                                      ^-
  Indexed column Implicit column                    BYTES
/Table/51/3/"Alice"/1/0/1492010940.897101344,0 : 0x3AD2E728033505348D0F4272
            ^------ ^                                      ^-^-------------
     Indexed column Implicit column                    BYTES 10000.50
/Table/51/3/"Bob"/2/0/1492010940.897101344,0 : 0x7F1225A4033505348D2625A0
            ^---- ^                                      ^-^-------------
   Indexed column Implicit column                    BYTES 25000.00
/Table/51/3/"Carol"/3/0/1492010940.897101344,0 : 0x45C61B8403
            ^------ ^                                      ^-
     Indexed column Implicit column                    BYTES
Secondary indexes use key encoding for all indexed columns, implicit columns, and stored columns in the old format. Every datum whose key encoding does not suffice for decoding (collated strings, floating-point and decimal negative zero, decimals with trailing zeros) is encoded again, in the same TUPLE that contains stored column data in the new format.
Example schema and data:
CREATE TABLE owners (
  id INT PRIMARY KEY,
  owner STRING COLLATE en,
  INDEX i2 (owner)
);
INSERT INTO owners VALUES
  (1, 'Ted' COLLATE en),
  (2, 'Bob' COLLATE en),
  (3, NULL);
Index ID 1 is the primary index.
/Table/51/1/1/0/1492008659.730236666,0 : 0x6CA87E2B0A2603546564
/Table/51/1/2/0/1492008659.730236666,0 : 0xE900EBB50A2603426F62
/Table/51/1/3/0/1492008659.730236666,0 : 0xCF8B38950A
Index ID 2 is the secondary index i2.
/Table/51/2/NULL/3/0/1492008659.730236666,0 : 0xBDAA5DBE03
            ^---                                        ^-
            Indexed column                          BYTES
/Table/51/2/"\x16\x05\x17q\x16\x05\x00\x00\x00 \x00 \x00 \x00\x00\b\x02\x02"/2/0/1492008659.730236666,0 : 0x4A8239F6032603426F62
            ^---------------------------------------------------------------                                        ^-^---------
            Indexed column: Collation key for 'Bob'                                                             BYTES 'Bob'
/Table/51/2/"\x18\x16\x16L\x161\x00\x00\x00 \x00 \x00 \x00\x00\b\x02\x02"/1/0/1492008659.730236666,0 : 0x747DA39A032603546564
            ^------------------------------------------------------------                                        ^-^---------
            Indexed column: Collation key for 'Ted'                                                          BYTES 'Ted'
By default, indexes (in CRDB terminology, so both primary and secondary) occupy disjoint KV key spans. Users can request that an index be interleaved with another index, which improves the efficiency of joining them.
One index, the parent, must have a primary key that, ignoring column names, is a prefix (not necessarily proper) of the other index, the child. The parent, which currently must be a primary index, has its usual encoding. To encode a KV key in the child, encode it as if it were in the parent but with an interleaving sentinel (EncodeNotNullDescending in pkg/util/encoding/encoding.go) where the column family ID would be. Append the non-interleaved child encoding but without the parent columns. The sentinel informs the decoder that the row does not belong to the parent table.
Note that the parent may itself be interleaved. In general, the interleaving relationships constitute an arborescence.
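The key construction for an interleaved child can be sketched at the level of printable fields. The function and its parameters are hypothetical; "#" is how the dump in this document renders the interleaving sentinel, and the real encoder works on bytes rather than strings.

```python
def interleaved_child_key(parent_table, parent_index, child_table, child_index,
                          pk, shared, family_id=0):
    """Printable key for a child row interleaved into its parent (hypothetical).

    pk is the child's full primary key; its first `shared` columns are the
    ones that match the parent's primary key.
    """
    parent_cols, child_cols = pk[:shared], pk[shared:]
    fields = [
        parent_table, parent_index, *parent_cols,  # encoded as if in the parent
        "#",                                       # sentinel replaces the family ID
        child_table, child_index, *child_cols,     # child encoding minus parent columns
        family_id,
    ]
    return "/Table/" + "/".join(str(f) for f in fields)

# accounts (table 52) interleaved in owners (table 51), sharing owner_id.
print(interleaved_child_key(51, 1, 52, 1, [19, 83], shared=1))
```

Because the child key starts with the parent row's key fields, the child row sorts immediately after its parent row, which is what makes joining the two efficient.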
Example schema and data:
CREATE TABLE owners (
  owner_id INT PRIMARY KEY,
  owner STRING
);
CREATE TABLE accounts (
  owner_id INT,
  account_id INT,
  balance DECIMAL,
  PRIMARY KEY (owner_id, account_id)
) INTERLEAVE IN PARENT owners (owner_id);
INSERT INTO owners VALUES (19, 'Alice');
INSERT INTO accounts VALUES (19, 83, 10000.50);
Example dump:
/Table/51/1/19/0/1489433137.133889094,0 : 0xDBCE04550A2605416C696365
       ^- ^ ^- ^                            ^-------^-^^^-----------
       |  | |  |                            |       | |||
       Table ID (owners)                    Checksum| |||
          | |  |                                    | |||
          Index ID                                  Value type (TUPLE)
            |  |                                      |||
            Primary key (owner_id = 19)               Column ID difference
               |                                       ||
               Column family ID                        Datum encoding type (Bytes)
                                                        |
                                                        Datum encoding ('Alice')
/Table/51/1/19/#/52/1/83/0/1489433137.137447008,0 : 0x691956790A3505348D0F4272
       ^- ^ ^- ^ ^- ^ ^- ^                            ^-------^-^^^-----------
       |  | |  | |  | |  |                            |       | |||
       Table ID (owners) |                            Checksum| |||
          | |  | |  | |  |                                    | |||
          Index ID  | |  |                                    Value type (TUPLE)
            |  | |  | |  |                                      |||
            Primary key (owner_id = 19)                         Column ID difference
               | |  | |  |                                       ||
               Interleaving sentinel                             Datum encoding type (Decimal)
                 |  | |  |                                        |
                 Table ID (accounts)                              Datum encoding (10000.50)
                    | |  |
                    Index ID
                      |  |
                      Primary key (account_id = 83)
                         |
                         Column family ID