In Chapter 1, we introduced the idea of evolvability: we should aim to build systems that make it easy to adapt to change. In most cases, a change to an application's features also requires a change to the data that it stores: perhaps a new field or record type needs to be captured, or perhaps existing data needs to be presented in a new way.
In Chapter 2, we discussed different ways of coping with such change.
With server-side applications you may want to perform a rolling upgrade, deploying the new version to a few nodes at a time. With client-side applications, users may not install the update for some time.
In order for the system to continue running smoothly, we need to maintain compatibility in both directions: backward compatibility (newer code can read data written by older code) and forward compatibility (older code can read data written by newer code).
Programs usually work with data in two different representations:
- In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on, optimized for efficient access by the CPU.
- When you want to write data to a file or send it over the network, you have to encode it as a self-contained sequence of bytes (for example, a JSON document).
The translation from the in-memory representation to a byte sequence is called encoding, and the reverse is called decoding.
Many programming languages come with built-in support for encoding in-memory objects into byte sequences (for example, Java's java.io.Serializable, Ruby's Marshal, Python's pickle). But these language-specific encodings have a number of deep problems:
- The encoding is tied to a particular programming language, making the data very hard to read from another language.
- In order to restore data in the same object types, the decoder needs to be able to instantiate arbitrary classes, which is a frequent source of security problems.
- Versioning data (forward and backward compatibility) is often an afterthought.
- Efficiency (CPU time and the size of the encoded structure) is also often an afterthought.
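To make these problems concrete, here is a small sketch using Python's built-in pickle module (my own illustration; the Person class is made up, not from the chapter):
import pickle

class Person:
    def __init__(self, user_name, favorite_number, interests):
        self.user_name = user_name
        self.favorite_number = favorite_number
        self.interests = interests

# The byte sequence is tied to Python and to this exact class definition,
# so other languages (and even renamed or moved classes) cannot easily read it.
data = pickle.dumps(Person("Martin", 1337, ["daydreaming", "hacking"]))

# Decoding can instantiate arbitrary classes named in the byte stream,
# which is why unpickling untrusted data is a security risk.
restored = pickle.loads(data)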
A standardized textual encoding such as JSON can be written and read by many programming languages. The following record is used as a running example:
{
"userName": "Martin",
"favoriteNumber": 1337,
"interests": ["daydreaming", "hacking"]
}
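A minimal sketch of encoding and decoding this record with Python's standard json module (my own illustration, not from the chapter):
import json

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Textual encodings are human-readable, but every encoded record carries
# the full field names, which makes them comparatively verbose.
encoded = json.dumps(record)        # a str; call .encode("utf-8") for bytes
decoded = json.loads(encoded)
assert decoded == record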
Apache Thrift and Protocol Buffers are binary encoding libraries that require a schema for any data that is encoded. Our example record described in Thrift's interface definition language (IDL) looks like this:
struct Person {
1:required string userName,
2:optional i64 favoriteNumber,
3:optional list<string> interests
}
The Protocol Buffers IDL for the same data is very similar:
message Person {
required string user_name = 1;
optional int64 favorite_number = 2;
repeated string interests = 3;
}
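A sketch of how the generated code might be used from Python, assuming protoc has produced a module named person_pb2 from the schema above (the module name is illustrative):
import person_pb2  # generated by: protoc --python_out=. person.proto

person = person_pb2.Person(
    user_name="Martin",
    favorite_number=1337,
    interests=["daydreaming", "hacking"],
)

# The binary encoding contains only field tags (1, 2, 3), types, and values.
encoded = person.SerializeToString()

decoded = person_pb2.Person()
decoded.ParseFromString(encoded)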
Thrift has two different binary encoding formats, called BinaryProtocol and CompactProtocol.
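A sketch of encoding the struct with each protocol from Python, assuming the Thrift compiler has generated a module named person.ttypes (the module name and import path are illustrative):
from thrift.protocol import TBinaryProtocol, TCompactProtocol
from thrift.transport import TTransport
from person.ttypes import Person  # generated by: thrift --gen py person.thrift

person = Person(userName="Martin", favoriteNumber=1337,
                interests=["daydreaming", "hacking"])

def encode(protocol_factory):
    # Serialize the struct into an in-memory buffer with the given protocol.
    buf = TTransport.TMemoryBuffer()
    person.write(protocol_factory(buf))
    return buf.getvalue()

binary_bytes = encode(TBinaryProtocol.TBinaryProtocol)      # larger encoding
compact_bytes = encode(TCompactProtocol.TCompactProtocol)   # smaller encoding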
Compared to MessagePack (a binary encoding for JSON), the big difference in the Thrift and Protocol Buffers encodings is that there are no field names (userName, favoriteNumber) in the encoded data. Instead, it contains field tags (1, 2, and 3), which are the numbers that appear in the schema definition. Protocol Buffers' encoding is very similar to Thrift's CompactProtocol.
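For comparison, encoding the same record with MessagePack keeps the field names in the output (a sketch using the msgpack package; the library choice is my assumption):
import msgpack

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# The binary output still spells out "userName", "favoriteNumber", etc.,
# so it is only slightly smaller than the textual JSON.
packed = msgpack.packb(record)
unpacked = msgpack.unpackb(packed)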
In the schemas shown earlier, each field was marked either required or optional, but this makes no difference to how the field is encoded (nothing in the binary data indicates whether a field was required). The difference is simply that required enables a runtime check that fails if the field is not set, which can be useful for catching bugs.
How to keep backward and forward compatibility? The key is the field tags: you can add new fields to the schema, provided that you give each new field a new tag number. Old code simply ignores fields with tag numbers it doesn't recognize, which preserves forward compatibility; as long as every new field is optional or has a default value, new code can also read old data, which preserves backward compatibility. You may only remove a field that is optional, and you may never reuse its tag number. Changing the datatype of a field is sometimes possible, but values can lose precision or get truncated. For example, say you change a 32-bit integer into a 64-bit integer: new code can read data written by old code, because the parser fills in the missing bits with zeros. However, if old code reads data written by new code, the old code is still using a 32-bit variable to hold the value. If the decoded 64-bit value won't fit in 32 bits, it will be truncated.
Protocol Buffers does not have a list or array datatype; instead it has a repeated marker for fields (a third option alongside required and optional). Thrift has a dedicated list datatype, parameterized with the datatype of the list elements. One consequence for schema evolution: in Protocol Buffers it is fine to change an optional (single-valued) field into a repeated (multi-valued) field.
Apache Avro is another binary encoding format that uses schemas. Our example schema written in Avro IDL looks like this:
record Person {
string userName;
union { null, long } favoriteNumber = null;
array<string> interests;
}
There is nothing in the Avro encoding to identify fields or their datatypes, so the binary data can only be decoded correctly if the reading code knows the schema with which it was written. However, the writer's schema and the reader's schema don't have to be the same; they only need to be compatible. The Avro library resolves the differences by looking at the writer's schema and the reader's schema side by side and translating the data from the writer's schema into the reader's schema.
To maintain compatibility, you may only add or remove a field that has a default value.
For example, union { null, long, string } field; indicates that field can be null, a number, or a string. To allow a field to be null, its type must be a union that includes null, and null can only be used as the default value if it is one of the branches of the union.
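To make the schema resolution concrete, here is a sketch using the fastavro package (the library choice, the added photoURL field, and the JSON form of the schemas are my assumptions, not from the chapter):
import io
from fastavro import schemaless_reader, schemaless_writer

# The writer's schema, equivalent to the Avro IDL above, in Avro's JSON form.
writer_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}

# A newer reader's schema: favoriteNumber (which had a default) was removed,
# and a new field with a default value was added (both are allowed changes).
reader_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
        {"name": "photoURL", "type": ["null", "string"], "default": None},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
})
buf.seek(0)

# The library resolves the two schemas: favoriteNumber is ignored and
# photoURL is filled in from its default value.
decoded = schemaless_reader(buf, writer_schema, reader_schema)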
How does the reader know the writer’s schema?
The answer depends on the context in which Avro is being used. To give a few examples: in a large file with lots of records, the writer's schema can be included once at the beginning of the file; in a database where records are written individually, each record can carry a version number that refers to a list of schema versions kept by the database; when two processes communicate over a network, they can negotiate the schema version when the connection is set up.
Because Avro doesn't use tag numbers, it is friendlier to dynamically generated schemas: an Avro schema can easily be generated from a relational schema, and the database contents encoded using that schema. If the database schema changes (for example, a table has one column added and one column removed), you can just generate a new Avro schema from the updated database schema and export the data in the new Avro schema.
Avro provides optional code generation for statically typed programming languages, but it can be used just as well without any code generation. In dynamically typed programming languages such as JavaScript, Ruby, or Python, there is not much point in generating code, since there is no compile-time type checker to satisfy.
Schema-based binary encodings have a number of nice properties:
- They can be much more compact than the various "binary JSON" variants, since they can omit field names from the encoded data.
- The schema is a valuable form of documentation, and because it is required for decoding, you can be sure that it is up to date.
- Keeping a database of schemas allows you to check the forward and backward compatibility of schema changes before anything is deployed.
- For statically typed languages, code generation from the schema enables type checking at compile time.
In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON databases provide, while also providing better guarantees about your data and better tooling.
Compatibility is a relationship between one process that encodes the data and another process that decodes it. In the rest of this chapter we will explore some of the most common ways that data flows between processes: via databases, via service calls (REST and RPC), and via asynchronous message passing.