This guide describes how to use the protocol buffer language to structure your protocol buffer data, including .proto file syntax and how to generate data access classes from your .proto files.
This is a reference guide – for a step-by-step example that uses many of the features described in this document, see the tutorial for your chosen language.
First let's look at a very simple example. Let's say you want to define a search request message format, where each search request has a query string, the particular page of results you are interested in, and a number of results per page. Here's the .proto file you use to define the message type.
message SearchRequest {
  required string query = 1;
  optional int32 page_number = 2;
  optional int32 result_per_page = 3;
}
The SearchRequest message definition specifies three fields (name/value pairs), one for each piece of data that you want to include in this type of message. Each field has a name and a type.
In the above example, all the fields are scalar types: two integers (page_number and result_per_page) and a string (query). However, you can also specify composite types for your fields, including enumerations and other message types.
As you can see, each field in the message definition has a unique numbered tag. These tags are used to identify your fields in the message binary format, and should not be changed once your message type is in use. Note that tags with values in the range 1 through 15 take one byte to encode. Tags in the range 16 through 2047 take two bytes. So you should reserve the tags 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.
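The byte counts above follow from the way a field's key is written on the wire: the tag number is shifted left by three bits to make room for a wire type, and the result is stored as a base-128 varint. The Python sketch below reproduces the boundaries at 15 and 2047 under that assumption; varint_size and tag_size are illustrative helper names, not part of any protocol buffer API.

```python
def varint_size(value: int) -> int:
    """Number of bytes needed to store a non-negative integer as a varint."""
    size = 1
    while value >= 0x80:  # each byte carries 7 bits of payload
        value >>= 7
        size += 1
    return size


def tag_size(field_number: int, wire_type: int = 0) -> int:
    """Bytes used by the key (tag number plus 3-bit wire type) of one field."""
    return varint_size((field_number << 3) | wire_type)


for number in (1, 15, 16, 2047, 2048):
    print(number, tag_size(number))
```

Tags 1 through 15 fit in one byte because only four bits remain for the tag number once the wire type and the varint continuation bit are accounted for.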
The smallest tag number you can specify is 1, and the largest is 2^29 - 1, or 536,870,911. You also cannot use the numbers 19000 through 19999 (FieldDescriptor::kFirstReservedNumber through FieldDescriptor::kLastReservedNumber), as they are reserved for the Protocol Buffers implementation - the protocol buffer compiler will complain if you use one of these reserved numbers in your .proto.
You specify that message fields are one of the following:
required: a well-formed message must have exactly one of this field.
optional: a well-formed message can have zero or one of this field (but not more than one).
repeated: this field can be repeated any number of times (including zero) in a well-formed message. The order of the repeated values will be preserved.
For historical reasons, repeated fields of basic numeric types aren't encoded as efficiently as they could be. New code should use the special option [packed=true] to get a more efficient encoding. For example:
repeated int32 samples = 4 [packed=true];
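To see roughly why [packed=true] helps, the sketch below hand-encodes the same values both ways. It assumes the layouts described in the encoding documentation (one key per element when unpacked; a single key, a byte length and the concatenated values when packed) and handles only non-negative values. The helper names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_unpacked(field_number: int, values) -> bytes:
    """One key (wire type 0) in front of every element."""
    key = encode_varint((field_number << 3) | 0)
    return b"".join(key + encode_varint(v) for v in values)


def encode_packed(field_number: int, values) -> bytes:
    """One key (wire type 2), a byte length, then all elements back to back."""
    key = encode_varint((field_number << 3) | 2)
    payload = b"".join(encode_varint(v) for v in values)
    return key + encode_varint(len(payload)) + payload


samples = [3, 270, 86942]
print(len(encode_unpacked(4, samples)), len(encode_packed(4, samples)))
```

The saving grows with the number of elements, since the packed form pays for the key only once.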
Required Is Forever: You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated. However, this view is not universal.
Multiple message types can be defined in a single .proto file. This is useful if you are defining multiple related messages – so, for example, if you wanted to define the reply message format that corresponds to your SearchRequest message type, you could add it to the same .proto:
message SearchRequest {
  required string query = 1;
  optional int32 page_number = 2;
  optional int32 result_per_page = 3;
}

message SearchResponse {
  ...
}
To add comments to your .proto files, use C/C++-style // syntax.
message SearchRequest {
  required string query = 1;
  optional int32 page_number = 2;      // Which page number do we want?
  optional int32 result_per_page = 3;  // Number of results to return per page.
}
When you run the protocol buffer compiler on a .proto, the compiler generates the code in your chosen language that you'll need to work with the message types you've described in the file, including getting and setting field values, serializing your messages to an output stream, and parsing your messages from an input stream.
For C++, the compiler generates a .h and .cc file from each .proto, with a class for each message type described in your file.
For Java, the compiler generates a .java file with a class for each message type, as well as special Builder classes for creating message class instances.
Python is a little different – the Python compiler generates a module with a static descriptor of each message type in your .proto, which is then used with a metaclass to create the necessary Python data access class at runtime.
You can find out more about using the APIs for each language by following the tutorial for your chosen language. For even more API details, see the relevant API reference.
A scalar message field can have one of the following types – the table shows the type specified in the .proto file, and the corresponding type in the automatically generated class:
.proto Type | Notes | C++ Type | Java Type |
double | | double | double |
float | | float | float |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long |
uint32 | Uses variable-length encoding. | uint32 | int[1] |
uint64 | Uses variable-length encoding. | uint64 | long[1] |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int[1] |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long[1] |
sfixed32 | Always four bytes. | int32 | int |
sfixed64 | Always eight bytes. | int64 | long |
bool | | bool | boolean |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String |
bytes | May contain any arbitrary sequence of bytes. | string | ByteString |
You can find out more about how these types are encoded when you serialize your message in Protocol Buffer Encoding.
[1] In Java, unsigned 32-bit and 64-bit integers are represented using their signed counterparts, with the top bit simply being stored in the sign bit.
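The size trade-offs noted in the table can be checked with a little arithmetic. The sketch below assumes the behaviour described in the encoding documentation: a negative int32 is sign-extended to 64 bits before being written as a varint, and sint32 applies a ZigZag mapping first. The function names are illustrative only.

```python
def varint_size(value: int) -> int:
    """Bytes needed to store an unsigned integer as a base-128 varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def int32_size(value: int) -> int:
    """int32: negative values are sign-extended to 64 bits on the wire."""
    return varint_size(value & 0xFFFFFFFFFFFFFFFF)


def sint32_size(value: int) -> int:
    """sint32: ZigZag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... first."""
    zigzag = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
    return varint_size(zigzag)


print(int32_size(-1), sint32_size(-1))           # negative value, both types
print(varint_size(2**28 - 1), varint_size(2**28))  # around the fixed32 threshold
```

A small negative number costs ten bytes as an int32 but one byte as a sint32, and a uint32 at or above 2^28 needs five varint bytes, one more than the four bytes fixed32 always uses.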
As mentioned above, elements in a message description can be labeled optional. A well-formed message may or may not contain an optional element. When a message is parsed, if it does not contain an optional element, the corresponding field in the parsed object is set to the default value for that field. The default value can be specified as part of the message description. For example, let's say you want to provide a default value of 10 for a SearchRequest's result_per_page value.
optional int32 result_per_page = 3 [default = 10];
If the default value is not specified for an optional element, a type-specific default value is used instead: for strings, the default value is the empty string. For bools, the default value is false. For numeric types, the default value is zero. For enums, the default value is the first value listed in the enum's type definition.
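The fallback rules above can be summarised as a small lookup. This is only a sketch of the rules as stated here, not of how any generated class implements them; effective_default, its parameters and the _MISSING sentinel are hypothetical names.

```python
_MISSING = object()  # distinguishes "no [default = ...] given" from a default of None

NUMERIC_TYPES = {
    "double", "float", "int32", "int64", "uint32", "uint64",
    "sint32", "sint64", "fixed32", "fixed64", "sfixed32", "sfixed64",
}


def effective_default(field_type, explicit=_MISSING, enum_values=()):
    """Value an unset optional field reads as, following the rules above."""
    if explicit is not _MISSING:
        return explicit            # [default = ...] wins when present
    if field_type == "string":
        return ""
    if field_type == "bool":
        return False
    if field_type in NUMERIC_TYPES:
        return 0
    if field_type == "enum":
        return enum_values[0]      # first value listed in the enum definition
    raise ValueError(f"no default rule sketched for {field_type!r}")


print(effective_default("int32", explicit=10))
print(effective_default("enum", enum_values=("UNIVERSAL", "WEB")))
```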
When you're defining a message type, you might want one of its fields to only have one of a pre-defined list of values. For example, let's say you want to add a corpus field for each SearchRequest, where the corpus can be UNIVERSAL, WEB, IMAGES, LOCAL, NEWS, PRODUCTS or VIDEO. You can do this very simply by adding an enum to your message definition - a field with an enum type can only have one of a specified set of constants as its value (if you try to provide a different value, the parser will treat it like an unknown field). In the following example we've added an enum called Corpus with all the possible values, and a field of type Corpus:
message SearchRequest {
  required string query = 1;
  optional int32 page_number = 2;
  optional int32 result_per_page = 3 [default = 10];
  enum Corpus {
    UNIVERSAL = 0;
    WEB = 1;
    IMAGES = 2;
    LOCAL = 3;
    NEWS = 4;
    PRODUCTS = 5;
    VIDEO = 6;
  }
  optional Corpus corpus = 4 [default = UNIVERSAL];
}
Enumerator constants must be in the range of a 32-bit integer. Since enum values use varint encoding on the wire, negative values are inefficient and thus not recommended. You can define enums within a message definition, as in the above example, or outside – these enums can be reused in any message definition in your .proto file. You can also use an enum type declared in one message as the type of a field in a different message, using the syntax MessageType.EnumType.
When you run the protocol buffer compiler on a .proto that uses an enum, the generated code will have a corresponding enum for Java or C++, or a special EnumDescriptor class for Python that's used to create a set of symbolic constants with integer values in the runtime-generated class.
For more information about how to work with message enums in your applications, see the generated code guide for your chosen language.
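As a rough picture of the "unknown field" behaviour described earlier for enum values outside the declared set, the sketch below mirrors Corpus with Python's standard IntEnum. It is a hand-written stand-in, not the code the compiler generates, and read_corpus is a hypothetical helper.

```python
from enum import IntEnum


class Corpus(IntEnum):
    """Hand-written mirror of the Corpus enum from the example above."""
    UNIVERSAL = 0
    WEB = 1
    IMAGES = 2
    LOCAL = 3
    NEWS = 4
    PRODUCTS = 5
    VIDEO = 6


def read_corpus(wire_value: int, default: Corpus = Corpus.UNIVERSAL):
    """Return (field value, unrecognised raw value or None).

    A number outside the declared set leaves the field at its default and
    is kept aside, much as an unknown field would be.
    """
    try:
        return Corpus(wire_value), None
    except ValueError:
        return default, wire_value


print(read_corpus(4))
print(read_corpus(42))
```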
You can use other message types as field types. For example, let's say you wanted to include Result messages in each SearchResponse message – to do this, you can define a Result message type in the same .proto and then specify a field of type Result in SearchResponse:
message SearchResponse {
  repeated Result result = 1;
}

message Result {
  required string url = 1;
  optional string title = 2;
  repeated string snippets = 3;
}
In the above example, the Result message type is defined in the same file as SearchResponse – what if the message type you want to use as a field type is already defined in another .proto file?
You can use definitions from other .proto files by importing them. To import another .proto's definitions, you add an import statement to the top of your file:
import "myproject/other_protos.proto";
The protocol compiler searches for imported files in a set of directories specified on the protocol compiler command line using the -I/--proto_path flag. If no flag was given, it looks in the directory in which the compiler was invoked.
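The lookup can be pictured as a simple directory search. The sketch below is an assumption-laden model rather than the compiler's actual code: it assumes the -I directories are tried in the order given, and resolve_import is a hypothetical name.

```python
from pathlib import Path


def resolve_import(import_name, search_dirs=None, cwd="."):
    """Return the first existing file for an import statement, or None.

    search_dirs stands in for the directories passed with -I; when none
    are given, only the directory the compiler was invoked from is used.
    """
    for directory in (search_dirs or [cwd]):
        candidate = Path(directory) / import_name
        if candidate.is_file():
            return candidate
    return None


print(resolve_import("myproject/other_protos.proto", ["protos", "third_party"]))
```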
You can define and use message types inside other message types, as in the following example – here the Result message is defined inside the SearchResponse message:
message SearchResponse {
  message Result {
    required string url = 1;
    optional string title = 2;
    repeated string snippets = 3;
  }
  repeated Result result = 1;
}
If you want to reuse this message type outside its parent message type, you refer to it as Parent.Type:
message SomeOtherMessage { optional SearchResponse.Result result = 1; }
You can nest messages as deeply as you like:
message Outer {                  // Level 0
  message MiddleAA {             // Level 1
    message Inner {              // Level 2
      required int64 ival = 1;
      optional bool booly = 2;
    }
  }
  message MiddleBB {             // Level 1
    message Inner {              // Level 2
      required int32 ival = 1;
      optional bool booly = 2;
    }
  }
}
Groups are another way to nest information in your message definitions. Note that this feature is deprecated and should not be used when creating new message types – use nested message types instead. For example, another way to specify a SearchResponse containing a number of Results is as follows:
message SearchResponse {
  repeated group Result = 1 {
    required string url = 2;
    optional string title = 3;
    repeated string snippets = 4;
  }
}
A group simply combines a nested message type and a field into a single declaration. In your code, you can treat this message just as if it had a Result type field called result (the latter name is converted to lower-case so that it does not conflict with the former). Therefore, this example is exactly equivalent to the SearchResponse above, except that the message has a different wire format.
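This guide does not spell out the wire-format difference. Going by the encoding documentation, a nested message is written as a length-delimited field (wire type 2), while a group is bracketed by start-group and end-group keys (wire types 3 and 4) with no length. The sketch below shows only that framing; the helper names are hypothetical, and the single-byte shortcuts hold only for field numbers up to 15 and bodies shorter than 128 bytes.

```python
WIRE_LENGTH_DELIMITED = 2
WIRE_START_GROUP = 3
WIRE_END_GROUP = 4


def key(field_number: int, wire_type: int) -> bytes:
    """Single-byte key; valid only for field numbers 1 through 15."""
    return bytes([(field_number << 3) | wire_type])


def frame_as_message(field_number: int, body: bytes) -> bytes:
    """Nested message: key, byte length, body (body under 128 bytes here)."""
    return key(field_number, WIRE_LENGTH_DELIMITED) + bytes([len(body)]) + body


def frame_as_group(field_number: int, body: bytes) -> bytes:
    """Group: start-group key, body, end-group key, with no length prefix."""
    return key(field_number, WIRE_START_GROUP) + body + key(field_number, WIRE_END_GROUP)


body = b"\x12\x01a"  # some already-encoded fields: field 2, a one-byte string "a"
print(frame_as_message(1, body).hex())
print(frame_as_group(1, body).hex())
```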
If an existing message type no longer meets all your needs – for example, you'd like the message format to have an extra field – but you'd still like to use code created with the old format, don't worry! It's very simple to update message types without breaking any of your existing code. Just remember the following rules:
Any new fields that you add should be optional or repeated. This means that any messages serialized by code using your "old" message format can be parsed by your new generated code, as they won't be missing any required elements. You should set up sensible default values for these elements so that new code can properly interact with messages generated by old code. Similarly, messages created by your new code can be parsed by your old code: old binaries simply ignore the new field when parsing. However, the unknown fields are not discarded, and if the message is later serialized, the unknown fields are serialized along with it – so if the message is passed on to new code, the new fields are still available. Note that preservation of unknown fields is currently not available for Python.
Non-required fields can be removed, as long as the tag number is not used again in your updated message type (it may be better to rename the field instead, so that future users of your .proto can't accidentally reuse the number).
int32, uint32, int64, uint64, and bool are all compatible – this means you can change a field from one of these types to another without breaking forwards- or backwards-compatibility. If a number is parsed from the wire which doesn't fit in the corresponding type, you will get the same effect as if you had cast the number to that type in C++ (e.g. if a 64-bit number is read as an int32, it will be truncated to 32 bits).
sint32 and sint64 are compatible with each other but are not compatible with the other integer types.
string and bytes are compatible as long as the bytes are valid UTF-8.
Embedded messages are compatible with bytes if the bytes contain an encoded version of the message.
fixed32 is compatible with sfixed32, and fixed64 with sfixed64.
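The integer compatibility rules can be made concrete with a small model of what a reader sees when a field's declared type changes. The sketch assumes the C++-style truncation described in the rules and, for sint32, the ZigZag decoding from the encoding documentation; the helper names are illustrative.

```python
def as_int32(raw: int) -> int:
    """Reinterpret a decoded varint as int32, like a C++ cast would."""
    raw &= 0xFFFFFFFF
    return raw - (1 << 32) if raw & 0x80000000 else raw


def as_bool(raw: int) -> bool:
    """Reinterpret a decoded varint as bool."""
    return raw != 0


def as_sint32(raw: int) -> int:
    """ZigZag-decode a varint, which is what a sint32 reader does."""
    return (raw >> 1) ^ -(raw & 1)


print(as_int32(2**32 + 5))  # 64-bit value read as int32: upper bits are dropped
print(as_bool(7))           # any non-zero integer reads as true
print(as_sint32(1))         # the int32 value 1 misread through a sint32 field
```

The last line shows why sint32 cannot be swapped with the other integer types: the same bytes decode to a different number.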
Extensions let you declare that a range of field numbers in a message are available for third-party extensions. Other people can then declare new fields for your message type with those numeric tags in their own .proto files without having to edit the original file. Let's look at an example:
message Foo {
  // ...
  extensions 100 to 199;
}
This says that the range of field numbers [100, 199] in Foo is reserved for extensions. Other users can now add new fields to Foo in their own .proto files that import your .proto, using tags within your specified range – for example:
extend Foo { optional int32 bar = 126; }
This says that Foo now has an optional int32 field called bar.
When your user's Foo messages are encoded, the wire format is exactly the same as if the user defined the new field inside Foo. However, the way you access extension fields in your application code is slightly different to accessing regular fields – your generated data access code has special accessors for working with extensions. So, for example, here's how you set the value of bar in C++:
Foo foo;
foo.SetExtension(bar, 15);
Similarly, the Foo class defines templated accessors HasExtension(), ClearExtension(), GetExtension(), MutableExtension(), and AddExtension(). All have semantics matching the corresponding generated accessors for a normal field. For more information about working with extensions, see the generated code reference for your chosen language.
Note that extensions can be of any field type, including message types.
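The point made earlier, that an extension is encoded exactly like a field declared inside Foo, can be illustrated by hand-encoding bar = 15. The sketch assumes the key and varint layout from the encoding documentation; the encoder takes only the tag number and value as input, which is why the place of declaration cannot matter. Helper names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_int32_field(field_number: int, value: int) -> bytes:
    """Key (tag number, wire type 0) followed by the value; non-negative only."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


# Tag 126 with value 15, whether bar is an extension or a regular field of Foo.
print(encode_int32_field(126, 15).hex())
```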
You can declare extensions in the scope of another type:
message Baz {
  extend Foo {
    optional int32 bar = 126;
  }
  ...
}
In this case, the C++ code to access this extension is:
Foo foo;
foo.SetExtension(Baz::bar, 15);
In other words, the only effect is that bar is defined within the scope of Baz.
This is a common source of confusion: Declaring an extend block nested inside a message type does not imply any relationship between the outer type and the extended type. In particular, the above example does not mean that Baz is any sort of subclass of Foo. All it means is that the symbol bar is declared inside the scope of Baz; it's simply a static member.
A common pattern is to define extensions inside the scope of the extension's field type – for example, here's an extension to Foo of type Baz, where the extension is defined as part of Baz:
message Baz {
  extend Foo {
    optional Baz foo_ext = 127;
  }
  ...
}
However, there is no requirement that an extension with a message type be defined inside that type. You can also do this:
message Baz {
  ...
}

// This can even be in a different file.
extend Foo {
  optional Baz foo_baz_ext = 127;
}
In fact, this syntax may be preferred to avoid confusion. As mentioned above, the nested syntax is often mistaken for subclassing by users who are not already familiar with extensions.
It's very important to make sure that two users don't add extensions to the same message type using the same numeric tag – data corruption can result if an extension is accidentally interpreted as the wrong type. You may want to consider defining an extension numbering convention for your project to prevent this happening.
If your numbering convention might involve extensions having very large numbers as tags, you can specify that your extension range goes up to the maximum possible field number using the max keyword:
message Foo { extensions 1000 to max; }
max is 2^29 - 1, or 536,870,911.
As when choosing tag numbers in general, your numbering convention also needs to avoid field numbers 19000 through 19999 (FieldDescriptor::kFirstReservedNumber through FieldDescriptor::kLastReservedNumber), as they are reserved for the Protocol Buffers implementation. You can define an extension range that includes this range, but the protocol compiler will not allow you to define actual extensions with these numbers.
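One way a project might enforce its numbering convention is a small registry that refuses duplicate or reserved numbers before anyone writes the extend block. This is a hypothetical project-side helper, not a protocol buffer feature; ExtensionRegistry and claim are made-up names.

```python
MAX_FIELD_NUMBER = 2**29 - 1     # 536,870,911, the value of max
RESERVED = range(19000, 20000)   # 19000 through 19999


class ExtensionRegistry:
    """Tracks which extension numbers of one message type are taken."""

    def __init__(self, first: int, last: int = MAX_FIELD_NUMBER):
        self.first, self.last = first, last
        self._owners = {}

    def claim(self, number: int, owner: str) -> None:
        if not self.first <= number <= self.last:
            raise ValueError(f"{number} is outside the extension range")
        if number in RESERVED:
            raise ValueError(f"{number} is reserved for the implementation")
        if number in self._owners:
            raise ValueError(f"{number} already claimed by {self._owners[number]}")
        self._owners[number] = owner


registry = ExtensionRegistry(100, 199)   # matches "extensions 100 to 199;"
registry.claim(126, "bar")
```

A second claim on 126, or any claim in 19000 through 19999 on a range that runs to max, raises ValueError instead of silently producing conflicting definitions.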
You can add an optional package specifier to a .proto file to prevent name clashes between protocol message types.
package foo.bar;
message Open { ... }
You can then use the package specifier when defining fields of your message type:
message Foo { ... required foo.bar.Open open = 1; ... }
The way a package specifier affects the generated code depends on your chosen language:
In C++, the generated classes are wrapped inside a C++ namespace. For example, Open would be in the namespace foo::bar.
In Java, the package is used as the Java package, unless you explicitly provide an option java_package in your .proto file.
Type name resolution in the protocol buffer language works like C++: first the innermost scope is searched, then the next-innermost, and so on, with each package considered to be "inner" to its parent package. A leading '.' (for example, .foo.bar.Baz) means to start from the outermost scope instead.
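The innermost-first search can be sketched in a few lines. This is a simplified model of the rule as stated here, not the compiler's actual algorithm; resolve and its arguments are hypothetical, with known standing for the set of fully qualified type names the compiler has seen.

```python
def resolve(name: str, scope: str, known: set):
    """Resolve a type name used inside `scope` (a dotted name) or return None."""
    if name.startswith("."):
        # A leading '.' means: start from the outermost scope.
        full = name[1:]
        return full if full in known else None
    parts = scope.split(".") if scope else []
    while True:
        candidate = ".".join(parts + [name])
        if candidate in known:
            return candidate
        if not parts:
            return None
        parts.pop()  # step out to the next enclosing scope


known = {"Baz", "foo.Baz", "foo.bar.Baz", "foo.bar.Msg"}
print(resolve("Baz", "foo.bar.Msg", known))
print(resolve(".Baz", "foo.bar.Msg", known))
```

Inside foo.bar.Msg the plain name Baz finds foo.bar.Baz first, while .Baz skips the search and names the top-level type.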
The protocol buffer compiler resolves all type names by parsing the imported .proto files. The code generator for each language knows how to refer to each type in that language, even if it has different scoping rules.
If you want to use your message types with an RPC (Remote Procedure Call) system, you can define an RPC service interface in a .proto file and the protocol buffer compiler will generate service interface code and stubs in your chosen language. So, for example, if you want to define an RPC service with a method that takes your SearchRequest and returns a SearchResponse, you can define it in your .proto file as follows:
service SearchService {
  rpc Search (SearchRequest) returns (SearchResponse);
}
The protocol compiler will then generate an abstract interface called SearchService and a corresponding "stub" implementation. The stub forwards all calls to an RpcChannel, which in turn is an abstract interface that you must define yourself in terms of your own RPC system. For example, you might implement an RpcChannel which serializes the message and sends it to a server via HTTP. In other words, the generated stub provides a type-safe interface for making protocol-buffer-based RPC calls, without locking you into any particular RPC implementation. So, in C++, you might end up with code like this:
// Alias so that the google::protobuf types can be written as protobuf::...
namespace protobuf = google::protobuf;

protobuf::RpcChannel* channel;
protobuf::RpcController* controller;
SearchService* service;
SearchRequest request;
SearchResponse response;