Protozero is a very low level library. You really have to know some of the insides of Protocol Buffers to work with it!
So before reading any further in this document, read the following from the Protocol Buffer documentation:
Make sure you understand the basic types of values supported by Protocol Buffers. Refer to this handy table and the cheat sheet if you are getting lost.
You need a C++11-capable compiler for Protozero to work. Copy the files in the
include/protozero
directory somewhere where your build system can find them.
Keep the protozero
directory and include the files in the form
#include <protozero/FILENAME.hpp>
To use the pbf_reader
class, add this include to your C++ program:
#include <protozero/pbf_reader.hpp>
The pbf_reader
class contains asserts that will detect some programming
errors. We encourage you to compile with asserts enabled in your debug builds.
Lets say you have a protocol description in a .proto
file like this:
message Example1 {
required uint32 x = 1;
optional string s = 2;
repeated fixed64 r = 17;
}
To read messages created according to that description, you will have code that looks somewhat like this:
#include <protozero/pbf_reader.hpp>
// get data from somewhere into the input string
std::string input = get_input_data();
// initialize pbf message with this data
protozero::pbf_reader message{input};
// iterate over fields in the message
while (message.next()) {
// switch depending on the field tag (the field name is not available)
switch (message.tag()) {
case 1:
// get data for tag 1 (in this case an uint32)
auto x = message.get_uint32();
break;
case 2:
// get data for tag 2 (in this case a string)
std::string s = message.get_string();
break;
case 17:
// ignore data for tag 17
message.skip();
break;
default:
// ignore data for unknown tags to allow for future extensions
message.skip();
}
}
You always have to call next()
and then either one of the accessor functions
(like get_uint32()
or get_string()
) to get the field value or skip()
to
ignore this field. Then call next()
again, and so forth. Never call next()
twice in a row or any if the accessor or skip functions twice in a row.
Because the pbf_reader
class doesn't know the .proto
file it doesn't know
which field names or tags there are and it doesn't known the types of the
fields. You have to make sure to call the right get_...()
function for each
tag. Some assert()s
are done to check you are calling the right functions,
but not all errors can be detected.
Note that it doesn't matter whether a field is defined as required
,
optional
, or repeated
. You always have to be prepared to get zero, one, or
more instances of a field and you always have to be prepared to get other
fields, too, unless you want your program to break if somebody adds a new
field.
If, out of a protocol buffer message, you only need the value of a single
field, you can use the version of the next()
function with a parameter:
// same .proto file and initialization as above
// get all fields with tag 17, skip all others
while (message.next(17)) {
auto r = message.get_fixed64();
std::cout << r << "\n";
}
As you saw in the example, handling scalar field types is reasonably easy. You
just check the .proto
file for the type of a field and call the corresponding
function called get_
+ field type.
For string
and bytes
types the internal handling is exactly the same, but
both get_string()
and get_bytes()
are provided to make the code
self-documenting. Both theses calls allocate and return a std::string
which
can add some overhead. You can call the get_view()
function instead which
returns a data_view
containing a pointer into the data (access with data()
)
and the length of the data (access with size()
).
Fields that are marked as [packed=true]
in the .proto
file are handled
somewhat differently. get_packed_...()
functions returning an iterator range
are used to access the data.
So, for example, if you have a protocol description in a .proto
file like
this:
message Example2 {
repeated sint32 i = 1 [packed=true];
}
You can get to the data like this:
protozero::pbf_reader message{input.data(), input.size()};
// set current field
message.next(1);
// get an iterator range
auto pi = message.get_packed_sint32();
// iterate to get to all values
for (auto it = pi.begin(); it != pi.end(); ++it) {
std::cout << *it << '\n';
}
Or, with a range-based for-loop:
for (auto value : pi) {
std::cout << v << '\n';
}
So you are getting a pair of normal forward iterators wrapped in an iterator range object. The iterators can be used with any STL algorithms etc.
Note that the previous only applies to repeated packed fields, normal repeated fields are handled in the usual way for scalar fields.
Protocol Buffers can embed any message inside another message. To access an
embedded message use the get_message()
function. So for this description:
message Point {
required double x = 1;
required double y = 2;
}
message Example3 {
repeated Point point = 10;
}
you can parse with this code:
protozero::pbf_reader message{input};
while (message.next(10)) {
protozero::pbf_reader point = message.get_message();
double x, y;
while (point.next()) {
switch (point.tag()) {
case 1:
x = point.get_double();
break;
case 2:
y = point.get_double();
break;
default:
point.skip();
}
}
std::cout << "x=" << x << " y=" << y << "\n";
}
Enums are stored as varints and they can't be differentiated from them. Use
the get_enum()
function to get the value of the enum, you have to translate
this into the symbolic name yourself. See the enum
test case for an example.
Protozero uses assert()
liberally to help you find bugs in your own code when
compiled in debug mode (ie with NDEBUG
not set). If such an assert "fires",
this is a very strong indication that there is a bug in your code somewhere.
(Protozero will disable those asserts and "convert" them into exception in its own test code. This is done to make sure the asserts actually work as intended. Your test code will not need this!)
Exceptions, on the other hand, are thrown by Protozero if some kind of data corruption was detected while it is trying to parse the data. This could also be an indicator for a bug in the user code, but because it can happen if the data was (intentionally or not intentionally) been messed with, it is reported to the user code using exceptions.
Most of the functions on the writer side can throw a std::bad_alloc
exception if there is no space to grow a buffer. Other than that no exceptions
can occur on the writer side.
All exceptions thrown by the reader side derive from protozero::exception
.
Note that all exceptions can also happen if you are expecting a data field of
a certain type in your code but the field actually has a different type. In
that case the pbf_reader
class might interpret the bytes in the buffer in
the wrong way and anything can happen.
This will be thrown whenever any of the functions "runs out of input data". It means you either have an incomplete message in your input or some other data corruption has taken place.
This will be thrown if an unsupported wire type is encountered. Either your input data is corrupted or it was written with an unsupported version of a Protocol Buffers implementation.
This exception indicates an illegal encoding of a varint. It means your input data is corrupted in some way.
This exception is thrown when a tag has an invalid value. Tags must be unsigned integers between 1 and 2^29-1. Tags between 19000 and 19999 are not allowed. See https://developers.google.com/protocol-buffers/docs/proto#assigning-tags
This exception is thrown when a length field of a packed repeated field is invalid. For fixed size types the length must be a multiple of the size of the type.
The pbf_reader
class behaves like a value type. Objects are reasonably small
(two pointers and two uint32_t
, so 24 bytes on a 64bit system) and they can
be copied and moved around trivially.
pbf_reader
objects can be constructed from a std::string
or a const char*
and a length field (either supplied as separate arguments or as a std::pair
).
In all cases objects of the pbf_reader
class store a pointer into the input
data that was given to the constructor. You have to make sure this pointer
stays valid for the duration of the objects lifetime.
One problem in the code above are the "magic numbers" used as tags for the
different fields that you got from the .proto
file. Instead of spreading
these magic numbers around your code you can define them once in an enum class
and then use the pbf_message
template class instead of the
pbf_reader
class.
Here is the first example again, this time using this new technique. So you
have the following in a .proto
file:
message Example1 {
required uint32 x = 1;
optional string s = 2;
repeated fixed64 r = 17;
}
Add the following declaration in one of your header files:
enum class Example1 : protozero::pbf_tag_type {
required_uint32_x = 1,
optional_string_s = 2,
repeated_fixed64_r = 17
};
The message name becomes the name of the enum class
which is always built
on top of the protozero::pbf_tag_type
type. Each field in the message
becomes one value of the enum. In this case the name is created from the
type (including the modifiers like required
or optional
) and the name of
the field. You can use any name you want, but this convention makes it easier
later, to get everything right.
To read messages created according to that description, you will have code that
looks somewhat like this, this time using pbf_message
instead of
pbf_reader
:
#include <protozero/pbf_message.hpp>
// get data from somewhere into the input string
std::string input = get_input_data();
// initialize pbf message with this data
protozero::pbf_message<Example1> message{input};
// iterate over fields in the message
while (message.next()) {
// switch depending on the field tag (the field name is not available)
switch (message.tag()) {
case Example1::required_uint32_x:
auto x = message.get_uint32();
break;
case Example1::optional_string_s:
std::string s = message.get_string();
break;
case Example1::repeated_fixed64_r:
message.skip();
break;
default:
// ignore data for unknown tags to allow for future extensions
message.skip();
}
}
Note the correspondance between the enum value (for instance
required_uint32_x
) and the name of the getter function (for instance
get_uint32()
). This makes it easier to get the correct types. Also the
naming makes it easier to keep different message types apart if you have
multiple (or embedded) messages.
See the test/t/complex
test case for a complete example using this interface.
Using pbf_message
in favour of pbf_reader
is recommended for all code.
Note that pbf_message
derives from pbf_reader
, so you can always fall
back to the more generic interface if necessary.
One problem you might run into is the following: The enum class lists all
possible values you know about and you'll have lots of switch
statements
checking those values. Some compilers will know that your switch
covers
all possible cases and warn you if you have a default
case that looks
unneccessary to the compiler. But you still want that default
case to allow
for future extension of those messages (and maybe also to detect corrupted
data). You can switch of this warning with -Wno-covered-switch-default
).
To use the pbf_writer
class, add this include to your C++ program:
#include <protozero/pbf_writer.hpp>
The pbf_writer
class contains asserts that will detect some programming
errors. We encourage you to compile with asserts enabled in your debug builds.
Lets say you have a protocol description in a .proto
file like this:
message Example {
required uint32 x = 1;
optional string s = 2;
repeated fixed64 r = 17;
}
To write messages created according to that description, you will have code that looks somewhat like this:
#include <protozero/pbf_writer.hpp>
std::string data;
protozero::pbf_writer pbf_example{data};
pbf_example.add_uint32(1, 27); // uint32_t x
pbf_example.add_fixed64(17, 1); // fixed64 r
pbf_example.add_fixed64(17, 2);
pbf_example.add_fixed64(17, 3);
pbf_example.add_string(2, "foobar"); // string s
First you need a string which will be used as buffer to assemble the
protobuf-formatted message. The pbf_writer
object contains a reference to
this string buffer and through it you add data to that buffer piece by piece.
The buffer doesn't have to be empty, the pbf_writer
will simply append its
data to whatever is there already.
As you could see in the introductory example handling any kind of scalar field
is easy. The type of field doesn't matter and it doesn't matter whether it is
optional, required or repeated. You always call one of the add_TYPE()
method
on the pbf writer object.
The first parameter of these methods is always the tag of the field (the
field number) from the .proto
file. The second parameter is the value you
want to set. For the bytes
and string
types several versions of the add
method are available taking a const std::string&
or a const char*
and a
length.
For enum
types you have to use the numeric value as the symbolic names from
the .proto
file are not available.
Repeated packed fields can easily be set from a pair of iterators:
std::string data;
protozero::pbf_writer pw{data};
std::vector<int> v = { 1, 4, 9, 16, 25, 36 };
pw.add_packed_int32(1, std::begin(v), std::end(v));
If you don't have an iterator you can use the alternative form:
std::string data;
protozero::pbf_writer pw{data};
{
protozero::packed_field_int32 field{pw, 1};
field.add_element(1);
field.add_element(10);
field.add_element(100);
}
Of course you can add as many elements as you want. If you add no elements at all, this code will still work, Protozero detects this special case and pretends you never even initialized this field.
The nested scope is important in this case, because the destructor of the
field
object will make sure the length stored inside the field is set to
the right value. You must close that scope before adding other fields to the
pw
pbf writer.
If you know how many elements you will add to the field and your field contains fixed length elements, you can tell Protozero and it can optimize this case:
std::string data;
protozero::pbf_writer pw{data};
{
protozero::packed_field_fixed32 field{pw, 1, 2}; // exactly two elements
field.add_element(42);
field.add_element(13);
}
In this case you have to supply exactly as many elements as you promised, otherwise you will get a broken protobuf message.
This works for packed_field_fixed32
, packed_field_sfixed32
,
packed_field_fixed64
, packed_field_sfixed64
, packed_field_float
, and
packed_field_double
.
You can abandon writing of the packed field if this becomes necessary by
calling rollback()
:
std::string data;
protozero::pbf_writer pw{data};
{
protozero::packed_field_int32 field{pw, 1};
field.add_element(42);
// some error occurs, you don't want to have this field at all
field.rollback();
}
The result is the same as if the lines inside the nested brackets had never
been called. Do not try to call add_element()
after a rollback.
Nested sub-messages can be handled by first creating the submessage and then adding to the parent message:
std::string buffer_sub;
protozero::pbf_writer pbf_sub{buffer_sub};
// add fields to sub-message
pbf_sub.add_...(...);
// ...
// sub-message is finished here
std::string buffer_parent;
protozero::pbf_writer pbf_parent{buffer_parent};
pbf_parent.add_message(1, buffer_sub);
This is easy to do but it has the drawback of needing a separate std::string
buffer. If this concerns you (and why would you use protozero and not the
Google protobuf library if it doesn't?) there is another way:
std::string data;
protozero::pbf_writer pbf_parent{data};
// optionally add fields to parent here
pbf_parent.add_...(...);
// open a new scope
{
// create new pbf_writer with parent and the tag (field number)
// as parameters
protozero::pbf_writer pbf_sub{pbf_parent, 1};
// add fields to sub here...
pbf_sub.add_...(...);
} // closing the scope will close the sub-message
// optionally add more fields to parent here
pbf_parent.add_...(...);
This can be nested arbitrarily deep.
Internally the sub-message writer re-uses the buffer from the parent. It
reserves enough space in the buffer to later write the length of the submessage
into it. It then adds the contents of the submessage to the buffer. When the
pbf_sub
writer is destructed the length of the submessage is calculated and
written in the reserved space. If less space was needed for the length field
than was available, the rest of the buffer is moved over a few bytes.
You can abandon writing of submessage if this becomes necessary by
calling rollback()
:
std::string data;
protozero::pbf_writer pbf_parent{data};
// open a new scope
{
// create new pbf_writer with parent and the tag (field number)
// as parameters
protozero::pbf_writer pbf_sub{pbf_parent, 1};
// add fields to sub here...
pbf_sub.add_...(...);
// some problem occurs and you want to abandon the submessage:
pbf_sub.rollback();
}
// optionally add more fields to parent here
pbf_parent.add_...(...);
The result is the same as if the lines inside the nested brackets had never
been called. Do not try to call any of the add_*
functions on the submessage
after a rollback.
Just like the pbf_message
template class wraps the pbf_reader
class, there
is a pbf_builder
template class wrapping the pbf_writer
class. It is
instantiated using the same enum class
described above and used exactly
like the pbf_writer
class but using the values of the enum instead of bare
integers.
See the test/t/complex
test case for a complete example using this interface.