Generate FileDescriptorProto accessors #55

PeterJohnson · 2023-07-20T17:04:41Z

Follow-up to #47. It turns out that all of the standard protobuf implementations for reflection operate at the file level (FileDescriptorProto) rather than the individual protobuf level. This has propagated into other file formats, e.g. MCAP stores protobuf schemas as FileDescriptorSets (a set of FileDescriptorProtos). This would involve generating, at a minimum, a public static byte[] getFileDescriptorProto() at the generated file level.

I say "at a minimum" because this alone does not provide direct visibility into the .proto file's dependencies, so exporting a complete set requires external (non-generated) information or parsing of the FileDescriptorProto. The dependencies are listed in the FileDescriptorProto's "dependency" repeated string field, so they could be exposed that way (as a generated String[] getter), or maybe as something more Java'y (e.g. a Class<?>[], although that has some potential issues and may be more annoying than useful).


ennerf · 2023-07-21T01:36:15Z

The combination does get a bit messy from an API perspective. Could you generate a descriptor file with protoc --descriptor_set_out=.desc and load it as a resource at runtime?

protoc --include_imports --descriptor_set_out=<filename>.desc <filename>.proto <filename>.proto <filename>.proto
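For reference, loading such a file at runtime could look roughly like the sketch below. It only reads the raw descriptor-set bytes from the classpath and leaves parsing out entirely. The class name DescriptorSetResource and any resource path passed to it are hypothetical and not part of QuickBuffers or protoc.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class DescriptorSetResource {
    private DescriptorSetResource() {}

    /**
     * Reads a descriptor set (as written by protoc --descriptor_set_out)
     * from the classpath and returns its raw bytes.
     */
    static byte[] load(Class<?> anchor, String resourceName) {
        try (InputStream in = anchor.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IllegalStateException(
                        "descriptor set resource not found: " + resourceName);
            }
            return in.readAllBytes(); // requires Java 9 or newer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call might look like DescriptorSetResource.load(MyApp.class, "/schemas/app.desc"), where both MyApp and the path are placeholders.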

PeterJohnson · 2023-07-21T15:02:20Z

To give a bit more detail... my use case is the following:

1. We provide a library that provides classes and protobuf serialization for those classes, and hooks for serializing protobuf-serializable classes to the network (or log files). We also provide applications on the other side of the network/log files to view what's been sent.
2. Third parties provide libraries that provide their own classes and protobufs for those classes that can depend on protobufs defined in our library.
3. End users create applications and may or may not create their own protobufs (most do not). We want them to be able to transparently send either our classes or the third party classes over the network. We need the end user API to be extremely easy to use and somewhat dynamic in that the end user doesn't need to specify somewhere globally in their code what classes they're going to send, they just send it and the base library under the hood handles getting the descriptor published.
4. In the network case, multiple applications built and distributed by different parties (e.g. both third parties and end users) can connect and send data to each other. There is no guarantee of a common complete set of protobufs across these applications as they're built at different times by different people. There will be overlap (common individual protobufs from the base library or third party libraries) but the complete set will not be in common across all applications on the network. The network is pub/sub; individual topics have a type string and can be published/subscribed to by any application that knows how to talk to that type string.

So the problem is how to publish the descriptor data from the applications over the network (using the API provided by the base library) such that it is dynamically introspectable by tools.

We could have the build system for each application generate a file descriptor set for all of the descriptors for all of the libraries it uses (and anything the application itself has) and then publish that to the network to a unique location per application. It can't only publish the descriptors it actually uses because there's no way to get that information at build time or at runtime (with QB). This will result in duplication (as application 1 and application 2 will both publish file descriptor sets that contain base library descriptors, for example), and I'm not sure exactly how big the complete descriptor set is going to get (and this size is of course multiplied by the number of applications running). Tools will need to maintain separate descriptor databases for each application and figure out which descriptor to use from which database for a given type string, but that's relatively easy to do.

If each generated file provides access to its filename, file descriptor proto, and dependencies, we can walk this tree at runtime to either publish file descriptors individually or build and publish a file descriptor set (although that may not be possible given just a byte[] for a file descriptor), this time with only the file descriptors that are actually used at runtime by that particular application. My original thought was to do common publishing of file descriptors (uniquely indexed only by file name). The downside is that different applications could publish different versions of the file and conflict with each other in a way that's not discoverable by tools (for debugging purposes, the tools don't actually know which application published a particular typed value, so they have to pick one). It may therefore be better to also make this publish unique per application, effectively publishing a per-application file descriptor set. The main gain of this approach is then avoiding duplication of all the library file descriptors, which are mostly unused because most applications only use a tiny subset. It also makes conflicts between file descriptors substantially less likely, because each application publishes only the files it actually uses.
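To make the trade-off between the two keying schemes concrete, here is a small sketch. SchemaRegistry and both key formats are invented for illustration and are not part of any library mentioned in this thread. It only shows that keying by file name alone can collide when two applications carry different versions of a file, while per-application keys cannot.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SchemaRegistry {
    private final Map<String, byte[]> byKey = new HashMap<>();

    /** Hypothetical shared key: one entry per file name across all applications. */
    static String globalKey(String fileName) {
        return "proto:" + fileName;
    }

    /** Hypothetical per-application key: each application gets its own namespace. */
    static String perAppKey(String appId, String fileName) {
        return appId + "/proto:" + fileName;
    }

    /**
     * Stores the descriptor under the key if the key is unused.
     * Returns false when the key already holds different bytes (a version conflict).
     */
    boolean publish(String key, byte[] descriptor) {
        byte[] existing = byKey.putIfAbsent(key, descriptor.clone());
        return existing == null || Arrays.equals(existing, descriptor);
    }
}
```

With globalKey, a second application publishing a different version of the same file gets false back. With perAppKey, both versions coexist and a tool has to pick the database that matches the publisher.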

PeterJohnson · 2023-07-25T15:50:43Z

Thinking about this more, I had an idea: generate the FileDescriptorSet at build time, load it at runtime, and also manipulate it at runtime by extracting only the FileDescriptorProtos we care about. To avoid pulling in the entire set of upstream Google descriptors, I could have "lightweight" versions of those protobufs (e.g. my own version with only some of the fields defined). I think that as long as store_unknown_fields=true, I can use QB to parse the "lightweight" FileDescriptorSet/FileDescriptorProto and output them intact, either individually or as a new FileDescriptorSet. I still need to think through how the generation process is going to work in the multiple-library-and-user-builds scenario, and whether for that reason it might still be beneficial to embed the descriptors in the generated code instead.
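As a rough illustration of why the extraction can keep each FileDescriptorProto intact, the sketch below splits a serialized FileDescriptorSet by hand. It does not use QuickBuffers or store_unknown_fields; it is a hand-rolled wire-format reader, and the class name LightDescriptors is made up. It relies only on the field numbers from descriptor.proto (FileDescriptorSet.file = 1, FileDescriptorProto.name = 1, FileDescriptorProto.dependency = 3) and copies every other field through untouched.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LightDescriptors {
    private LightDescriptors() {}

    /** Returns the payload of every length-delimited occurrence of fieldNumber. */
    static List<byte[]> lengthDelimited(byte[] msg, int fieldNumber) {
        List<byte[]> out = new ArrayList<>();
        int[] pos = {0};
        while (pos[0] < msg.length) {
            long tag = readVarint(msg, pos);
            int wireType = (int) (tag & 7);
            int field = (int) (tag >>> 3);
            switch (wireType) {
                case 0: readVarint(msg, pos); break;   // varint: skip
                case 1: advance(msg, pos, 8); break;   // fixed64: skip
                case 5: advance(msg, pos, 4); break;   // fixed32: skip
                case 2: {                              // length-delimited
                    long len = readVarint(msg, pos);
                    int start = pos[0];
                    advance(msg, pos, len);
                    if (field == fieldNumber) {
                        out.add(Arrays.copyOfRange(msg, start, pos[0]));
                    }
                    break;
                }
                default: // groups (wire types 3 and 4) are not handled in this sketch
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return out;
    }

    private static void advance(byte[] msg, int[] pos, long n) {
        if (n < 0 || n > msg.length - pos[0]) {
            throw new IllegalArgumentException("truncated message");
        }
        pos[0] += (int) n;
    }

    private static long readVarint(byte[] msg, int[] pos) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (pos[0] >= msg.length) {
                throw new IllegalArgumentException("truncated varint");
            }
            byte b = msg[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("malformed varint");
    }

    /** FileDescriptorProto.name (field 1), or "" if absent. */
    static String fileName(byte[] fileDescriptorProto) {
        List<byte[]> names = lengthDelimited(fileDescriptorProto, 1);
        return names.isEmpty()
                ? ""
                : new String(names.get(names.size() - 1), StandardCharsets.UTF_8);
    }

    /** FileDescriptorProto.dependency (field 3) as file names. */
    static List<String> dependencies(byte[] fileDescriptorProto) {
        List<String> deps = new ArrayList<>();
        for (byte[] dep : lengthDelimited(fileDescriptorProto, 3)) {
            deps.add(new String(dep, StandardCharsets.UTF_8));
        }
        return deps;
    }

    /** Splits a FileDescriptorSet into file name -> serialized FileDescriptorProto. */
    static Map<String, byte[]> splitSet(byte[] fileDescriptorSet) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        for (byte[] file : lengthDelimited(fileDescriptorSet, 1)) {
            files.put(fileName(file), file);
        }
        return files;
    }
}
```

Because each file's bytes are copied verbatim, re-emitting a subset as a new FileDescriptorSet only requires writing the tag byte 0x0A, the varint length, and the bytes again for each chosen file.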

ennerf · 2023-07-25T19:18:27Z

Sorry, I forgot to reply earlier.

I was thinking of generating FileDescriptorProto data in the parent wrapper file. I think the bytes for the nested message descriptors are all contained inside, so I think it should be possible to access the message-descriptors by offset and length. The messages then only need to generate a small wrapper and a String for the full identifier.

The FileDescriptor could potentially also provide a List or Map of all identifiers. Creating a reduced version of the Google Descriptors w/ storing unknown fields should work as well.
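A minimal sketch of the offset-and-length idea, assuming the generator knows where each message's DescriptorProto sits inside the serialized file descriptor. Embedded messages are stored as contiguous length-delimited ranges, so a message descriptor can be a view into the shared file bytes. The class name DescriptorSlice and its constructor are hypothetical and do not reflect what the generator actually emits.

```java
import java.util.Arrays;

final class DescriptorSlice {
    private final String fullName;
    private final byte[] fileBytes; // serialized FileDescriptorProto, shared by all messages in the file
    private final int offset;
    private final int length;

    DescriptorSlice(String fullName, byte[] fileBytes, int offset, int length) {
        if (offset < 0 || length < 0 || offset > fileBytes.length - length) {
            throw new IndexOutOfBoundsException("slice is outside of the file descriptor bytes");
        }
        this.fullName = fullName;
        this.fileBytes = fileBytes;
        this.offset = offset;
        this.length = length;
    }

    String getFullName() {
        return fullName;
    }

    /** Copies this message's descriptor bytes out of the shared file descriptor. */
    byte[] toProtoBytes() {
        return Arrays.copyOfRange(fileBytes, offset, offset + length);
    }
}
```

Each generated message would then only carry a name, an offset, and a length, while the bytes themselves live once in the outer class.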

ennerf · 2023-08-04T07:38:32Z

Fyi, I just finished a large project I was working on. I need to do a few high-priority smaller items, but I should hopefully be able to get to this early next week.

PeterJohnson · 2023-09-17T14:33:38Z

Any update on this?

ennerf · 2023-09-19T16:24:04Z

Sorry, it got delayed a lot and I'm still a bit confused about the requirements.

From what I saw in the protobuf-java API I think you are looking for equivalents for the two methods:

String fullName = MyProtoType.getDescriptor().getFullName();
byte[] fileDescriptor = MyProtoType.getDescriptor().getFile().toProto().toByteArray();

but I wonder how that works with protos that are defined in other files? I saw a MyProtoType.getDescriptor().getFile().getDependencies(), but how would those be serialized? Is there some top-level descriptor message that includes everything? Can you provide a code snippet that shows how you would write the descriptor with the official API?

PeterJohnson · 2023-09-20T04:14:55Z

What I'm currently doing in C++ is the following. Note that I'm not actually publishing a FileDescriptorSet as such. Instead, I'm publishing (via the callback function fn) the serialized FileDescriptorProto of each file in the whole dependency tree, starting from the proto's file descriptor.

static void ForEachProtobufDescriptorImpl(
    const FileDescriptor* desc,
    function_ref<bool(std::string_view typeString)> wants,
    function_ref<void(std::string_view typeString,
                      std::span<const uint8_t> schema)>
        fn,
    Arena* arena) {
  if (!wants(desc->name())) {
    return;
  }
  for (int i = 0, ndep = desc->dependency_count(); i < ndep; ++i) {
    ForEachProtobufDescriptorImpl(desc->dependency(i), wants, fn, arena);
  }
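  // Arena::CreateMessage heap-allocates when arena is null; otherwise the
  // arena owns the message and frees it.
  FileDescriptorProto* descproto =
      Arena::CreateMessage<FileDescriptorProto>(arena);
  desc->CopyTo(descproto);
  std::vector<uint8_t> buf;
  detail::SerializeProtobuf(buf, *descproto);
  if (arena == nullptr) {
    delete descproto;  // deleting an arena-owned message is undefined behavior
  }
  fn(fmt::format("proto:{}", desc->name()), buf);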
}

void detail::ForEachProtobufSchema(
    const google::protobuf::Message& msg,
    function_ref<bool(std::string_view filename)> wants,
    function_ref<void(std::string_view filename,
                      std::span<const uint8_t> descriptor)>
        fn) {
  ForEachProtobufDescriptorImpl(msg.GetDescriptor()->file(), wants, fn,
                                msg.GetArena());
}

ennerf · 2023-09-20T10:42:41Z

Thanks. Would an API like below (reduced from protobuf-java) work?

class SomeGeneratedMessage extends ProtoMessage {
  public static Descriptor getDescriptor();
} 

interface Descriptor {
  FileDescriptor getFile();
  String getName();
  String getFullName();
  byte[] toProtoBytes();
}

interface FileDescriptor {
  String getName();
  String getFullName();
  String getPackage();
  byte[] toProtoBytes();
  List<FileDescriptor> getDependencies();
}
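With an API of that shape, the C++ traversal from the previous comment could be approximated as sketched below. FileDescriptorLike is a stand-in that copies only three methods from the proposal so the example compiles on its own; it is not the released QuickBuffers type. SchemaPublisher and StubFile are likewise hypothetical, and the visited set is an assumed replacement for the C++ wants callback.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;

/** Stand-in for the proposed FileDescriptor interface (assumption, see above). */
interface FileDescriptorLike {
    String getName();
    byte[] toProtoBytes();
    List<FileDescriptorLike> getDependencies();
}

final class SchemaPublisher {
    private SchemaPublisher() {}

    /**
     * Publishes the file and its transitive dependencies exactly once each.
     * Dependencies are published before the file that imports them.
     */
    static void forEachFile(FileDescriptorLike file,
                            Set<String> visited,
                            BiConsumer<String, byte[]> publish) {
        if (!visited.add(file.getName())) {
            return; // already published via another path in the dependency graph
        }
        for (FileDescriptorLike dep : file.getDependencies()) {
            forEachFile(dep, visited, publish);
        }
        publish.accept("proto:" + file.getName(), file.toProtoBytes());
    }
}

/** Tiny in-memory implementation, used only to exercise the traversal. */
final class StubFile implements FileDescriptorLike {
    private final String name;
    private final List<FileDescriptorLike> deps;

    StubFile(String name, List<FileDescriptorLike> deps) {
        this.name = name;
        this.deps = deps;
    }

    @Override public String getName() { return name; }

    @Override public byte[] toProtoBytes() {
        return name.getBytes(StandardCharsets.UTF_8); // placeholder, not a real descriptor
    }

    @Override public List<FileDescriptorLike> getDependencies() { return deps; }
}
```

Passing the same visited set across multiple calls keeps a shared base file from being published again when a second message type from another file is sent later.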

PeterJohnson · 2023-09-20T14:36:08Z

That looks great!

ennerf · 2023-09-21T00:54:26Z

I implemented an initial version of the API above. The generated code currently looks like this: https://gist.github.com/ennerf/222e68f6b6ac5fb2600c58ec35804457

with each message generating a static getDescriptor() accessor:

public static class MessageSetCorrectExtension2 {
    // ...
    public static Descriptors.Descriptor getDescriptor() {
        return AllTypesOuterClass.internal_static_quickbuf_unittest_TestAllTypes_MessageSetCorrectExtension2_descriptor;
    }
    // ...
}

ennerf · 2023-09-21T11:55:20Z

I also added FileDescriptor::getAllContainedTypes and FileDescriptor::getAllKnownTypes so it's easier to create a lookup table of all types and their dependencies. Please double-check PR #57.

I removed the gpg requirement, so you should be able to run mvn clean install for local testing.

PeterJohnson · 2023-09-22T04:18:14Z

Thanks! I'll try it out this weekend.

PeterJohnson · 2023-10-06T04:39:04Z

~~So mvn clean install builds the Java artifacts, but how do I build the matching protoc-gen-quickbuf?~~

Never mind, figured it out.

ennerf · 2023-10-07T13:08:29Z

Sorry, I just got back from a conference. Do you have any more suggestions or is the PR state good as is?

PeterJohnson · 2023-10-07T14:56:47Z

It’s good as is for what I need. Thanks!

ennerf · 2023-10-07T15:32:50Z

Thanks for verifying. I'll get a release out soon.

ennerf · 2023-10-09T15:01:11Z

Version 1.3.2 is on Maven Central.

ennerf added a commit that referenced this issue Sep 21, 2023

implemented proto descriptors as discussed in #55

1d15974

ennerf added a commit that referenced this issue Sep 21, 2023

implemented proto descriptors as discussed in #55

335617f

ennerf mentioned this issue Sep 21, 2023

Added descriptor API #57

Merged

ennerf closed this as completed Oct 7, 2023
