Generate FileDescriptorProto accessors #55

Closed
PeterJohnson opened this issue Jul 20, 2023 · 18 comments

@PeterJohnson

Follow-up to #47. It turns out that all of the standard protobuf implementations for reflection operate at the file level (FileDescriptorProto) rather than the individual protobuf level. This has propagated into other file formats, e.g. MCAP stores protobuf schemas as FileDescriptorSets (a set of FileDescriptorProtos). This would involve generating, at a minimum, a public static byte[] getFileDescriptorProto() at the generated file level.

The reason I say "at a minimum" is that this alone does not provide direct visibility into the .proto file's dependencies, so exporting a complete set requires external (non-generated) information or parsing of the FileDescriptorProto. The dependencies are visible through the FileDescriptorProto's "dependency" repeated string field, so they could be exposed that way (as a generated String[] getter), or maybe something more Java'y (e.g. a Class<?>[], although that has some potential issues and may be more annoying than useful).

@ennerf
Collaborator

ennerf commented Jul 21, 2023

The combination does get a bit messy from an API perspective. Could you generate a descriptor file with protoc --descriptor_set_out=.desc and load it as a resource at runtime?

protoc --include_imports --descriptor_set_out=<filename>.desc <filename>.proto <filename>.proto <filename>.proto

@PeterJohnson
Author

PeterJohnson commented Jul 21, 2023

To give a bit more detail... my use case is the following:

  1. We provide a library that provides classes and protobuf serialization for those classes, and hooks for serializing protobuf-serializable classes to the network (or log files). We also provide applications on the other side of the network/log files to view what's been sent.
  2. Third parties provide libraries that provide their own classes and protobufs for those classes, and these protobufs can depend on protobufs defined in our library.
  3. End users create applications and may or may not create their own protobufs (most do not). We want them to be able to transparently send either our classes or the third party classes over the network. We need the end user API to be extremely easy to use and somewhat dynamic in that the end user doesn't need to specify somewhere globally in their code what classes they're going to send, they just send it and the base library under the hood handles getting the descriptor published.
  4. In the network case, multiple applications built and distributed by different parties (e.g. both third parties and end users) can connect and send data to each other. There is no guarantee of a common complete set of protobufs across these applications as they're built at different times by different people. There will be overlap (common individual protobufs from the base library or third party libraries) but the complete set will not be in common across all applications on the network. The network is pub/sub; individual topics have a type string and can be published/subscribed to by any application that knows how to talk to that type string.

So the problem is how to publish the descriptor data from the applications over the network (using the API provided by the base library) such that it is dynamically introspectable by tools.

We could have the build system for each application generate a file descriptor set for all of the descriptors for all of the libraries it uses (and anything the application itself has) and then publish that to the network to a unique location per application. It can't only publish the descriptors it actually uses because there's no way to get that information at build time or at runtime (with QB). This will result in duplication (as application 1 and application 2 will both publish file descriptor sets that contain base library descriptors, for example), and I'm not sure exactly how big the complete descriptor set is going to get (and this size is of course multiplied by the number of applications running). Tools will need to maintain separate descriptor databases for each application and figure out which descriptor to use from which database for a given type string, but that's relatively easy to do.

If each generated file provides access to its filename, file descriptor proto, and dependencies, we can walk this tree at runtime to either individually publish file descriptors or build and publish a file descriptor set (although that may not be possible given just a byte[] for a file descriptor), this time with only the file descriptors that are actually used at runtime by that particular application. My original thought was to do common publishing of file descriptors (uniquely indexed only by file name). The downside is that different applications could publish different versions of the file and conflict with each other in a way that's not discoverable by tools (for debugging purposes, the tools don't actually know which application published a particular typed value, so they have to pick one). So it may be better to also make this publish unique per application, effectively publishing a per-application file descriptor set. The main gain of this approach is then avoiding duplication of all the library file descriptors, which are mostly unused because most applications will only use a tiny subset. It also makes it substantially less likely that conflicts will arise between the file descriptors, because only the files actually being used by an application are getting published by that application.
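
To make the conflict concern concrete, here is a minimal sketch of a registry indexed only by file name that refuses a second, different descriptor for the same name. The class and method names (DescriptorRegistry, publish) are hypothetical and not part of QuickBuffers or any library in this thread; a tool could keep one such registry per application to get the per-application behavior described above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: a descriptor store keyed by .proto file name.
final class DescriptorRegistry {
    private final Map<String, byte[]> byFileName = new LinkedHashMap<>();

    /**
     * Stores the serialized file descriptor under its file name.
     * Returns true if it was newly added and false if identical bytes were
     * already present. Throws if different bytes exist under the same name.
     */
    boolean publish(String fileName, byte[] descriptorBytes) {
        byte[] existing = byFileName.get(fileName);
        if (existing == null) {
            byFileName.put(fileName, descriptorBytes.clone());
            return true;
        }
        if (!Arrays.equals(existing, descriptorBytes)) {
            throw new IllegalStateException("conflicting descriptor for " + fileName);
        }
        return false;
    }

    int size() {
        return byFileName.size();
    }
}
```

With a single shared registry, the exception is the only point where a version conflict becomes visible; keeping one registry per application avoids the conflict at the cost of some duplication.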

@PeterJohnson
Author

PeterJohnson commented Jul 25, 2023

Thinking about this more, I had an idea--generate the FileDescriptorSet at build time, load it at runtime, and also manipulate it at runtime by extracting only the FileDescriptorProtos we care about. To avoid pulling in the entire Google upstream descriptors, I could have "lightweight" versions of those protobufs (e.g. my own version with only some of the fields defined), and I think as long as store_unknown_fields=true, I can use QB to parse the "lightweight" FileDescriptorSet/FileDescriptorProto and output them intact either individually or as a new FileDescriptorSet? I still need to think through how the generation process is going to work in the multiple-library-and-user-builds scenario, and whether for that reason it might still be beneficial to embed in the generated code instead.
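
As a rough illustration of why the individual files can be extracted intact without the full Google descriptor classes: in descriptor.proto, FileDescriptorSet is just `repeated FileDescriptorProto file = 1`, so each file is a length-delimited field 1 entry on the wire. The sketch below is not the QuickBuffers "lightweight message" approach described above; it swaps in a hand-written wire-format walk using only the JDK, and the class name DescriptorSetSplitter is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of QuickBuffers or protobuf-java.
final class DescriptorSetSplitter {

    /** Returns the raw bytes of each field-1 entry in a serialized FileDescriptorSet. */
    static List<byte[]> splitFiles(byte[] set) {
        List<byte[]> files = new ArrayList<>();
        int[] pos = {0};
        while (pos[0] < set.length) {
            long tag = readVarint(set, pos);
            int fieldNumber = (int) (tag >>> 3);
            int wireType = (int) (tag & 7);
            switch (wireType) {
                case 0: readVarint(set, pos); break; // varint: read and discard
                case 1: pos[0] += 8; break;          // fixed64: skip
                case 5: pos[0] += 4; break;          // fixed32: skip
                case 2: {                            // length-delimited
                    int len = (int) readVarint(set, pos);
                    if (len < 0 || len > set.length - pos[0]) {
                        throw new IllegalArgumentException("truncated field");
                    }
                    if (fieldNumber == 1) {
                        // Copy the payload untouched so unknown inner fields survive.
                        files.add(Arrays.copyOfRange(set, pos[0], pos[0] + len));
                    }
                    pos[0] += len;
                    break;
                }
                default:
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return files;
    }

    /** Decodes one base-128 varint starting at pos[0] and advances pos[0]. */
    static long readVarint(byte[] buf, int[] pos) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (pos[0] >= buf.length) {
                throw new IllegalArgumentException("truncated varint");
            }
            byte b = buf[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("malformed varint");
    }
}
```

Because the payload bytes are copied without being parsed, this only splits the set. Selecting files by name or following dependencies would still require reading the name and dependency fields inside each payload.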

@ennerf
Collaborator

ennerf commented Jul 25, 2023

Sorry, I forgot to reply earlier.

I was thinking of generating FileDescriptorProto data in the parent wrapper file. I think the bytes for the nested message descriptors are all contained inside, so it should be possible to access the message descriptors by offset and length. The messages then only need to generate a small wrapper and a String for the full identifier.

The FileDescriptor could potentially also provide a List or Map of all identifiers. Creating a reduced version of the Google Descriptors w/ storing unknown fields should work as well.
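
A minimal sketch of the offset-and-length idea, assuming the generator knows where each message's descriptor sits inside the serialized file descriptor. The class name EmbeddedDescriptor and its constructor are hypothetical and are not taken from the eventual QuickBuffers implementation.

```java
import java.util.Arrays;

// Hypothetical illustration only; names are not taken from QuickBuffers.
final class EmbeddedDescriptor {
    private final String fullName;
    private final byte[] fileDescriptorBytes; // shared with the enclosing file wrapper
    private final int offset;
    private final int length;

    EmbeddedDescriptor(String fullName, byte[] fileDescriptorBytes, int offset, int length) {
        if (offset < 0 || length < 0 || offset > fileDescriptorBytes.length - length) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.fullName = fullName;
        this.fileDescriptorBytes = fileDescriptorBytes;
        this.offset = offset;
        this.length = length;
    }

    String getFullName() {
        return fullName;
    }

    /** Copies this message's descriptor bytes out of the file-level array. */
    byte[] toProtoBytes() {
        return Arrays.copyOfRange(fileDescriptorBytes, offset, offset + length);
    }
}
```

Each generated message would then carry only a name and two integers, while the descriptor bytes themselves are stored once per file.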

@ennerf
Collaborator

ennerf commented Aug 4, 2023

FYI, I just finished a large project I was working on. I need to do a few smaller high-priority items first, but I should hopefully be able to get to this early next week.

@PeterJohnson
Author

Any update on this?

@ennerf
Collaborator

ennerf commented Sep 19, 2023

Sorry, it got delayed a lot and I'm still a bit confused about the requirements.

From what I saw in the protobuf-java API, I think you are looking for equivalents of these two calls:

String fullName = MyProtoType.getDescriptor().getFullName();
byte[] fileDescriptor = MyProtoType.getDescriptor().getFile().toProto().toByteArray();

but I wonder how that works with protos that are defined in other files? I saw a MyProtoType.getDescriptor().getFile().getDependencies(), but how would those be serialized? Is there some top-level descriptor message that includes everything? Can you provide a code snippet that shows how you would write the descriptor with the official API?

@PeterJohnson
Author

PeterJohnson commented Sep 20, 2023

What I'm currently doing in C++ is the following. Note that I'm not actually publishing a FileDescriptorSet as such. Instead, I'm publishing (via the callback function fn) the serialized file descriptor of each file in the whole dependency tree, starting from the proto's file descriptor.

static void ForEachProtobufDescriptorImpl(
    const FileDescriptor* desc,
    function_ref<bool(std::string_view typeString)> wants,
    function_ref<void(std::string_view typeString,
                      std::span<const uint8_t> schema)>
        fn,
    Arena* arena) {
  if (!wants(desc->name())) {
    return;
  }
  for (int i = 0, ndep = desc->dependency_count(); i < ndep; ++i) {
    ForEachProtobufDescriptorImpl(desc->dependency(i), wants, fn, arena);
  }
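  // Arena::CreateMessage heap-allocates when arena is null, so the message
  // may only be deleted manually in that case; otherwise the arena owns it.
  FileDescriptorProto* descproto = Arena::CreateMessage<FileDescriptorProto>(arena);
  descproto->Clear();
  desc->CopyTo(descproto);
  std::vector<uint8_t> buf;
  detail::SerializeProtobuf(buf, *descproto);
  if (arena == nullptr) {
    delete descproto;
  }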
  fn(fmt::format("proto:{}", desc->name()), buf);
}

void detail::ForEachProtobufSchema(
    const google::protobuf::Message& msg,
    function_ref<bool(std::string_view filename)> wants,
    function_ref<void(std::string_view filename,
                      std::span<const uint8_t> descriptor)>
        fn) {
  ForEachProtobufDescriptorImpl(msg.GetDescriptor()->file(), wants, fn,
                                msg.GetArena());
}

@ennerf
Collaborator

ennerf commented Sep 20, 2023

Thanks. Would an API like the one below (reduced from protobuf-java) work?

class SomeGeneratedMessage extends ProtoMessage {
  public static Descriptor getDescriptor();
} 

interface Descriptor {
  FileDescriptor getFile();
  String getName();
  String getFullName();
  byte[] toProtoBytes();
}

interface FileDescriptor {
  String getName();
  String getFullName();
  String getPackage();
  byte[] toProtoBytes();
  List<FileDescriptor> getDependencies();
}
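
For illustration, the C++ dependency walk from earlier in the thread could be ported onto this proposed API roughly as follows. This is a sketch under assumptions: the nested FileDescriptor interface is a local stand-in that mirrors only three of the proposed methods, and DescriptorWalker, forEachFile, and stub are hypothetical names introduced here (stub exists only to make the example runnable and returns placeholder bytes, not a real descriptor).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Hypothetical sketch; not generated code and not the QuickBuffers API itself.
final class DescriptorWalker {

    /** Local stand-in mirroring a subset of the proposed FileDescriptor interface. */
    interface FileDescriptor {
        String getName();
        byte[] toProtoBytes();
        List<FileDescriptor> getDependencies();
    }

    /**
     * Visits dependencies first, then the file itself, like the C++ version.
     * 'wants' decides whether a file (and its subtree) is visited at all;
     * 'fn' receives the "proto:"-prefixed type string and the serialized descriptor.
     */
    static void forEachFile(FileDescriptor file,
                            Predicate<String> wants,
                            BiConsumer<String, byte[]> fn) {
        if (!wants.test(file.getName())) {
            return;
        }
        for (FileDescriptor dep : file.getDependencies()) {
            forEachFile(dep, wants, fn);
        }
        fn.accept("proto:" + file.getName(), file.toProtoBytes());
    }

    /** Test double: the "descriptor bytes" are just the UTF-8 file name. */
    static FileDescriptor stub(String name, FileDescriptor... deps) {
        List<FileDescriptor> depList = Arrays.asList(deps);
        return new FileDescriptor() {
            public String getName() { return name; }
            public byte[] toProtoBytes() { return name.getBytes(StandardCharsets.UTF_8); }
            public List<FileDescriptor> getDependencies() { return depList; }
        };
    }
}
```

Passing something like `seen::add` on a `Set<String>` as the `wants` predicate makes each file get published at most once, even when several files share a dependency.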

@PeterJohnson
Author

That looks great!

@ennerf
Collaborator

ennerf commented Sep 21, 2023

I implemented an initial version of the API above. The generated code currently looks like this: https://gist.github.com/ennerf/222e68f6b6ac5fb2600c58ec35804457

with each message generating a

public static class MessageSetCorrectExtension2 {
    // ...
    public static Descriptors.Descriptor getDescriptor() {
        return AllTypesOuterClass.internal_static_quickbuf_unittest_TestAllTypes_MessageSetCorrectExtension2_descriptor;
    }
    // ...
}

@ennerf
Collaborator

ennerf commented Sep 21, 2023

I also added FileDescriptor::getAllContainedTypes and FileDescriptor::getAllKnownTypes so it's easier to create a lookup table of all types and their dependencies. Please double-check PR #57.

I removed the gpg requirement, so you should be able to run mvn clean install for local testing.

@PeterJohnson
Author

Thanks! I'll try it out this weekend.

@PeterJohnson
Author

PeterJohnson commented Oct 6, 2023

So mvn clean install builds the Java artifacts, but how do I build the matching protoc-gen-quickbuf?

Never mind, figured it out.

@ennerf
Collaborator

ennerf commented Oct 7, 2023

Sorry, I just got back from a conference. Do you have any more suggestions or is the PR state good as is?

@PeterJohnson
Author

It’s good as is for what I need. Thanks!

@ennerf
Collaborator

ennerf commented Oct 7, 2023

Thanks for verifying. I'll get a release out soon.

@ennerf ennerf closed this as completed Oct 7, 2023
@ennerf
Collaborator

ennerf commented Oct 9, 2023

Version 1.3.2 is on Maven Central.
