JSON Serialization with default value #861

Open
seanliu1 opened this issue Apr 3, 2019 · 60 comments
Comments

@seanliu1

seanliu1 commented Apr 3, 2019

I'm developing with iOS 12, Swift 5, and proto3. I am about to add an extension that outputs fields with their default values, and I want to check whether this is already implemented.

According to the proto docs:

JSON options
A proto3 JSON implementation may provide the following options:

Emit fields with default values: Fields with default values are omitted by default in proto3 JSON output. An implementation may provide an option to override this behavior and output fields with their default values.

Does the Swift version have an option to output fields with their default values? I found that the Python version has one: MessageToJson(message, including_default_value_fields=False)

https://developers.google.com/protocol-buffers/docs/reference/python/google.protobuf.json_format-module

@thomasvl
Collaborator

thomasvl commented Apr 3, 2019

p.s. - adding this as an option would be tricky since JSON encoding is built on the visitor pattern which is where it is skipping default values.

@seanliu1
Author

seanliu1 commented Apr 3, 2019

@thomasvl thanks for your quick reply. Yes, I realized that default values are skipped in the generated code.

  func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V) throws {
    if !self.cards.isEmpty {
      try visitor.visitRepeatedMessageField(value: self.cards, fieldNumber: 1)
    }
    if self.test != false {
      try visitor.visitSingularBoolField(value: self.test, fieldNumber: 2)
    }
    try unknownFields.traverse(visitor: &visitor)
  }
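A minimal, self-contained sketch of why the option is hard to add at the JSON layer. The types below (ToyVisitor, ToyJSONVisitor, ToyMessage) are hypothetical stand-ins, not the real SwiftProtobuf API; they only mirror the shape of the generated code above. Because the default check sits in traverse, the visitor is never told about default-valued fields and so cannot emit them, whatever options it holds.

```swift
// Hypothetical stand-in for SwiftProtobuf.Visitor, reduced to two methods.
protocol ToyVisitor {
    mutating func visitSingularUInt32Field(value: UInt32, fieldNumber: Int) throws
    mutating func visitSingularBoolField(value: Bool, fieldNumber: Int) throws
}

// Toy JSON visitor: it can only emit the fields it is asked to visit.
struct ToyJSONVisitor: ToyVisitor {
    private let names: [Int: String]
    private var parts: [String] = []

    init(names: [Int: String]) { self.names = names }

    var json: String { "{" + parts.joined(separator: ",") + "}" }

    private func key(_ fieldNumber: Int) -> String {
        "\"" + (names[fieldNumber] ?? String(fieldNumber)) + "\""
    }

    mutating func visitSingularUInt32Field(value: UInt32, fieldNumber: Int) throws {
        parts.append(key(fieldNumber) + ":" + String(value))
    }

    mutating func visitSingularBoolField(value: Bool, fieldNumber: Int) throws {
        parts.append(key(fieldNumber) + ":" + String(value))
    }
}

// Mirrors the shape of generated proto3 code: the default check lives here.
struct ToyMessage {
    var showID: UInt32 = 0
    var test: Bool = false

    func traverse<V: ToyVisitor>(visitor: inout V) throws {
        if showID != 0 {
            try visitor.visitSingularUInt32Field(value: showID, fieldNumber: 1)
        }
        if test != false {
            try visitor.visitSingularBoolField(value: test, fieldNumber: 2)
        }
    }
}

func toyJSON(_ message: ToyMessage) throws -> String {
    var visitor = ToyJSONVisitor(names: [1: "showId", 2: "test"])
    try message.traverse(visitor: &visitor)
    return visitor.json
}
```

With these toy types, a message with all defaults serializes to an empty object, and only non-default fields ever reach the visitor.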

@tbkka
Contributor

tbkka commented Apr 3, 2019

Is there a way to do this without breaking the API between generated code and the library?

I see how old generated code could continue to work with an updated library, but I don't see how new generated code (with modified default handling) could work with an old library.

If we can't preserve this compatibility, we would probably want to increment the major version. It's been a while since 1.0, so this is probably okay. We should certainly consider whether there are other issues that might require API breakage that should be adopted along with this.

@seanliu1
Author

seanliu1 commented Apr 3, 2019

What I found is that we currently do the check in the generated code (xx.pb.swift).

  func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V) throws {
    if self.showID != 0 {
      try visitor.visitSingularUInt32Field(value: self.showID, fieldNumber: 1)
    }
    try unknownFields.traverse(visitor: &visitor)
  }

What if we add an option to JSONEncodingOptions and then change the visitor method to

  mutating func visitSingularUInt32Field(value: UInt32, fieldNumber: Int) throws {
    if value != 0 || options.includeDefaultParameters {
      try startField(for: fieldNumber)
      encoder.putUInt32(value: value)
    }
  }

Then the API stays the same, and the default check happens inside the actual visitSingularUInt32Field, visitSingularUInt64Field, visitSingularStringField, and so on.
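A rough sketch of this proposal, using hypothetical stand-in types (SketchJSONOptions, SketchJSONVisitor, SketchMessage) and a field name in place of a field number to keep it short. It assumes the generated traverse is changed to visit every field unconditionally, so that the zero check can move into the visitor.

```swift
// Hypothetical option set; `includeDefaults` is not an existing SwiftProtobuf option.
struct SketchJSONOptions {
    var includeDefaults = false
}

struct SketchJSONVisitor {
    let options: SketchJSONOptions
    private var parts: [String] = []

    init(options: SketchJSONOptions) { self.options = options }

    var json: String { "{" + parts.joined(separator: ",") + "}" }

    // The proto3 zero check now lives in the visitor instead of generated code.
    mutating func visitSingularUInt32Field(value: UInt32, name: String) {
        if value != 0 || options.includeDefaults {
            parts.append("\"" + name + "\":" + String(value))
        }
    }
}

struct SketchMessage {
    var showID: UInt32 = 0

    // In this variant the generated code visits every field unconditionally.
    func traverse(visitor: inout SketchJSONVisitor) {
        visitor.visitSingularUInt32Field(value: showID, name: "showId")
    }
}

func sketchJSON(_ message: SketchMessage,
                options: SketchJSONOptions = SketchJSONOptions()) -> String {
    var visitor = SketchJSONVisitor(options: options)
    message.traverse(visitor: &visitor)
    return visitor.json
}
```

The trade-off is that every visitor, not just the JSON one, would then have to carry its own default check.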

@tbkka
Contributor

tbkka commented Apr 3, 2019

This is a good idea. I'd need to think about how this works for proto2, though.

@thomasvl
Collaborator

thomasvl commented Apr 3, 2019

The visitor API shouldn't know about JSONEncodingOptions; we'd need to make some other (internal) "options" type to expose whatever subset makes sense for all possible visitors.

@seanliu1
Author

seanliu1 commented Apr 3, 2019

I think JSONEncodingVisitor currently owns JSONEncodingOptions, and we are using JSONEncodingOptions.alwaysPrintEnumsAsInts in the JSONEncodingVisitor implementation. If this is something we want to support across all visitors, you are right, it probably should be some other internal options type.

  mutating func visitSingularEnumField<E: Enum>(value: E, fieldNumber: Int) throws {
    try startField(for: fieldNumber)
    if !options.alwaysPrintEnumsAsInts, let n = value.name {
      encoder.appendQuoted(name: n)
    } else {
      encoder.putEnumInt(value: value.rawValue)
    }
  }

@thomasvl
Collaborator

thomasvl commented Apr 3, 2019

The call to JSONEncodingVisitor is happening on the generic visitor interface and that call is blocked on the value not being the default, so all the JSON specific types are removed from where the change is needed.

@seanliu1
Author

seanliu1 commented Apr 3, 2019

The call to JSONEncodingVisitor is happening on the generic visitor interface and that call is blocked on the value not being the default, so all the JSON specific types are removed from where the change is needed.

Oh, I see. Yes, the approach I originally proposed requires every visitor to include the default value check, which is not very flexible. Here is another way I have in mind.

We add a function to the Visitor protocol

protocol Visitor {
    func shouldIncludeDefaultValue() -> Bool
    ....
}

We provide a default implementation, as we do for other methods in the Visitor protocol

extension Visitor {
    func shouldIncludeDefaultValue() -> Bool {
        return false
    }
    ...
}

In JSONEncodingVisitor, we can implement it by checking shouldIncludeDefaultValue from options.

The last change will be codegen.

Every traverse function will then need to check it

  func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V) throws {
    if self.showID != 0 || visitor.shouldIncludeDefaultValue() {
      try visitor.visitSingularUInt32Field(value: self.showID, fieldNumber: 1)
    }
  }

I guess this could also be used by textFormatString.

@tbkka @thomasvl let me know what you think.
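A small sketch of the protocol-hook idea above, under stated assumptions: DemoVisitor, RecordingVisitor, DefaultsVisitor and DemoMessage are hypothetical names, and shouldIncludeDefaultValue() is the proposed hook, not an existing SwiftProtobuf method. Because the hook is a protocol requirement with a default implementation, visitors that do not care inherit `false`, while one visitor can opt in.

```swift
protocol DemoVisitor {
    // Proposed hook (hypothetical): may generated code visit default values?
    func shouldIncludeDefaultValue() -> Bool
    mutating func visitSingularUInt32Field(value: UInt32, fieldNumber: Int) throws
}

extension DemoVisitor {
    // Default used by every visitor that does not override the hook.
    func shouldIncludeDefaultValue() -> Bool { false }
}

// A visitor that keeps the existing behavior.
struct RecordingVisitor: DemoVisitor {
    var visited: [Int] = []
    mutating func visitSingularUInt32Field(value: UInt32, fieldNumber: Int) throws {
        visited.append(fieldNumber)
    }
}

// A visitor that opts in, as a JSON visitor could do based on its options.
struct DefaultsVisitor: DemoVisitor {
    var includeDefaults: Bool
    var visited: [Int] = []
    func shouldIncludeDefaultValue() -> Bool { includeDefaults }
    mutating func visitSingularUInt32Field(value: UInt32, fieldNumber: Int) throws {
        visited.append(fieldNumber)
    }
}

struct DemoMessage {
    var showID: UInt32 = 0

    func traverse<V: DemoVisitor>(visitor: inout V) throws {
        // Ask once per traverse rather than once per field.
        let includeDefaults = visitor.shouldIncludeDefaultValue()
        if includeDefaults || showID != 0 {
            try visitor.visitSingularUInt32Field(value: showID, fieldNumber: 1)
        }
    }
}
```

Querying the hook once at the top of traverse keeps the number of calls through the generic visitor down to one per message.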

@tbkka
Contributor

tbkka commented Apr 3, 2019

@seanliu1's idea would have:

  • Old generated code with new library: default values would always be omitted
  • New generated code with old library: default values would always be included

I'll have to think about whether this is enough compatibility to avoid the major version bump.

@seanliu1
Author

seanliu1 commented Apr 3, 2019

@tbkka see this comment: #861 (comment)

People won't notice any change unless shouldIncludeDefaultValue is provided.
So:

Old generated code with new library: default values would always be omitted
New generated code with old library: default values would always be omitted

@thomasvl
Collaborator

thomasvl commented Apr 4, 2019

We likely should make a sweep of the options the other languages provide for TextFormat, JSON, and Binary to double-check whether there are any other things we might want to support that would also reach this level of the Swift impl.

Performance wise, we could call this once within each generated visitor method, but it is still a lot of calls if the graph has a lot of sub messages, and the call can't be optimized since it is on a generic Visitor.

@seanliu1
Author

seanliu1 commented Apr 4, 2019

Let me know if I can help with anything.

@seanliu1
Author

seanliu1 commented May 9, 2019

I managed to make some changes based on my initial idea.

  1. I moved all default checks from codegen to the visitor. Currently I only added the default check to the JSON encoding visitor; if this is a viable approach, I will add the check to all visitors.
  2. I added a new JSONEncodingOptions option to include default values.
  3. I also changed codegen so that proto2 still works fine. It will still check for nil for sub-messages in proto3.

seanliu1#1

I also had a chance to look at how other languages handle this.

Python and Java both use descriptors to loop through all fields and insert default values into the dictionary or map.
The Python version does not output default values for singular message fields and oneof fields.

@thomasvl
Collaborator

thomasvl commented May 9, 2019

Taking a quick look, the visitor seems to have special cased zero/false as the default. That's only true for a message declared in a file with proto3 syntax. A message in proto2 syntax can have any default the .proto file author chooses, so I'm not sure your flow would work out correctly when a proto2 syntax message is used as a field of a proto3 syntax message.
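To make the proto2 concern concrete, here is a small sketch with a hypothetical field model (Proto2Int32Field is an illustration, not generated code). It models a proto2 field declared with a non-zero default and shows how a visitor-side "is it zero?" check gives the wrong answer in both directions.

```swift
// Hypothetical model of a proto2 field such as
//   optional int32 retries = 1 [default = 3];
struct Proto2Int32Field {
    var stored: Int32?            // nil means "not set"
    let declaredDefault: Int32

    // What the generated accessor would return.
    var value: Int32 { stored ?? declaredDefault }
    // Whether the current value equals the default declared in the .proto file.
    var isDefault: Bool { value == declaredDefault }
}

// What a visitor can do when it only receives the value and assumes proto3 rules.
func looksDefaultToZeroCheck(_ value: Int32) -> Bool {
    value == 0
}
```

An unset field reads as 3, which the zero check treats as a non-default value, while an explicitly stored 0 is treated as a default and would be dropped.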

@seanliu1
Author

seanliu1 commented May 9, 2019

Thanks for taking a look. I forgot that proto2 can have any default value.
I can think of a few workarounds:

  1. To give the visitor access to the declared default, pass it to all visiting methods. The signature changes to

mutating func visitSingularEnumField<E: Enum>(value: E, defaultValue: E, fieldNumber: Int) throws

  2. Change the signature of traverse to add includeDefault

func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V, includeDefault: Bool = false) throws

Only JSONEncodingVisitor will call
try traverse(visitor: &visitor, includeDefault: options.includeDefault)

Other call sites will remain the same.

  3. Change private let options: JSONEncodingOptions to internal let options: JSONEncodingOptions in JSONEncodingVisitor, so traverse can access it directly.

  4. As I mentioned before, add a new method to the Visitor protocol, func shouldIncludeDefault() -> Bool. The default implementation returns false, and JSONEncodingVisitor provides its own implementation. Here is a PR for it: [JSON Serialization] Support Default Value seanliu1/swift-protobuf#2

@seanliu1
Author

seanliu1 commented May 9, 2019

I also looked at other options provided by Java, Python, C++

JSON

  • preserveProtoName
  • printEnumsAsInts
  • includeDefault (most languages only handle primitive types)

It looks like all languages are on the same page; Swift just does not have includeDefault.

TextFormat

  • All support printUnknownFields

  • Python provides lots of formatting options, for example printing as one line (no newline characters) or using angle brackets instead of curly braces for nesting. But I do not think those changes need to touch this level; they can usually be added to the TextFormatVisitor alone.

  • C++ and Java can do PrintFieldValueToString for a given field. C++ will also print the default value.

Binary
Looks pretty similar across those languages.

@seanliu1
Author

seanliu1 commented May 9, 2019

I would love some feedback on seanliu1#2. Thanks!

@thomasvl
Collaborator

thomasvl commented May 9, 2019

Left some quick comments. The repeated calls to the visitor also have the potential to slow things down when most fields are the defaults. It might make sense to fetch the flag once and reuse it in the conditionals (you could also check the flag first, since when it is on you call the visit method whether or not the value is set).

But this approach goes back to @tbkka's earlier comments: there are behavior issues with old generated code against the new runtime (and new generated code against the old). So the question of a version change still has to be sorted out. And if there is a version change, then doing something more generic, like passing options into the generated visitor code instead of adding a method on the visitor, might make it easier to add others in the future (although the same interaction problems would likely exist).

seanliu1 changed the title from "json string with default value" to "JSON Serialization with default value" May 9, 2019
@seanliu1
Author

seanliu1 commented May 9, 2019

@thomasvl Thanks, I addressed your comment on repeated calls to the visitor; it is now called only once, at the beginning of traverse.
Since this involves an API change and version management, I will leave the PR open until the version bump is sorted out.

@tbkka
Contributor

tbkka commented May 9, 2019

I think I might have figured out part of how to solve the versioning issue. We could generate two different traverse methods, one that provides the old behavior for use with the old libraries, and one that supports the new behavior. This might look like:

  // Called by old libraries.  Never visits default values.
  func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V) throws {
    try traverse(visitor: &visitor, visitDefault: false)
  }

  // Called by new libraries.  The caller specifies whether default values are visited.
  func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V, visitDefault: Bool) throws {
    if self.showID != 0 || visitDefault {
      try visitor.visitSingularUInt32Field(value: self.showID, fieldNumber: 1)
    }
  }

The expected behavior: If you're using an old library or old generated code, defaults are not visited. Defaults can only be controlled if you have a new library and new generated code. The above solves three of the four cases:

  1. Old library, old generated code. This already works.
  2. Old library, new generated code. The old library calls traverse(visitor:) in new generated code. The result never visits default values, but that's the expected behavior when using the old library.
  3. New library, new generated code. The new library calls traverse(visitor:visitDefault:) in new generated code. That allows the new library code to select whether defaults are visited.

The remaining case is for the new library used with old generated code. This requires a way for the new library to call the old traverse(visitor:) method when the new traverse(visitor:visitDefault:) method isn't available. Is there a way to solve this with generics trickery or maybe adding a new protocol?

@thomasvl
Collaborator

Hm, so we'd have to generate a traverse shim in every message class, which is wasted code for all new library uses (and, if declared public, won't dead strip).

For your last case, could the updated library provide an extension to Message with a default impl of the new traverse method that calls the old? Or would that always get used for any generic operation on Messages directly?

@tbkka
Contributor

tbkka commented May 10, 2019

... shim in every message class, which on all new library uses is wasted code ...

At this moment, I'm happy if we can find any solution that works. I agree it would be nice to avoid this overhead. Certainly, that extra shim could be omitted the next time we change the major version.

... could the updated library provide an extension to Message with a default impl of the new traverse method that calls the old? Or would that always get used for any generic operation on Messages directly?

Good idea! I think that would do it. I think generic uses and protocol-typed variables will work correctly as long as we declare both traverse methods on the Message protocol.
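A sketch of how the combination discussed above might fit together, using hypothetical stand-ins (ShimVisitor, ShimMessage, OldGenerated, NewGenerated) rather than the real Message and Visitor protocols. Both traverse methods are protocol requirements, and the library supplies a default for the new one that forwards to the old one. Old generated code then silently ignores the flag, while new generated code honors it, including when called through a generic function.

```swift
protocol ShimVisitor {
    mutating func visitSingularUInt32Field(value: UInt32, fieldNumber: Int) throws
}

protocol ShimMessage {
    // Existing requirement, implemented by all generated code.
    func traverse<V: ShimVisitor>(visitor: inout V) throws
    // New requirement, implemented only by newly generated code.
    func traverse<V: ShimVisitor>(visitor: inout V, visitDefault: Bool) throws
}

extension ShimMessage {
    // Library-provided fallback for old generated code: ignore the flag.
    func traverse<V: ShimVisitor>(visitor: inout V, visitDefault: Bool) throws {
        try traverse(visitor: &visitor)
    }
}

// Shape of old generated code: only the one-argument method exists.
struct OldGenerated: ShimMessage {
    var showID: UInt32 = 0
    func traverse<V: ShimVisitor>(visitor: inout V) throws {
        if showID != 0 {
            try visitor.visitSingularUInt32Field(value: showID, fieldNumber: 1)
        }
    }
}

// Shape of new generated code: the old method is a shim over the new one.
struct NewGenerated: ShimMessage {
    var showID: UInt32 = 0
    func traverse<V: ShimVisitor>(visitor: inout V) throws {
        try traverse(visitor: &visitor, visitDefault: false)
    }
    func traverse<V: ShimVisitor>(visitor: inout V, visitDefault: Bool) throws {
        if showID != 0 || visitDefault {
            try visitor.visitSingularUInt32Field(value: showID, fieldNumber: 1)
        }
    }
}

struct FieldRecorder: ShimVisitor {
    var fieldNumbers: [Int] = []
    mutating func visitSingularUInt32Field(value: UInt32, fieldNumber: Int) throws {
        fieldNumbers.append(fieldNumber)
    }
}

// Generic caller, standing in for the new library's encoder.
func visitedFields<M: ShimMessage>(of message: M, visitDefault: Bool) throws -> [Int] {
    var recorder = FieldRecorder()
    try message.traverse(visitor: &recorder, visitDefault: visitDefault)
    return recorder.fieldNumbers
}
```

This only illustrates the dispatch behavior; it says nothing about the code size cost of the extra shim raised earlier.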

@CarlosNano

Any update on this for 2.0? Thanks

@thomasvl
Collaborator

Same as last check, no one has proposed a PR for this.

@CarlosNano

Thanks @thomasvl

I've seen @seanliu1 achieved something in here: seanliu1#2

Is there any documentation of how to do the setup to get default values in the JSON?

Thanks

@thomasvl
Collaborator

I've seen @seanliu1 achieved something in here: seanliu1#2

Is there any documentation of how to do the setup to get default values in the JSON?

The default values would be on the FieldDescriptor. I haven't looked at that change, but if they have something working, it would likely just need to be updated to the current main here, i.e. it should be basically what is needed.

@mrabiciu

mrabiciu commented Nov 4, 2023

Are we okay with making breaking changes to the Message type in a pr since we're working towards a 2.0 version? If so I can work on something that looks like traverse<V: SwiftProtobuf.Visitor>(visitor: inout V, options: [VisitorOptions])

@mrabiciu

mrabiciu commented Nov 4, 2023

Actually after playing with this a bit I think an options argument to the traverse method would fall into the same backwards compatibility pitfalls that we encountered in this thread. In order to future proof this I propose we define a type-erased VisitorNode type

public struct VisitorNode {

    private let _visit: (inout Visitor, VisitorOptions) throws -> Void
    
    internal func visit(visitor: inout Visitor, options: VisitorOptions) throws {
        try _visit(&visitor, options)
    }

    // Factories
    
    public static func singularFloat(value: Float, fieldNumber: Int, defaultValue: Float) -> Self {
        .init { visitor, options in
            if options.visitDefaults || value != defaultValue {
                try visitor.visitSingularFloatField(value: value, fieldNumber: fieldNumber)
            }
        }
    }
    
    public static func singularEnum<E: SwiftProtobuf.Enum>(value: E, fieldNumber: Int, defaultValue: E) -> Self {
        .init { visitor, options in
            if options.visitDefaults || value != defaultValue {
                try visitor.visitSingularEnumField(value: value, fieldNumber: fieldNumber)
            }
        }
    }
    
    public static func singularMessage<M: Message & Equatable>(value: M, fieldNumber: Int, defaultValue: M) -> Self {
        .init { visitor, options in
            if options.visitDefaults || value != defaultValue {
                try visitor.visitSingularMessageField(value: value, fieldNumber: fieldNumber)
            }
        }
    }
    // etc ...
}

And have the Message protocol provide visitor nodes instead of providing the traverse implementation.

extension SomeProto {
    var visitorNodes: [VisitorNode] {
        [
            .singularEnum(value: someEnum, fieldNumber: 0, defaultValue: .unspecified),
            .singularFloat(value: someFloat, fieldNumber: 1, defaultValue: 0),
            .singularMessage(value: someMessage, fieldNumber: 2, defaultValue: .init())
        ]
    }
}

This allows us to have a common shared traverse implementation

func traverse<V: Visitor>(visitor: inout V, options: VisitorOptions) throws {
    for node in visitorNodes {
        try node.visit(visitor: &visitor, options: options)
    }
}

This gives the library more flexibility to change the internals of the visitor pattern, regardless of which version of the library was used to generate the type conforming to Message.

I'm still pretty new to this project so I'm not sure if there are potential pitfalls with this approach. I'm not sure if what I'm proposing would trigger issues like this one

@thomasvl
Collaborator

thomasvl commented Nov 6, 2023

Are we okay with making breaking changes to the Message type in a pr since we're working towards a 2.0 version? If so I can work on something that looks like traverse<V: SwiftProtobuf.Visitor>(visitor: inout V, options: [VisitorOptions])

Yup, 2.0 is allowed breaking changes.

As to your other questions - I'll defer to the Apple folks on the general api changes. @tbkka @FranzBusch @Lukasa

I do worry a little about how much the compiler will be able to inline compared to the current code, and it also seems like your concern about stack size might be justified.

Having said that, I do sorta like the idea of only having to generate something more Array like that might hopefully be less actual code, reducing the overall size of the generated code.

@mrabiciu

mrabiciu commented Nov 6, 2023

Sounds good I'll put together a PR this week and we can discuss more there, thanks for the quick reply!

@tbkka
Contributor

tbkka commented Nov 6, 2023

func traverse<V: Visitor>(visitor: inout V, options: VisitorOptions) throws {
    for node in visitorNodes {
        try node.visit(visitor: &visitor, options: options)
    }
}

I developed the current API after performance-testing a lot of different alternatives. But I must confess, I never tried something with quite this shape, so I'm very curious to see how well it would work. Like Thomas, I'd be willing to accept a modest performance loss if this could significantly reduce code size. Certainly worth the experiment!

I suspect the first performance challenge will be to ensure the array of nodes is statically allocated just once for each message type.

@mrabiciu

mrabiciu commented Nov 10, 2023

I suspect the first performance challenge will be to ensure the array of nodes is statically allocated just once for each message type.

I'll have to think about this a bit. On the surface level static allocation doesn't really make sense since each node is capturing the value of a given field. Maybe if nodes capture key paths this could work 🤔

I'd be willing to accept a modest performance loss if this could significantly reduce code size.

Do you have any recommendations for ways I can capture the relative performance of my solution? Are there some good sample protos I could use or some benchmark test cases I could run?

@tbkka
Contributor

tbkka commented Nov 10, 2023

On the surface level static allocation doesn't really make sense since each node is capturing the value of a given field. Maybe if nodes capture key paths this could work 🤔

Oh. I missed that. Hmmm.... That means this array is getting built every time you try to walk the structure? That's going to be hard for binary serialization, in particular, since we actually walk the structure several times (to compute size information that's used to allocate the necessary buffers). And building this array will require a bunch of allocations: a couple for the array itself, and likely one more for each captured existential.

Key paths would be worth experimenting with, I think.

Do you have any recommendations for ways I can capture the relative performance of my solution? Are there some good sample protos I could use or some benchmark test cases I could run?

We have a Performance suite that Tony Allevato wrote and which I've relied on a lot in the past. It's a collection of shell scripts that can generate standard protos and then compiles and runs benchmarks for both C++ and Swift. Most importantly, it makes some very pretty graphs of the results. 😁 Pretty handy, though the shell scripting for compiling the C++ benchmarks keeps breaking as the protobuf folks keep churning their build system.

(Note: A few years back, Thomas suggested we switch the binary encoder to serialize from back-to-front in the buffer; that would only require a single walk to size the entire output, then a walk to actually write the data into the buffer. I keep meaning to set aside time to try implementing this approach, which would also address a more serious performance cliff we currently have in the binary encoder.)
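For readers unfamiliar with the back-to-front idea in the note above, here is a rough, self-contained sketch. ReverseWriter is a hypothetical illustration, not the SwiftProtobuf binary encoder. It assumes the total size is already known from one sizing walk, fills the buffer from the end, and writes each length prefix after its payload, so nested message sizes never have to be precomputed separately.

```swift
// Hypothetical writer that fills a fixed-size buffer from the end toward the front.
struct ReverseWriter {
    private var storage: [UInt8]
    private var start: Int          // index of the first byte written so far

    init(capacity: Int) {
        storage = [UInt8](repeating: 0, count: capacity)
        start = capacity
    }

    var count: Int { storage.count - start }
    var bytes: [UInt8] { Array(storage[start...]) }

    mutating func prepend(_ newBytes: [UInt8]) {
        precondition(newBytes.count <= start, "buffer too small")
        start -= newBytes.count
        storage.replaceSubrange(start..<(start + newBytes.count), with: newBytes)
    }

    // Standard protobuf base-128 varint, placed in front of existing content.
    mutating func prependVarint(_ value: UInt64) {
        var remaining = value
        var encoded: [UInt8] = []
        while remaining >= 0x80 {
            encoded.append(UInt8(remaining & 0x7F) | 0x80)
            remaining >>= 7
        }
        encoded.append(UInt8(remaining))
        prepend(encoded)
    }

    // Writes the payload first, then its length and tag (wire type 2).
    mutating func prependLengthDelimited(fieldNumber: Int,
                                         body: (inout ReverseWriter) -> Void) {
        let countBefore = count
        body(&self)
        prependVarint(UInt64(count - countBefore))
        prependVarint(UInt64((fieldNumber << 3) | 2))
    }
}

// Field 2 holds a nested message whose field 1 is the string "hi".
func encodeNestedExample() -> [UInt8] {
    var writer = ReverseWriter(capacity: 16)
    writer.prependLengthDelimited(fieldNumber: 2) { outer in
        outer.prependLengthDelimited(fieldNumber: 1) { inner in
            inner.prepend(Array("hi".utf8))
        }
    }
    return writer.bytes
}
```

A real encoder would also have to emit fields in reverse order so the final bytes come out in field-number order; that detail is left out here.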

@mrabiciu

I played around with keypaths and I think they should work! I've got a rough prototype and have been able to make this work for every field type.

Here is the general shape I've got so far

struct FieldNode<M: Message> {
    private let fieldNumber: Int
    private let subtype: Subtype
    
    private enum Subtype {
        case singularFloat(keypath: KeyPath<M, Float>, defaultValue: Float)
        // ... 
    }
    
    func traverse<V: Visitor>(message: M, using visitor: inout V, options: VisitorOptions) throws {
        switch subtype {
        case .singularFloat(let keypath, let defaultValue):
            let value = message[keyPath: keypath]
            if options.contains(.visitDefaults) || value != defaultValue {
                try visitor.visitSingularFloatField(value: value, fieldNumber: fieldNumber)
            }
        // ...
        }
    }
    
    // MARK: - Factories
    
    public static func singularFloat(_ keyPath: KeyPath<M, Float>, fieldNumber: Int, defaultValue: Float) -> Self {
        Self(fieldNumber: fieldNumber, subtype: .singularFloat(keypath: keyPath, defaultValue: defaultValue))
    }
    // ...
}

extension FooType: Message {    
    static let fieldNodes: [FieldNode<Self>] = [
        .singularFloat(\.someFloat, fieldNumber: 1, defaultValue: 0)
    ]
}

Message/Enum/Map fields are a little trickier but I do have something working for those too.

Accessing the fields through a keypath doesn't seem to have a significant performance impact compared to accessing fields directly.
I think Swift switch statements over enums are O(1), but I need to confirm this, as that could be a performance bottleneck.

If we push this concept a little further we could actually use this to get rid of _protobuf_nameMap and potentially save even more space by consolidating everything into fieldNodes by including the naming information in the field node with an API like this.

static let fieldNodes: [Int: FieldNode<Self>] = [
    1: .singularFloat(\.someFloat, defaultValue: 0, name: .same("some_float"))
]

We'd probably need to use an ordered dictionary though, since I assume it's important that the visitor visits fields in order and we wouldn't want to do any sorting at runtime.

@tbkka
Contributor

tbkka commented Nov 10, 2023

This is very promising! We'd need to see some measurements (performance and code size) to see how to best take advantage of this, of course. For example, if it's a big code size win but also a significant performance regression, then we might need to give people the option of generating either old-style (fast) or new-style (compact) output. But we'd have to see some actual numbers to inform any such planning. Of course, we all hope that🤞 it's a code size win without being a performance regression -- that would be truly wonderful!

This might also give us a way to drop the generated Equatable conformances, which is another source of code bloat; walking this list of field info would let us compare two messages with a general iterator instead of bespoke code.

@tbkka
Contributor

tbkka commented Nov 10, 2023

You might not need a dictionary to replace _protobuf_nameMap, actually. That name map is a dictionary today because the current traversal provides a field number that needs to get translated. If the encoding is being driven by an array of field nodes, then those nodes would give both field number and name, so you would not need a translation per se. And arrays are more compact and faster to traverse than dictionaries. Hmmm.... There is the question of decoding, though. Hmmm...

@mrabiciu

Yes exactly, you'd still need the ability to do a lookup for decoding purposes, from what I can tell.
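One possible shape for "array for encoding, lookup for decoding", sketched with hypothetical types (FieldInfo and FieldTable are not part of SwiftProtobuf). It assumes the generated array is already in field-number order, so encoding just walks it, while decoding uses a name index built once from the same array.

```swift
// Hypothetical per-field metadata carried by each node.
struct FieldInfo {
    let number: Int
    let jsonName: String
}

struct FieldTable {
    // Assumed to be emitted by the generator in field-number order.
    let fields: [FieldInfo]
    // Built once; maps a JSON name to an index into `fields` for decoding.
    private let indexByName: [String: Int]

    init(_ orderedFields: [FieldInfo]) {
        var index: [String: Int] = [:]
        for (position, field) in orderedFields.enumerated() {
            index[field.jsonName] = position
        }
        fields = orderedFields
        indexByName = index
    }

    // Decoding path: look a field up by the name found in the input.
    func field(named name: String) -> FieldInfo? {
        guard let position = indexByName[name] else { return nil }
        return fields[position]
    }
}

let exampleTable = FieldTable([
    FieldInfo(number: 1, jsonName: "some_float"),
    FieldInfo(number: 2, jsonName: "some_int32"),
])
```

Encoding iterates `exampleTable.fields` directly and never needs a number-to-name translation; only decoding pays for the dictionary.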

@mrabiciu

Next step for me is to get this code out of my custom plugin I've written and into a fork of swift-protobuf and get started with some performance testing. I'm not looking forward to having to re-generate all the pb.swift files in this repo 😅

I'll report back my findings when I have some 😊, thank you for your feedback so far!

@thomasvl
Collaborator

One catch for visit is the intermixing of extension ranges. Not sure if you want markers for that in the list of fields, or if we capture the info in another list and then walk the two in parallel to mix the data? (unknown fields just get put at the end, not in order).

As far as regeneration goes, once the generator is updated, the Makefile has targets to do that.

@thomasvl
Collaborator

Actually, we also need to double check how other languages do this option with respect to field presence. i.e. - if the field has presence, does the flag actually do something, or is the flag only honored when the field doesn't have presence?

@thomasvl
Collaborator

One other thought - you might be able to get some generated code size savings by doubling the Subtype cases, with a version of each with and without a default value. i.e. - if the default value is zero/empty string/empty bytes, use the case without a default and just make the interfacing code deal with it accordingly. Since zero is by far the most common case, it can shrink things a fair amount.
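A small sketch of the "doubled case" suggestion, with hypothetical names (DemoFields, Int32FieldKind). It uses key paths as in the earlier prototype and covers a single field type only. The zero-default case stores no default value, so the common case carries less data in the generated table.

```swift
// Hypothetical message with one zero-default and one custom-default field.
struct DemoFields {
    var count: Int32 = 0
    var retries: Int32 = 3
}

enum Int32FieldKind {
    // Common case: default is zero, so nothing extra is stored.
    case zeroDefault(KeyPath<DemoFields, Int32>)
    // Rare case: a non-zero default declared in the .proto file.
    case customDefault(KeyPath<DemoFields, Int32>, defaultValue: Int32)

    // Returns the value to visit, or nil when the field holds its default.
    func valueIfNotDefault(in message: DemoFields) -> Int32? {
        switch self {
        case .zeroDefault(let keyPath):
            let value = message[keyPath: keyPath]
            return value == 0 ? nil : value
        case .customDefault(let keyPath, let defaultValue):
            let value = message[keyPath: keyPath]
            return value == defaultValue ? nil : value
        }
    }
}

let countField = Int32FieldKind.zeroDefault(\DemoFields.count)
let retriesField = Int32FieldKind.customDefault(\DemoFields.retries, defaultValue: 3)
```

Whether this actually shrinks the binary would need measuring, since each extra case also adds a branch to the shared traversal code.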

@mrabiciu

I did a bunch of performance experimentation over the weekend and here are some findings:

  • protocol dispatch appears to perform slightly faster than switch statements over enums with associated types. (I was pretty surprised by this)
  • Closures appear to be faster than keypaths

With these in mind this is what I've pivoted to:

internal protocol Field<M> {
    associatedtype M: Message
    func traverse<V: Visitor>(message: M, visitor: inout V) throws
}

fileprivate struct SingularInt32Field<M: Message>: Field {
    private let fieldNumber: Int
    private let getValue: (M) -> Int32
    
    func traverse<V: Visitor>(message: M, visitor: inout V) throws {
        try visitor.visitSingularInt32Field(value: getValue(message), fieldNumber: fieldNumber)
    }
}

public struct FieldNode<M: Message> {
    private let field: any Field<M>
    private let isDefault: (M) -> Bool
    
    internal func traverse<V: Visitor>(message: M, using visitor: inout V) throws {
        if !isDefault(message) {
            try field.traverse(message: message, visitor: &visitor)
        }
    }
    
    public static func singularInt32(_ getValue: @escaping (M) -> Int32, fieldNumber: Int, defaultValue: Int32 = 0) -> Self {
        Self(field: SingularInt32Field(fieldNumber: fieldNumber, getValue: getValue), isDefault: { getValue($0) == defaultValue })
    }
}

extension Message {
    public func traverse<V: Visitor>(visitor: inout V) throws {
        for node in Self.fieldNodes {
            try node.traverse(message: self, using: &visitor)
        }
    }
}

Generated code:

extension SomeProto {
    static let fieldNodes: [FieldNode<Self>] = [
        .singularInt32({ $0.someInt32 }, fieldNumber: 1),
    ]
}

Performance

I've been measuring performance by generating a proto with one of every kind of field and encoding it to both binary and json formats, in a cli I built using the release configuration. This isn't that scientific but it gives us a ballpark estimation of the performance loss

Method                                 Performance
Binary encode, all fields are unset    ~6x slower
Binary encode, all fields are set      ~1.3x slower
JSON encode, all fields are unset      ~1.8x slower
JSON encode, all fields are set        no difference

I'm still trying to optimize the "binary encode, all fields are unset" case and I'm open to suggestions. It's hard to compete with the status quo here: today traverse is an inlined function that effectively no-ops, while my proposed implementation still needs to iterate over the array of nodes and dispatch some calls.

@thomasvl
Collaborator

Are those initial numbers from a debug or a release build? And how does performance compare in the other configuration, i.e. how much slower is debug and how much slower is release?

@mrabiciu

Those measurements are taken in a release build.

Here are the same measurements in a debug build:

Method                                 Performance
Binary encode, all fields are unset    ~7x slower
Binary encode, all fields are set      ~1.2x slower
JSON encode, all fields are unset      ~2.7x slower
JSON encode, all fields are set        no difference

@mrabiciu

Update:
I had a mistake in my logic that inverted the check to visit nested messages

Here is a more accurate performance measurement:

Release

Method                                 Performance
Binary encode, all fields are unset    ~1.5x - 2x slower
Binary encode, all fields are set      ~1x - 1.2x slower
JSON encode, all fields are unset      ~1x - 1.3x slower
JSON encode, all fields are set        no difference

Debug

Method                                 Performance
Binary encode, all fields are unset    ~3x slower
Binary encode, all fields are set      ~1.2x slower
JSON encode, all fields are unset      ~2x slower
JSON encode, all fields are set        ~1.1x slower

I think this is a reasonable point to pause runtime performance optimization, especially since if we put name information into Field we can probably make JSON encoding faster than the status quo.

I'm going to shift my focus to measuring the bundle size impact and seeing how much that can be optimized.

@mrabiciu

mrabiciu commented Nov 16, 2023

I did some testing last night on the impact of my approach on the size of the binary, and unfortunately this approach increases the binary size by about 10% rather than decreasing it. I think this is happening because Field<M> and all the FieldItem<M> types are being reified for every message type, resulting in lots of symbols. I'm going to try to see if a less safe but type-erased version of Field can work.

@mrabiciu

Do you have any advice for measuring bundle size impact? I've tried a few things now and I'm getting either inconsistent or unexpected results. For example I experimented with dropping the _ProtoNameProviding conformance from message generation and saw no impact on my binary size which doesn't make sense to me.

So far what I've been doing is creating a macOS CLI that depends on my fork of SwiftProtobuf and generating 100 messages with 50 fields each that I embed in the CLI. Then I archive that and look at the resulting binary size.

@mrabiciu

Here is a PR with what I've been working on #1504

@antongrbin

antongrbin commented Nov 20, 2023

Actually, we also need to double check how other languages do this option with respect to field presence. i.e. - if the field has presence, does the flag actually do something, or is the flag only honored when the field doesn't have presence?

Based on my understanding of the spec, this flag should be ignored for fields with presence.

When generating JSON-encoded output from a protocol buffer, if a protobuf field has the default value and if the field doesn’t support field presence, it will be omitted from the output by default. An implementation may provide options to include fields with default values in the output.

These two implementations use has presence explicitly when checking the value of the flag:

I believe these implementations are equivalent, but it's harder to read this out (proto3 optional is implemented as oneof):
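The rule quoted from the spec above can be sketched as a single predicate. FieldState and shouldEmit are hypothetical names used only for illustration; the logic follows the reading given in this comment, namely that the flag is ignored for fields with presence and only affects fields without presence.

```swift
// Hypothetical summary of one field's state at encoding time.
struct FieldState {
    let hasPresence: Bool      // e.g. a proto3 optional field or a oneof member
    let isSet: Bool            // only meaningful when hasPresence is true
    let isDefaultValue: Bool   // current value equals the field's default
}

// Should the JSON encoder emit this field?
func shouldEmit(_ field: FieldState, includeDefaults: Bool) -> Bool {
    if field.hasPresence {
        // Presence decides; the flag has no effect here.
        return field.isSet
    }
    // Without presence, defaults are omitted unless the flag asks for them.
    return includeDefaults || !field.isDefaultValue
}
```

For example, an unset field with presence stays out of the output even with the flag on, while a zero-valued proto3 scalar without presence appears only when the flag is on.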
