Add RemainingBytes utility to streaming decoder #397

immesys · 2023-04-26T20:08:48Z

Description

In many of our use cases, we have a CBOR header in front of content that is not CBOR (e.g. ciphertext). It is convenient to be able to parse the "first item" using CBOR without having additional framing (such as an out-of-band length prefix in front of the CBOR header). This functionality exists in some other serialization libraries, usually in the form of a "remaining bytes" return value. E.g.

// From "encoding/pem"
// Decode will find the next PEM formatted block (certificate, private key
// etc) in the input. It returns that block and the remainder of the input. If
// no PEM data is found, p is nil and the whole of the input is returned in
// rest.
func Decode(data []byte) (p *Block, rest []byte)

// From "encoding/asn1"
// [...] After parsing b, any bytes that were leftover and not used to fill
// val will be returned in rest. [...]
func Unmarshal(b []byte, val any) (rest []byte, err error)

A recent PR (#380) clarifies that Unmarshal and Valid are really intended for just one object, so the above APIs are not quite a good fit. Instead, I propose that we add a function to the Decoder that lets you "stop early" and return a reader that points to the remaining bytes. If there is nothing buffered, that's just the upstream reader. But in the case where we have some buffered data we read ahead, then we need to prepend that onto the resulting stream.

The implementation was relatively minor, so was I skipped proposing in an issue, but please feel free to request that the problem is solved in a different way and I will happily reimplement.

Checklist (for code PR only, ignore for docs PR)

Include unit tests that cover the new code
Pass all unit tests
Pass all 18 ci linters (golint, gosec, staticcheck, etc.)
Sign each commit with your real name and email.
Last line of each commit message should be in this format:
Signed-off-by: Firstname Lastname [email protected]
Certify the Developer's Certificate of Origin 1.1
(see next section).

Certify the Developer's Certificate of Origin 1.1

By marking this item as completed, I certify
the Developer Certificate of Origin 1.1.

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
660 York Street, Suite 102,
San Francisco, CA 94110 USA

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Signed-off-by: Michael Andersen <[email protected]>

fxamacker · 2023-04-27T13:48:33Z

Hi @immesys Thanks! I'll take a look this weekend.

fxamacker

Thanks for opening this PR! I really appreciate the helpful notes you provided here and also in code comments.

Although the need for RemainingBytes() makes sense, use of returned io.Reader from RemainingBytes() can invalidate Decoder, which can lead to undefined behavior as you already mentioned in your comments.

Maybe it would be simpler to add Decoder.Buffered() like encoding/json.

// Buffered returns a reader of the data remaining in the Decoder's buffer.
// The reader is valid until the next call to Decode. 
func (dec *Decoder) Buffered() io.Reader

Adding Decoder.Buffered() enables users to implement functions like RemainingBytes().

Thoughts?

immesys · 2023-05-01T21:04:34Z

Hmm. I think you are right that we should not make it easy for someone to accidentally shoot themselves in the foot. I think adding Buffered() would allow the caller to solve the problem, but it would be a little more complex on their side.

There are probably two solutions we can pick from:

Change RemainingBytes to explicitly invalidate the decoder (removing the buffer and reader), so the caller "takes ownership" of the reader. We might want to pick a better name than RemainingBytes in that case as we are effectively "closing" the decoder. A subsequent call to Decode will return an error, rather than undefined behavior
Implement Buffered(), as you suggest, which would return just the data that has been read but not consumed, as a slice or a reader.

I think I am more in favor of the first option. In the second option, we would expect the caller to make their own MultiReader that prepends the result of Buffered onto the upstream reader, but it feels odd to make the caller use the reader they passed to the decoder again. The decoder still "owns" it and the caller would still get undefined behavior if they call decode again after using the upstream reader. This time it is as a result of their own code, so I agree it is better than the original proposal, but still doesn't feel very clean.

How would you feel if I changed the RemainingBytes to "close" the decoder, so that you can't get undefined behavior? If you are ok with it, do you have a preference for what you would want to call that function? I would propose Close() except the signature wouldn't match that of io.Closer, so something like End() or Stop() might be more friendly.

fxamacker · 2023-05-07T18:07:11Z

I'd like to keep this PR open (as option 1) for more consideration after v2.5.0 release.

I agree with your reasoning for option 1 (and would like to give it more consideration) even though I prefer option 2 at this time.

Each new function needs to be supported, maintained, and backwards compatible. So, more time and long-term consideration is required for alternative APIs that don't match encoding/json, especially if it closes or changes ownership. Adding MarshalFirst() in PR #398 (thanks!) was needed because it fills a gap created by fixes (in v2.5.0-beta) to error handling in Marshal for ExtraneousDataError.

If you'd like to open a new PR to add Buffered(), then I can probably add it to v2.5.0 (if time allows). Adding Buffered() wouldn't close the door on option 1.

For v2.5.0 release, I need to first finish reviewing and merge PR #386.

Please let me know if this works for you or if I've missed anything. Really appreciate your insights!

immesys · 2023-05-25T00:16:07Z

I am back from vacation, and ready to pick this back up. Sorry for the delayed response.

I'm happy to implement either option (or another alternative), whichever you feel comfortable with. I understand that picking up functionality is not "free" in that it must be maintained going forward, so I will defer to your decision in which API you think is the best when weighing how easy/safe it is to use, how maintainable it is, and how familiar it might be to users of other encoding libraries.

It sounds like implementing Buffered() is your preference. Do you think returning a slice or a reader makes the most sense? Json returns a reader, but I believe our internal buffer is a slice that would be safe to return as long as the caller does not call decode again. I think my preference would be a reader.

FWIW I took a look at a few other libraries that I consider "similar" in their IO pattern (xml, gzip, json, vmihailenco/msgpack and tar) and this pattern of returning a reader for the remaining bytes (as opposed to just the buffered bytes) isn't present anywhere else. Thus I think you are probably right to avoid making up a new pattern that users won't find familiar. What some of them do (xml, gzip) is detect if the upstream reader supports io.ByteReader and if it does, they don't read ahead. Msgpack has the same behavior but requires io.ByteScanner. Thus, for users that care about content after the decoded items, they can buffer it themselves, which implements ByteReader and ByteScanner. That way the upstream reader is always safe to use directly for any "remaining bytes" use cases. That sounds like it would impact performance to me, but I haven't tested.

Buffered() matches json and msgpack, and as you note, does not preclude a different solution. Perhaps I can implement that, and also do some testing to see how expensive the "if reader is a ByteReader don't read-ahead" ends up being. I'll come back with my results and we can see if it's worth doing future work on it or advising users to use Buffered()?

Thank you

fxamacker · 2023-05-27T16:22:35Z

Hi @immesys, welcome back! Hope you had a nice vacation!

Thank you for looking into this and your insights. 👍

Do you think returning a slice or a reader makes the most sense? Json returns a reader, but I believe our internal buffer is a slice that would be safe to return as long as the caller does not call decode again. I think my preference would be a reader.

Yeah, my preference is also io.reader because it would prevent modification to underlying data. Having same signature as encoding/json is a nice added bonus.

func (*Decoder) Buffered() io.Reader

Perhaps I can implement that, and also do some testing to see how expensive the "if reader is a ByteReader don't read-ahead" ends up being. I'll come back with my results and we can see if it's worth doing future work on it or advising users to use Buffered()?

That sounds great! I think Buffered() is useful and would like to include it in v2.5.0 release if possible. I am also curious about ByteReader and ByteScanner approach and its impact. Thanks again for looking into this.

fxamacker · 2023-06-25T23:18:03Z

Hi @immesys, I went ahead and implemented func (*Decoder) Buffered() io.Reader in PR #412 which basically matches encoding/json.

PR #412 doesn't preclude a different solution and it can be useful. Thanks again for your insights and discussions!

immesys · 2023-06-25T23:51:02Z

Closing this for now (#412 solves the immediate problem).

immesys force-pushed the immesys/remaining-bytes branch 2 times, most recently from 3bfccb1 to 6569348 Compare April 26, 2023 20:19

Add RemainingBytes utility to streaming decoder

80b2495

Signed-off-by: Michael Andersen <[email protected]>

immesys force-pushed the immesys/remaining-bytes branch from 6569348 to 80b2495 Compare April 26, 2023 20:20

immesys mentioned this pull request Apr 28, 2023

Add UnmarshalFirst #398

Merged

6 tasks

fxamacker reviewed Apr 30, 2023

View reviewed changes

fxamacker mentioned this pull request Jun 25, 2023

Add Decoder.Buffered to return buffered data #412

Merged

immesys closed this Jun 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add RemainingBytes utility to streaming decoder #397

Add RemainingBytes utility to streaming decoder #397

immesys commented Apr 26, 2023 •

edited

Loading

fxamacker commented Apr 27, 2023

fxamacker left a comment •

edited

Loading

immesys commented May 1, 2023

fxamacker commented May 7, 2023

immesys commented May 25, 2023

fxamacker commented May 27, 2023

fxamacker commented Jun 25, 2023

immesys commented Jun 25, 2023

Add RemainingBytes utility to streaming decoder #397

Add RemainingBytes utility to streaming decoder #397

Conversation

immesys commented Apr 26, 2023 • edited Loading

Description

Checklist (for code PR only, ignore for docs PR)

Certify the Developer's Certificate of Origin 1.1

fxamacker commented Apr 27, 2023

fxamacker left a comment • edited Loading

Choose a reason for hiding this comment

immesys commented May 1, 2023

fxamacker commented May 7, 2023

immesys commented May 25, 2023

fxamacker commented May 27, 2023

fxamacker commented Jun 25, 2023

immesys commented Jun 25, 2023

immesys commented Apr 26, 2023 •

edited

Loading

fxamacker left a comment •

edited

Loading