
Add RemainingBytes utility to streaming decoder #397

Closed
immesys wants to merge 1 commit from the immesys/remaining-bytes branch

Conversation

@immesys
Contributor

immesys commented Apr 26, 2023

Description

In many of our use cases, we have a CBOR header in front of content that is not CBOR (e.g. ciphertext). It is convenient to be able to parse the "first item" using CBOR without having additional framing (such as an out-of-band length prefix in front of the CBOR header). This functionality exists in some other serialization libraries, usually in the form of a "remaining bytes" return value. E.g.

// From "encoding/pem"
// Decode will find the next PEM formatted block (certificate, private key
// etc) in the input. It returns that block and the remainder of the input. If
// no PEM data is found, p is nil and the whole of the input is returned in
// rest.
func Decode(data []byte) (p *Block, rest []byte)

// From "encoding/asn1"
// [...] After parsing b, any bytes that were leftover and not used to fill
// val will be returned in rest. [...]
func Unmarshal(b []byte, val any) (rest []byte, err error)

A recent PR (#380) clarifies that Unmarshal and Valid are really intended for just one object, so the above APIs are not quite the right fit. Instead, I propose that we add a function to the Decoder that lets you "stop early" and returns a reader over the remaining bytes. If nothing is buffered, that's just the upstream reader; but if we have read ahead and buffered some data, we need to prepend that data onto the resulting stream.
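To make the intended usage concrete, here is a rough sketch (RemainingBytes is the function proposed here; Header and the surrounding names are placeholders, and the exact signature is open for discussion):

dec := cbor.NewDecoder(r)

// Decode only the first CBOR data item (e.g. a header struct).
var hdr Header
if err := dec.Decode(&hdr); err != nil {
    return err
}

// Everything after the first item: any bytes the decoder buffered while
// reading ahead, followed by the unread portion of the upstream reader.
ciphertext, err := io.ReadAll(dec.RemainingBytes())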

The implementation was relatively minor, so I skipped proposing it in an issue first, but please feel free to request that the problem be solved in a different way and I will happily reimplement.

Checklist (for code PR only, ignore for docs PR)

  • Include unit tests that cover the new code
  • Pass all unit tests
  • Pass all 18 ci linters (golint, gosec, staticcheck, etc.)
  • Sign each commit with your real name and email.
    Last line of each commit message should be in this format:
    Signed-off-by: Firstname Lastname [email protected]
  • Certify the Developer's Certificate of Origin 1.1
    (see next section).

Certify the Developer's Certificate of Origin 1.1

  • By marking this item as completed, I certify
    the Developer Certificate of Origin 1.1.
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
660 York Street, Suite 102,
San Francisco, CA 94110 USA

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@immesys force-pushed the immesys/remaining-bytes branch 2 times, most recently from 3bfccb1 to 6569348 on April 26, 2023 20:19
@fxamacker
Owner

Hi @immesys Thanks! I'll take a look this weekend.

@immesys mentioned this pull request Apr 28, 2023
@fxamacker
Owner

Thanks for opening this PR! I really appreciate the helpful notes you provided here and also in code comments.

Although the need for RemainingBytes() makes sense, using the io.Reader returned by RemainingBytes() can invalidate the Decoder, which can lead to undefined behavior, as you already mentioned in your code comments.

Maybe it would be simpler to add Decoder.Buffered() like encoding/json.

// Buffered returns a reader of the data remaining in the Decoder's buffer.
// The reader is valid until the next call to Decode. 
func (dec *Decoder) Buffered() io.Reader

Adding Decoder.Buffered() enables users to implement functions like RemainingBytes().
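For example, a caller who kept a reference to the reader r they passed to NewDecoder could reconstruct the remaining stream themselves (sketch; only valid if they never call Decode again afterwards):

// Buffered bytes first, then whatever the upstream reader has not yet produced.
rest := io.MultiReader(dec.Buffered(), r)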

Thoughts?

@immesys
Contributor Author

immesys commented May 1, 2023

Hmm. I think you are right that we should not make it easy for someone to accidentally shoot themselves in the foot. I think adding Buffered() would allow the caller to solve the problem, but it would be a little more complex on their side.

There are probably two solutions we can pick from:

  • Change RemainingBytes to explicitly invalidate the decoder (clearing its buffer and dropping the reader), so the caller "takes ownership" of the remaining stream. We might want a better name than RemainingBytes in that case, since we are effectively "closing" the decoder; a subsequent call to Decode would return an error rather than exhibiting undefined behavior.
  • Implement Buffered(), as you suggest, which would return just the data that has been read but not yet consumed, as a slice or a reader.

I think I am more in favor of the first option. In the second option, we would expect the caller to build their own MultiReader that prepends the result of Buffered() onto the upstream reader, but it feels odd to make the caller touch the reader they passed to the decoder again. The decoder still "owns" it, and the caller would still get undefined behavior if they call Decode again after reading from the upstream reader. Since that would be the result of their own code, I agree it is better than the original proposal, but it still doesn't feel very clean.

How would you feel if I changed RemainingBytes to "close" the decoder, so that you can't get undefined behavior? If you are OK with that, do you have a preference for what to call the function? I would propose Close(), except the signature wouldn't match io.Closer, so something like End() or Stop() might be friendlier.
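Roughly, option 1 would look something like this (End is only a placeholder name, and the comment sketches intended semantics rather than a settled API):

// End returns a reader over the remaining input (the decoder's buffered
// bytes followed by the unread upstream data) and invalidates the decoder;
// any subsequent call to Decode returns an error.
func (dec *Decoder) End() io.Reader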

@fxamacker
Owner

I'd like to keep this PR open (as option 1) for more consideration after v2.5.0 release.

I agree with your reasoning for option 1 (and would like to give it more consideration) even though I prefer option 2 at this time.

Each new function needs to be supported, maintained, and kept backwards compatible. So more time and long-term consideration is required for alternative APIs that don't match encoding/json, especially ones that close the decoder or change ownership. Adding UnmarshalFirst() in PR #398 (thanks!) was needed because it fills a gap created by fixes (in v2.5.0-beta) to error handling in Unmarshal for ExtraneousDataError.

If you'd like to open a new PR to add Buffered(), then I can probably add it to v2.5.0 (if time allows). Adding Buffered() wouldn't close the door on option 1.

For the v2.5.0 release, I first need to finish reviewing and merging PR #386.

Please let me know if this works for you or if I've missed anything. Really appreciate your insights!

@immesys
Contributor Author

immesys commented May 25, 2023

I am back from vacation, and ready to pick this back up. Sorry for the delayed response.

I'm happy to implement either option (or another alternative), whichever you feel comfortable with. I understand that picking up functionality is not "free" in that it must be maintained going forward, so I will defer to you on which API is best when weighing how easy and safe it is to use, how maintainable it is, and how familiar it might be to users of other encoding libraries.

It sounds like implementing Buffered() is your preference. Do you think returning a slice or a reader makes the most sense? encoding/json returns a reader, but I believe our internal buffer is a slice that would be safe to return as long as the caller does not call Decode again. My preference would be a reader.

FWIW, I took a look at a few other libraries that I consider "similar" in their IO pattern (xml, gzip, json, vmihailenco/msgpack, and tar), and this pattern of returning a reader for the remaining bytes (as opposed to just the buffered bytes) isn't present anywhere else, so I think you are probably right to avoid inventing a new pattern that users won't find familiar. What some of them do (xml, gzip) is detect whether the upstream reader implements io.ByteReader, and if it does, they don't read ahead. Msgpack has the same behavior but requires io.ByteScanner. Users who care about content after the decoded items can therefore wrap the upstream reader in a bufio.Reader themselves (which implements both ByteReader and ByteScanner), and the reader they pass in is then always safe to use directly for any "remaining bytes" use cases. That sounds like it would impact performance, but I haven't tested it.
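Roughly, the detection those packages do looks like this (a sketch using bufio and io, not actual code from any of them; byteReadReader and inputFor are made-up names):

// If the caller's reader can already serve bytes one at a time, use it
// directly and never read ahead; otherwise wrap it in a decoder-owned
// buffer, and that read-ahead is what makes "remaining bytes" ambiguous.
type byteReadReader interface {
    io.Reader
    io.ByteReader
}

func inputFor(r io.Reader) byteReadReader {
    if br, ok := r.(byteReadReader); ok {
        return br
    }
    return bufio.NewReader(r)
}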

Buffered() matches json and msgpack and, as you note, does not preclude a different solution. Perhaps I can implement that, and also do some testing to see how expensive the "if the reader is a ByteReader, don't read ahead" approach ends up being. I'll come back with my results and we can decide whether it's worth doing further work on it or simply advising users to use Buffered().

Thank you

@fxamacker
Owner

Hi @immesys, welcome back! Hope you had a nice vacation!

Thank you for looking into this and your insights. 👍

Do you think returning a slice or a reader makes the most sense? Json returns a reader, but I believe our internal buffer is a slice that would be safe to return as long as the caller does not call decode again. I think my preference would be a reader.

Yeah, my preference is also io.Reader because it would prevent modification of the underlying data. Having the same signature as encoding/json is a nice added bonus.

func (*Decoder) Buffered() io.Reader

Perhaps I can implement that, and also do some testing to see how expensive the "if reader is a ByteReader don't read-ahead" ends up being. I'll come back with my results and we can see if it's worth doing future work on it or advising users to use Buffered()?

That sounds great! I think Buffered() is useful and would like to include it in v2.5.0 release if possible. I am also curious about ByteReader and ByteScanner approach and its impact. Thanks again for looking into this.

@fxamacker
Owner

Hi @immesys, I went ahead and implemented func (*Decoder) Buffered() io.Reader in PR #412 which basically matches encoding/json.
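For reference, the ciphertext use case from this PR's description can be written with Buffered() roughly like this (sketch; r is the reader originally passed to NewDecoder, Header is a placeholder type, and Decode must not be called again afterwards):

dec := cbor.NewDecoder(r)
var hdr Header
if err := dec.Decode(&hdr); err != nil {
    return err
}
// Remaining stream = decoder's buffered bytes + unread upstream bytes.
ciphertext, err := io.ReadAll(io.MultiReader(dec.Buffered(), r))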

PR #412 doesn't preclude a different solution and it can be useful. Thanks again for your insights and discussions!

@immesys
Contributor Author

immesys commented Jun 25, 2023

Closing this for now (#412 solves the immediate problem).

@immesys closed this Jun 25, 2023